So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild, we must extract messy data stored in arbitrary formats and preprocess it to suit our needs. Fortunately, the `pandas` library can do much of the heavy lifting. This section will give you a crash course on some of the most common routines.
Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. In them, each line corresponds to one record and consists of several fields separated by commas.
To demonstrate how to load CSV files with `pandas`, we create a CSV file below at `../data/house_tiny.csv`. This file represents a dataset of homes with columns for the number of rooms (`NumRooms`), the roof type (`RoofType`), and the price (`Price`).
```python
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
```
Now let's import `pandas` and load the dataset with `read_csv`.
```python
import pandas as pd

data = pd.read_csv(data_file)
print(data)
```
In supervised learning, we train models to predict a designated target value given some set of input values. Our first step is to separate these. We can select columns by name or via integer-location based indexing (`iloc`).
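For illustration, here is a minimal sketch of selection by name on the `data` frame loaded above (the `iloc`-based split appears in the snippet further below):

```python
# Select one column by name; this yields a pandas Series.
prices = data['Price']

# Select several columns by name; this yields a DataFrame.
features = data[['NumRooms', 'RoofType']]
```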
You might have noticed that `pandas` replaced all `NA` entries with `NaN` (not a number). These are missing values. Depending upon the context, missing values are handled via imputation (replacing them) or deletion (discarding them).
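To illustrate the deletion route, which we do not pursue in the rest of this section, here is a minimal sketch using `DataFrame.dropna`:

```python
# Deletion: discard any row containing a missing value.
# On this tiny dataset only the complete record (4, Slate, 178100) survives.
rows_kept = data.dropna()

# Alternatively, discard any column containing a missing value.
cols_kept = data.dropna(axis=1)
```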
1. **Categorical imputation.** For categorical input fields, we can treat `NaN` as a distinct category. By using one-hot encoding, `pandas` converts a column like `RoofType` into multiple columns such as `RoofType_Slate` and `RoofType_nan`.
```python
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
```
2. **Numerical imputation.** For missing numerical values, a common heuristic is to replace `NaN` entries with the mean value of the corresponding column.
```python
inputs = inputs.fillna(inputs.mean())
print(inputs)
```
Now that all the entries in `inputs` and `targets` are numerical, we can load them into a tensor format for our deep learning framework. The snippets below show the conversion for MXNet, PyTorch, TensorFlow, and JAX, respectively.
```python
# MXNet
from mxnet import np

X = np.array(inputs.to_numpy(dtype=float))
y = np.array(targets.to_numpy(dtype=float))
```
```python
# PyTorch
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
```
```python
# TensorFlow
import tensorflow as tf

X = tf.constant(inputs.to_numpy(dtype=float))
y = tf.constant(targets.to_numpy(dtype=float))
```
```python
# JAX
from jax import numpy as jnp

X = jnp.array(inputs.to_numpy(dtype=float))
y = jnp.array(targets.to_numpy(dtype=float))
```
While this crash course kept things simple, real-world data processing involves myriad data types beyond categorical and numeric, such as text strings, images, and audio. Moreover, datasets are often plagued by outliers and recording errors.
Data visualization tools such as seaborn, matplotlib, or Bokeh can help you manually inspect data and develop intuitions about the problems you may need to address.
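For example, a quick histogram of the target column with matplotlib (a sketch assuming the `data` frame loaded above) can reveal skew or outliers:

```python
import matplotlib.pyplot as plt

# Inspect the distribution of house prices for skew or outliers.
plt.hist(data['Price'], bins=4)
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()
```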
1. Try loading a dataset from the UCI Machine Learning Repository (e.g., Abalone) and inspect its properties.
2. Try indexing and selecting data columns by name rather than by column number.
3. How large a dataset do you think you could load this way? What are the limitations regarding memory footprint?
4. How would you deal with data that has a very large number of categories?
5. What alternatives to `pandas` exist? Check out **Pillow** for images or loading **NumPy** tensors directly from a file.
[Discussions](https://discuss.d2l.ai/t/28) [Notebook](https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_preliminaries/pandas.ipynb)