So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild, we must extract messy data stored in arbitrary formats and preprocess it to suit our needs. Fortunately, the `pandas` library can do much of the heavy lifting. This section will give you a crash course on some of the most common routines.
Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. In them, each line corresponds to one record and consists of several fields separated by commas.
To demonstrate how to load CSV files with `pandas`, we create a CSV file below at `../data/house_tiny.csv`. This file represents a dataset of homes with columns for the number of rooms (`NumRooms`), the roof type (`RoofType`), and the price (`Price`).
```python
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
```
Now let's import `pandas` and load the dataset with `read_csv`.
```python
import pandas as pd

data = pd.read_csv(data_file)
print(data)
```
In supervised learning, we train models to predict a designated target value given some set of input values. Our first step is to separate these. We can select columns by name or via integer-location based indexing (`iloc`).
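For illustration, here is a minimal sketch of selection by name on the `data` frame loaded above (the `iloc`-based split appears in the snippet further below):

```python
# Select one column by name; this yields a pandas Series.
prices = data['Price']

# Select several columns by name; this yields a DataFrame.
features = data[['NumRooms', 'RoofType']]
```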
You might have noticed that `pandas` replaced all `NA` entries with `NaN` (not a number). These are missing values. Depending upon the context, missing values are handled via imputation (replacing them) or deletion (discarding them).
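To illustrate the deletion route, which we do not pursue in the rest of this section, here is a minimal sketch using `DataFrame.dropna`:

```python
# Deletion: discard any row containing a missing value.
# On this tiny dataset only the complete record (4, Slate, 178100) survives.
rows_kept = data.dropna()

# Alternatively, discard any column containing a missing value.
cols_kept = data.dropna(axis=1)
```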
1. **Categorical imputation.** For categorical input fields, we can treat `NaN` as a distinct category. By using one-hot encoding, `pandas` converts a column like `RoofType` into multiple columns such as `RoofType_Slate` and `RoofType_nan`.
```python
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
```
2. **Numerical imputation.** For missing numerical values, a common heuristic is to replace `NaN` entries with the mean value of the corresponding column.
```python
inputs = inputs.fillna(inputs.mean())
print(inputs)
```
Now that all the entries in `inputs` and `targets` are numerical, we can load them into a tensor format for our deep learning framework. The snippets below show the conversion for MXNet, PyTorch, TensorFlow, and JAX, respectively.
```python
# MXNet
from mxnet import np

X = np.array(inputs.to_numpy(dtype=float))
y = np.array(targets.to_numpy(dtype=float))
```
```python
# PyTorch
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
```
```python
# TensorFlow
import tensorflow as tf

X = tf.constant(inputs.to_numpy(dtype=float))
y = tf.constant(targets.to_numpy(dtype=float))
```
```python
# JAX
from jax import numpy as jnp

X = jnp.array(inputs.to_numpy(dtype=float))
y = jnp.array(targets.to_numpy(dtype=float))
```
While this crash course kept things simple, real-world data processing involves myriad data types beyond categorical and numeric, such as text strings, images, and audio. Moreover, datasets are often plagued by outliers and recording errors.
Data visualization tools such as seaborn, matplotlib, or Bokeh can help you manually inspect data and develop intuitions about the problems you may need to address.
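For example, a quick histogram of the target column with matplotlib (a sketch assuming the `data` frame loaded above) can reveal skew or outliers:

```python
import matplotlib.pyplot as plt

# Inspect the distribution of house prices for skew or outliers.
plt.hist(data['Price'], bins=4)
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()
```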
1. Try loading a dataset from the UCI Machine Learning Repository (e.g., Abalone) and inspect its properties.
2. Try indexing and selecting data columns by name rather than by column number.
3. How large a dataset do you think you could load this way? What are the limitations regarding memory footprint?
4. How would you deal with data that has a very large number of categories?
5. What alternatives to `pandas` exist? Check out **Pillow** for images or loading **NumPy** tensors directly from a file.
[Discussions](https://discuss.d2l.ai/t/28) [Notebook](https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_preliminaries/pandas.ipynb)