====== Data Preprocessing ======

So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild, we must extract messy data stored in arbitrary formats and preprocess it to suit our needs. Fortunately, the //pandas// library can do much of the heavy lifting. This section will give you a crash course on some of the most common routines.

===== Reading the Dataset =====

Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data. In them, each line corresponds to one record and consists of several fields separated by commas. To demonstrate how to load CSV files with `pandas`, we create a CSV file below at `../data/house_tiny.csv`. This file represents a dataset of homes with columns for the number of rooms (`NumRooms`), the roof type (`RoofType`), and the price (`Price`).

```python
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
```

Now let's import `pandas` and load the dataset with `read_csv`.

```python
import pandas as pd

data = pd.read_csv(data_file)
print(data)
```

===== Data Preparation =====

In supervised learning, we train models to predict a designated //target// value given some set of //input// values. Our first step is to separate these. We can select columns by name or via integer-location based indexing (`iloc`).

==== Handling Missing Values ====

You might have noticed that `pandas` replaced all `NA` entries with `NaN` (//not a number//). These are **missing values**. Depending on the context, missing values are handled via //imputation// (replacing them) or //deletion// (discarding them).

**1. Categorical Imputation**

For categorical input fields, we can treat `NaN` as a distinct category. By using //one-hot encoding//, `pandas` converts a column like `RoofType` into multiple columns such as `RoofType_Slate` and `RoofType_nan`.

```python
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
```

**2. Numerical Imputation**

For missing numerical values, a common heuristic is to replace `NaN` entries with the **mean value** of the corresponding column.

```python
inputs = inputs.fillna(inputs.mean())
print(inputs)
```

===== Conversion to the Tensor Format =====

Now that all the entries in `inputs` and `targets` are numerical, we can load them into a tensor format for our deep learning framework.

```python
# MXNet
from mxnet import np

X = np.array(inputs.to_numpy(dtype=float))
y = np.array(targets.to_numpy(dtype=float))
```

```python
# PyTorch
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
```

```python
# TensorFlow
import tensorflow as tf

X = tf.constant(inputs.to_numpy(dtype=float))
y = tf.constant(targets.to_numpy(dtype=float))
```

```python
# JAX
from jax import numpy as jnp

X = jnp.array(inputs.to_numpy(dtype=float))
y = jnp.array(targets.to_numpy(dtype=float))
```

===== Discussion =====

While this crash course kept things simple, real-world data processing involves myriad data types beyond categorical and numeric, such as text strings, images, and audio. Moreover, datasets are often plagued by outliers and recording errors. Data visualization tools such as **seaborn**, **matplotlib**, or **Bokeh** can help you manually inspect data and develop intuitions about the problems you may need to address.
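To make this concrete, here is a minimal inspection sketch using `pandas` and `matplotlib` (assuming `matplotlib` is installed and `data` still refers to the DataFrame loaded above). It is one possible way to eyeball a dataset, not a required step of the pipeline described in this section.

```python
# A quick, manual look at the toy dataset loaded above.
# Assumes matplotlib is installed and `data` is the DataFrame from read_csv.
import matplotlib.pyplot as plt

print(data.describe())     # summary statistics for the numerical columns
print(data.isna().sum())   # number of missing entries per column

# A simple scatter plot can surface outliers or recording errors at a glance.
plt.scatter(data['NumRooms'], data['Price'])
plt.xlabel('NumRooms')
plt.ylabel('Price')
plt.show()
```

Even a two-column plot like this often reveals problems, such as implausible prices or room counts, long before a model does.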
===== Exercises =====

1. Try loading a dataset from the UCI Machine Learning Repository (e.g., Abalone) and inspect its properties.
2. Try indexing and selecting data columns by name rather than by column number.
3. How large a dataset do you think you could load this way? What are the limitations regarding memory footprint?
4. How would you deal with data that has a very large number of categories?
5. What alternatives to pandas exist? Check out **Pillow** for images or loading **NumPy** tensors directly from a file.

[Discussions](https://discuss.d2l.ai/t/28)

[Notebook](https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_preliminaries/pandas.ipynb)