Suggested Notation for Machine Learning

Data & Domains

| Symbol | Meaning |
| --- | --- |
| $\mathbf{x}$ | Input instance (usually $\in \mathbb{R}^d$) |
| $\mathbf{y}$ | Output / label (usually $\in \mathbb{R}^{d_\text{o}}$) |
| $\mathbf{z}$ | Example pair $(\mathbf{x}, \mathbf{y})$ |
| $d$ | Input dimension |
| $d_{\text{o}}$ | Output dimension |
| $n$ | Number of samples |
| $\mathcal{X}$ | Instance domain (set) |
| $\mathcal{Y}$ | Label domain (set) |
| $\mathcal{Z}$ | Example domain ($\mathcal{X}\times\mathcal{Y}$) |
| $\mathcal{D}$ | Distribution over $\mathcal{Z}$ |
| $S$ | Dataset sample $\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^n$ |

Functions & Models

| Symbol | Meaning |
| --- | --- |
| $\mathcal{H}$ | Hypothesis space |
| $f_{\boldsymbol{\theta}}$ | Hypothesis function (model), $f_{\boldsymbol{\theta}}: \mathcal{X}\to\mathcal{Y}$ |
| $\boldsymbol{\theta}$ | Set of model parameters |
| $f^*$ | Target function (ground truth) |
| $\sigma$ | Activation function (e.g., ReLU, sigmoid) |
| $\ell$ | Loss function $\ell(f_{\boldsymbol{\theta}}(\mathbf{x}), \mathbf{y})$ |

Training & Complexity

| Symbol | Meaning |
| --- | --- |
| $L_S(\boldsymbol{\theta})$ | Empirical risk (training loss) on the set $S$ |
| $L_\mathcal{D}(\boldsymbol{\theta})$ | Population risk (expected loss) |
| $\eta$ | Learning rate |
| $B$ | Batch set |
| $\lvert B \rvert$ | Batch size |
| $\text{GD}$ | Gradient descent |
| $\text{SGD}$ | Stochastic gradient descent |
| $\text{VCdim}(\mathcal{H})$ | VC-dimension of the hypothesis class $\mathcal{H}$ |
| $\text{Rad}_S(\mathcal{H})$ | Rademacher complexity of $\mathcal{H}$ on $S$ |
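To tie the training symbols together, here is a minimal sketch of one SGD update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla L_B(\boldsymbol{\theta})$ on a batch $B$. The linear model and squared loss are illustrative assumptions, not part of the notation:

```python
import numpy as np

def sgd_step(theta, X_batch, Y_batch, grad_fn, eta):
    """One SGD update: theta <- theta - eta * gradient of the batch loss."""
    return theta - eta * grad_fn(theta, X_batch, Y_batch)

# Illustrative model: f_theta(x) = theta . x with squared loss (theta.x - y)^2
def squared_loss_grad(theta, X, Y):
    residual = X @ theta - Y              # predictions minus labels, shape (|B|,)
    return 2.0 * X.T @ residual / len(Y)  # gradient averaged over the batch B

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))               # one batch B with |B| = 8 examples, d = 3
Y = X @ np.array([1.0, -2.0, 0.5])        # labels from a linear target function f*
theta = np.zeros(3)                       # initial parameters
for _ in range(500):                      # repeated updates on the same batch
    theta = sgd_step(theta, X, Y, squared_loss_grad, eta=0.1)
```

With a single fixed batch this reduces to plain GD; true SGD would resample $B$ from $S$ at every step.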

Neural Network Specifics

| Symbol | Meaning |
| --- | --- |
| $m$ | Number of neurons in a hidden layer |
| $L$ | Total number of layers (excluding input) |
| $\mathbf{w}_j, b_j$ | Weight vector and bias (scalar) for a specific neuron $j$ |
| $\mathbf{W}^{[l]}$ | Weight matrix for layer $l$ |
| $\mathbf{b}^{[l]}$ | Bias vector for layer $l$ |
| $f^{[l]}$ | Output of layer $l$ |
| $\circ$ | Entry-wise operation (e.g., entry-wise application of $\sigma$; for two vectors, the Hadamard product) |
| $*$ | Convolution operation |

Key Formula Reference

Empirical Risk: $$ L_S(\boldsymbol{\theta})=\frac{1}{n}\sum^n_{i=1}\ell(f_{\boldsymbol{\theta}}(\mathbf{x}_i),\mathbf{y}_i) $$
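As a sanity check of the formula, a minimal Python sketch; the linear model and squared loss are illustrative assumptions:

```python
import numpy as np

def empirical_risk(f, theta, S, loss):
    """L_S(theta): average loss of f_theta over the dataset S."""
    return sum(loss(f(theta, x), y) for x, y in S) / len(S)

# Illustrative choices: linear model f_theta(x) = theta . x, squared loss
f = lambda theta, x: float(np.dot(theta, x))
squared_loss = lambda prediction, y: (prediction - y) ** 2

S = [(np.array([1.0, 0.0]), 1.0),   # (x_1, y_1)
     (np.array([0.0, 1.0]), 0.0)]   # (x_2, y_2)
theta = np.array([1.0, 1.0])
print(empirical_risk(f, theta, S, squared_loss))  # (0.0 + 1.0) / 2 = 0.5
```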

2-Layer Network: $$ f_{\boldsymbol{\theta}}(\mathbf{x})=\sum^m_{j=1}a_j\sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j) $$
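The sum over $m$ hidden neurons can be written as a single matrix-vector product. A minimal sketch, assuming $\sigma = \text{ReLU}$ and hand-picked illustrative parameter values:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)  # sigma: ReLU activation

def two_layer_net(x, W, b, a, sigma=relu):
    """f_theta(x) = sum_j a_j * sigma(w_j . x + b_j).
    W has shape (m, d): row j is w_j; b and a both have shape (m,)."""
    return a @ sigma(W @ x + b)

# m = 2 hidden neurons, d = 2 input dimension (illustrative values)
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([0.0, -1.0])
a = np.array([1.0, 2.0])
x = np.array([3.0, 2.0])
print(two_layer_net(x, W, b, a))  # 1*relu(3) + 2*relu(1) = 5.0
```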

General Deep Network (Recursive): $$ f^{[0]}_{\boldsymbol{\theta}}(\mathbf{x})=\mathbf{x},\qquad f^{[l]}_{\boldsymbol{\theta}}(\mathbf{x})=\sigma\circ\left(\mathbf{W}^{[l-1]}f^{[l-1]}_{\boldsymbol{\theta}}(\mathbf{x})+\mathbf{b}^{[l-1]}\right) $$
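The recursion translates directly into a loop over layers. A minimal sketch with illustrative random weights, assuming $\sigma = \text{ReLU}$:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)  # sigma applied entry-wise

def deep_net(x, Ws, bs, sigma=relu):
    """f[0] = x; f[l] = sigma(W[l-1] @ f[l-1] + b[l-1]) for l = 1, ..., L."""
    f = x
    for W, b in zip(Ws, bs):
        f = sigma(W @ f + b)
    return f

# Illustrative shapes: d = 4 inputs, hidden widths 5 and 3, output d_o = 2
rng = np.random.default_rng(0)
widths = [4, 5, 3, 2]
Ws = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(3)]
bs = [np.zeros(widths[l + 1]) for l in range(3)]
out = deep_net(rng.normal(size=4), Ws, bs)
print(out.shape)  # (2,)
```

Note that, reading the recursion literally, $\sigma$ is applied at every layer including the last; in practice the output layer is often left linear.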

Credit: Adapted from Suggested Notation for Machine Learning