| Symbol | Meaning |
|---|---|
| $\mathbf{x}$ | Input instance (usually $\in \mathbb{R}^d$) |
| $\mathbf{y}$ | Output / Label (usually $\in \mathbb{R}^{d_\text{o}}$) |
| $\mathbf{z}$ | Example pair $(\mathbf{x}, \mathbf{y})$ |
| $d$ | Input dimension |
| $d_{\text{o}}$ | Output dimension |
| $n$ | Number of samples |
| $\mathcal{X}$ | Instance domain (set) |
| $\mathcal{Y}$ | Label domain (set) |
| $\mathcal{Z}$ | Example domain ($\mathcal{X}\times\mathcal{Y}$) |
| $\mathcal{D}$ | Distribution over $\mathcal{Z}$ |
| $S$ | Dataset sample $\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^n$ |

| Symbol | Meaning |
|---|---|
| $\mathcal{H}$ | Hypothesis space |
| $f_{\mathbf{\theta}}$ | Hypothesis function (Model) $f: \mathcal{X}\to\mathcal{Y}$ |
| $\mathbf{\theta}$ | Set of model parameters |
| $f^*$ | Target function (Ground truth) |
| $\sigma$ | Activation function (e.g., ReLU, sigmoid) |
| $\ell$ | Loss function $\ell(f_{\mathbf{\theta}}(\mathbf{x}), \mathbf{y})$ |

| Symbol | Meaning |
|---|---|
| $L_S(\mathbf{\theta})$ | Empirical Risk (Training Loss) on set $S$ |
| $L_\mathcal{D}(\mathbf{\theta})$ | Population Risk (Expected Loss) |
| $\eta$ | Learning rate |
| $B$ | Mini-batch (a subset of $S$) |
| $|B|$ | Batch size |
| $\text{GD}$ | Gradient Descent |
| $\text{SGD}$ | Stochastic Gradient Descent |
| $\text{VCdim}(\mathcal{H})$ | VC-dimension of hypothesis class |
| $\text{Rad}_S(\mathcal{H})$ | Rademacher complexity on $S$ |

| Symbol | Meaning |
|---|---|
| $m$ | Number of neurons in a hidden layer |
| $L$ | Total number of layers (excluding input) |
| $\mathbf{w}_j, b_j$ | Weight vector and (scalar) bias of neuron $j$ |
| $\mathbf{W}^{[l]}$ | Weight matrix for layer $l$ |
| $\mathbf{b}^{[l]}$ | Bias vector for layer $l$ |
| $f^{[l]}$ | Output of layer $l$ |
| $\circ$ | Entry-wise operation (e.g., entry-wise application of $\sigma$, Hadamard product) |
| $*$ | Convolution operation |
Empirical Risk: $$ L_S(\mathbf{\theta})=\frac{1}{n}\sum^n_{i=1}\ell(f_{\mathbf{\theta}}(\mathbf{x}_i),\mathbf{y}_i) $$
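As a quick illustration, here is a minimal NumPy sketch of this average over the sample $S$, assuming a squared-error loss and an arbitrary model `f`; both choices are illustrative and not fixed by the notation above.

```python
import numpy as np

# Illustrative choice of loss: squared error (the notation leaves ell generic).
def squared_loss(y_pred, y_true):
    return 0.5 * np.sum((y_pred - y_true) ** 2)

def empirical_risk(f, X, Y):
    """L_S(theta): average loss of model f over the n samples (x_i, y_i) in S."""
    n = len(X)
    return sum(squared_loss(f(x), y) for x, y in zip(X, Y)) / n
```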
2-Layer Network: $$ f_{\mathbf{\theta}}(\mathbf{x})=\sum^m_{j=1}a_j\sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j) $$
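A minimal sketch of the two-layer network above, assuming ReLU for $\sigma$ and random parameters purely for the usage example; the array names `a`, `W`, `b` mirror $a_j$, $\mathbf{w}_j$, $b_j$.

```python
import numpy as np

def relu(z):
    # Illustrative choice of activation sigma.
    return np.maximum(z, 0.0)

def two_layer_net(x, a, W, b):
    """f_theta(x) = sum_j a_j * sigma(w_j . x + b_j).

    x: (d,) input, W: (m, d) with rows w_j, b: (m,) biases, a: (m,) output weights.
    """
    return a @ relu(W @ x + b)

# Usage example with d = 3 inputs and m = 5 hidden neurons (random parameters).
rng = np.random.default_rng(0)
d, m = 3, 5
x = rng.normal(size=d)
a, W, b = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)
print(two_layer_net(x, a, W, b))
```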
General Deep Network (Recursive): $$ f^{[0]}_{\mathbf{\theta}}(\mathbf{x})=\mathbf{x}, \qquad f^{[l]}_{\mathbf{\theta}}(\mathbf{x})=\sigma\circ\left(\mathbf{W}^{[l-1]}f^{[l-1]}_{\mathbf{\theta}}(\mathbf{x})+\mathbf{b}^{[l-1]}\right) $$
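A sketch of the recursion above, again assuming ReLU for $\sigma$; keeping the final layer linear (no activation) is a common convention and an assumption here, not something the recursion itself specifies.

```python
import numpy as np

def relu(z):
    # Illustrative choice of activation sigma, applied entry-wise.
    return np.maximum(z, 0.0)

def deep_net(x, Ws, bs):
    """Ws[l], bs[l] play the role of W^[l] and b^[l] for l = 0, ..., L-1."""
    f = x                               # f^[0](x) = x
    for Wl, bl in zip(Ws[:-1], bs[:-1]):
        f = relu(Wl @ f + bl)           # f^[l] = sigma o (W^[l-1] f^[l-1] + b^[l-1])
    return Ws[-1] @ f + bs[-1]          # assumed linear output layer (common convention)
```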
Credit: Adapted from *Suggested Notation for Machine Learning*.