| Symbol | Meaning |
|---|---|
| $\mathbf{x}$ | Input instance (usually $\in \mathbb{R}^d$) |
| $\mathbf{y}$ | Output / Label (usually $\in \mathbb{R}^{d_\text{o}}$) |
| $\mathbf{z}$ | Example pair $(\mathbf{x}, \mathbf{y})$ |
| $d$ | Input dimension |
| $d_{\text{o}}$ | Output dimension |
| $n$ | Number of samples |
| $\mathcal{X}$ | Instance domain (set) |
| $\mathcal{Y}$ | Label domain (set) |
| $\mathcal{Z}$ | Example domain ($\mathcal{X}\times\mathcal{Y}$) |
| $\mathcal{D}$ | Distribution over $\mathcal{Z}$ |
| $S$ | Dataset sample $\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^n$ |

| Symbol | Meaning |
|---|---|
| $\mathcal{H}$ | Hypothesis space |
| $f_{\mathbf{\theta}}$ | Hypothesis function (Model) $f: \mathcal{X}\to\mathcal{Y}$ |
| $\mathbf{\theta}$ | Set of model parameters |
| $f^*$ | Target function (Ground truth) |
| $\sigma$ | Activation function (e.g., ReLU, sigmoid) |
| $\ell$ | Loss function $\ell(f_{\mathbf{\theta}}(\mathbf{x}), \mathbf{y})$ |

| Symbol | Meaning |
|---|---|
| $L_S(\mathbf{\theta})$ | Empirical Risk (Training Loss) on set $S$ |
| $L_\mathcal{D}(\mathbf{\theta})$ | Population Risk (Expected Loss) |
| $\eta$ | Learning rate |
| $B$ | Mini-batch (a subset of $S$) |
| $|B|$ | Batch size |
| $\text{GD}$ | Gradient Descent |
| $\text{SGD}$ | Stochastic Gradient Descent |
| $\text{VCdim}(\mathcal{H})$ | VC-dimension of hypothesis class |
| $\text{Rad}_S(\mathcal{H})$ | Rademacher complexity on $S$ |

| Symbol | Meaning |
|---|---|
| $m$ | Number of neurons in a hidden layer |
| $L$ | Total number of layers (excluding input) |
| $\mathbf{w}_j, b_j$ | Weight vector and (scalar) bias of neuron $j$ |
| $\mathbf{W}^{[l]}$ | Weight matrix for layer $l$ |
| $\mathbf{b}^{[l]}$ | Bias vector for layer $l$ |
| $f^{[l]}$ | Output of layer $l$ |
| $\circ$ | Entry-wise operation (e.g., entry-wise application of $\sigma$, Hadamard product) |
| $*$ | Convolution operation |
Empirical Risk: $$ L_S(\mathbf{\theta})=\frac{1}{n}\sum^n_{i=1}\ell(f_{\mathbf{\theta}}(\mathbf{x}_i),\mathbf{y}_i) $$
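As a quick illustration, here is a minimal NumPy sketch of this average over the sample $S$, assuming a squared-error loss and an arbitrary model `f`; both choices are illustrative and not fixed by the notation above.

```python
import numpy as np

# Illustrative choice of loss: squared error (the notation leaves ell generic).
def squared_loss(y_pred, y_true):
    return 0.5 * np.sum((y_pred - y_true) ** 2)

def empirical_risk(f, X, Y):
    """L_S(theta): average loss of model f over the n samples (x_i, y_i) in S."""
    n = len(X)
    return sum(squared_loss(f(x), y) for x, y in zip(X, Y)) / n
```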
2-Layer Network: $$ f_{\mathbf{\theta}}(\mathbf{x})=\sum^m_{j=1}a_j\sigma(\mathbf{w}_j\cdot\mathbf{x}+b_j) $$
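A minimal sketch of the two-layer network above, assuming ReLU for $\sigma$ and random parameters purely for the usage example; the array names `a`, `W`, `b` mirror $a_j$, $\mathbf{w}_j$, $b_j$.

```python
import numpy as np

def relu(z):
    # Illustrative choice of activation sigma.
    return np.maximum(z, 0.0)

def two_layer_net(x, a, W, b):
    """f_theta(x) = sum_j a_j * sigma(w_j . x + b_j).

    x: (d,) input, W: (m, d) with rows w_j, b: (m,) biases, a: (m,) output weights.
    """
    return a @ relu(W @ x + b)

# Usage example with d = 3 inputs and m = 5 hidden neurons (random parameters).
rng = np.random.default_rng(0)
d, m = 3, 5
x = rng.normal(size=d)
a, W, b = rng.normal(size=m), rng.normal(size=(m, d)), rng.normal(size=m)
print(two_layer_net(x, a, W, b))
```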
General Deep Network (Recursive): $$ f^{[0]}_{\mathbf{\theta}}(\mathbf{x})=\mathbf{x}, \qquad f^{[l]}_{\mathbf{\theta}}(\mathbf{x})=\sigma\circ\left(\mathbf{W}^{[l-1]}f^{[l-1]}_{\mathbf{\theta}}(\mathbf{x})+\mathbf{b}^{[l-1]}\right) $$
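A sketch of the recursion above, again assuming ReLU for $\sigma$; keeping the final layer linear (no activation) is a common convention and an assumption here, not something the recursion itself specifies.

```python
import numpy as np

def relu(z):
    # Illustrative choice of activation sigma, applied entry-wise.
    return np.maximum(z, 0.0)

def deep_net(x, Ws, bs):
    """Ws[l], bs[l] play the role of W^[l] and b^[l] for l = 0, ..., L-1."""
    f = x                               # f^[0](x) = x
    for Wl, bl in zip(Ws[:-1], bs[:-1]):
        f = relu(Wl @ f + bl)           # f^[l] = sigma o (W^[l-1] f^[l-1] + b^[l-1])
    return Ws[-1] @ f + bs[-1]          # assumed linear output layer (common convention)
```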
Credit: Adapted from *Suggested Notation for Machine Learning*.