✂️ 1. Be Careful When Splitting Data
When splitting your dataset into train, validation, and test sets, be mindful of how the data is ordered.
- Example: If all samples for Class 1 sit at the front of the dataset and all samples for Class 2 at the end, a naive sequential split could put only Class 1 in the training set and leave Class 2 out entirely.
- ✅ Solution: Use stratified sampling (e.g. the `stratify` argument of scikit-learn's `train_test_split`) to preserve the class proportions in each split, as in the sketch below.
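A minimal sketch of a stratified split, assuming scikit-learn is available; the ordered, imbalanced toy arrays are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 samples, 2 features (placeholder data)
y = np.array([0] * 70 + [1] * 30)    # ordered and imbalanced: all class 0 first

# stratify=y keeps the 70/30 class ratio in both splits;
# without it, a sequential split of this ordered data would miss class 1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_train), np.bincount(y_test))  # e.g. [56 24] [14 6]
```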
📊 2. Typical Train/Test Split
- A common rule of thumb is 80% train / 20% test, but this depends on your dataset’s size and noise level.
- Smaller or noisier datasets may call for a different ratio, or for cross-validation instead of a single fixed split (see the sketch below).
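For instance, a sketch of 5-fold cross-validation on a small synthetic dataset; the model and data here are placeholders, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                   # small synthetic dataset
y = (X[:, 0] + rng.normal(scale=0.5, size=50) > 0).astype(int) # noisy binary labels

# 5-fold CV: every point is used for validation exactly once,
# which wastes less data than holding out a fixed 20%.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())
```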
🔄 3. Leave-One-Out Cross-Validation (LOOCV)
- Example: If you have 11 data points, you can:
  - Train on 10 points, test on the 1 held-out point.
  - Repeat this process 11 times, leaving out a different point each time.
  - Average the error across all 11 runs for a more robust estimate (see the sketch below).
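A sketch of this 11-point LOOCV using scikit-learn's `LeaveOneOut`; the synthetic linear data is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 11).reshape(-1, 1)          # 11 data points
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=11)

# Each of the 11 rounds trains on 10 points and tests on the one left out.
mse = -cross_val_score(LinearRegression(), X, y,
                       cv=LeaveOneOut(),
                       scoring="neg_mean_squared_error")
print(f"LOOCV estimate of test MSE: {mse.mean():.3f}")  # average over 11 runs
```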
⚖️ 4. Weak Law of Large Numbers
- The sample mean of a large number of independent and identically distributed (i.i.d.) variables converges in probability to the expected value.
- In simple terms: as your sample grows, the probability that its average differs from the true mean by more than any fixed amount shrinks toward zero (illustrated numerically below).
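A quick NumPy simulation of this; the exponential distribution and the sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0
samples = rng.exponential(scale=true_mean, size=100_000)  # i.i.d. draws, mean 2.0

# Sample means computed over ever-larger prefixes drift toward the true mean.
for n in (10, 100, 1_000, 100_000):
    print(f"n={n:>6}: sample mean = {samples[:n].mean():.4f} "
          f"(true mean = {true_mean})")
```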
📉 5. Expected Loss
- By the weak law of large numbers, the average loss over many independent samples converges to the expected loss.
- The expected loss is the theoretical average loss you would incur over the entire population, not just over your finite sample (the sketch below compares the two).
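One way to see this concretely: for a constant prediction c under squared loss on Gaussian data, the expected loss has the closed form sigma² + (mu − c)², and the empirical average loss approaches it as the sample grows. The Gaussian setup and the constant predictor here are purely illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c = 1.0, 2.0, 0.5
expected_loss = sigma**2 + (mu - c)**2      # theoretical value: 4.25

y = rng.normal(loc=mu, scale=sigma, size=1_000_000)
empirical_loss = np.mean((y - c) ** 2)      # average loss over the sample

print(f"expected loss:  {expected_loss:.4f}")
print(f"empirical loss: {empirical_loss:.4f}")  # close to 4.25 for large samples
```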