✂️ 1. Be Careful When Splitting Data
When splitting your dataset into train, validation, and test sets, be mindful of how the data is ordered.
- Example: If all samples for Class 1 sit at the front of the dataset and all samples for Class 2 at the end, a naive sequential split could put only Class 1 in the training set and leave Class 2 out entirely.
- ✅ Solution: Use stratified sampling (e.g. the `stratify` argument of scikit-learn's `train_test_split`) to preserve the class proportions in each split, as in the sketch below.
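A minimal sketch of a stratified split, assuming scikit-learn is available; the ordered, imbalanced toy arrays are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 samples, 2 features (placeholder data)
y = np.array([0] * 70 + [1] * 30)    # ordered and imbalanced: all class 0 first

# stratify=y keeps the 70/30 class ratio in both splits;
# without it, a sequential split of this ordered data would miss class 1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_train), np.bincount(y_test))  # e.g. [56 24] [14 6]
```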
📊 2. Typical Train/Test Split
- A common rule of thumb is 80% train / 20% test, but this depends on your dataset’s size and noise level.
- Smaller or noisier datasets may call for a different ratio, or for cross-validation instead of a single fixed split (see the sketch below).
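For instance, a sketch of 5-fold cross-validation on a small synthetic dataset; the model and data here are placeholders, not a recommendation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                   # small synthetic dataset
y = (X[:, 0] + rng.normal(scale=0.5, size=50) > 0).astype(int) # noisy binary labels

# 5-fold CV: every point is used for validation exactly once,
# which wastes less data than holding out a fixed 20%.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())
```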
🔄 3. Leave-One-Out Cross-Validation (LOOCV)
- Example: If you have 11 data points, you can:
  - Train on 10 points, test on the 1 held-out point.
  - Repeat this process 11 times, leaving out a different point each time.
  - Average the error across all 11 runs for a more robust estimate (see the sketch below).
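A sketch of this 11-point LOOCV using scikit-learn's `LeaveOneOut`; the synthetic linear data is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 11).reshape(-1, 1)          # 11 data points
y = 2.0 * X.ravel() + rng.normal(scale=1.0, size=11)

# Each of the 11 rounds trains on 10 points and tests on the one left out.
mse = -cross_val_score(LinearRegression(), X, y,
                       cv=LeaveOneOut(),
                       scoring="neg_mean_squared_error")
print(f"LOOCV estimate of test MSE: {mse.mean():.3f}")  # average over 11 runs
```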
⚖️ 4. Weak Law of Large Numbers
- The sample mean of a large number of independent and identically distributed (i.i.d.) variables converges in probability to the expected value.
- In simple terms: as your sample grows, the probability that its average differs from the true mean by more than any fixed amount shrinks toward zero (illustrated numerically below).
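A quick NumPy simulation of this; the exponential distribution and the sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0
samples = rng.exponential(scale=true_mean, size=100_000)  # i.i.d. draws, mean 2.0

# Sample means computed over ever-larger prefixes drift toward the true mean.
for n in (10, 100, 1_000, 100_000):
    print(f"n={n:>6}: sample mean = {samples[:n].mean():.4f} "
          f"(true mean = {true_mean})")
```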
📉 5. Expected Loss
- By the weak law of large numbers, the average loss over many independent samples converges to the expected loss.
- The expected loss is the theoretical average loss you would incur over the entire population, not just over your finite sample (the sketch below compares the two).
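One way to see this concretely: for a constant prediction c under squared loss on Gaussian data, the expected loss has the closed form sigma² + (mu − c)², and the empirical average loss approaches it as the sample grows. The Gaussian setup and the constant predictor here are purely illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c = 1.0, 2.0, 0.5
expected_loss = sigma**2 + (mu - c)**2      # theoretical value: 4.25

y = rng.normal(loc=mu, scale=sigma, size=1_000_000)
empirical_loss = np.mean((y - c) ** 2)      # average loss over the sample

print(f"expected loss:  {expected_loss:.4f}")
print(f"empirical loss: {empirical_loss:.4f}")  # close to 4.25 for large samples
```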