Hyperparameters
What value of K do we use? Which distance metric? How do we actually make these choices? These are what we call hyperparameters - choices about the algorithm that we set rather than learn. There's no way to learn them directly from the data. So how do you set these hyperparameters in practice? They turn out to be very problem-dependent: you try different values and see what works best. But what does "works best" mean? Several ideas:
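As a quick illustration, here is a minimal sketch (using scikit-learn's `KNeighborsClassifier` on made-up toy arrays) of what it means for K and the distance metric to be hyperparameters: they are passed in up front, and `fit` never changes them.

```python
# Minimal sketch: K and the distance metric are set by us, not learned.
# The toy arrays here are made up purely for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(100, 2)             # toy 2-D features
y_train = np.random.randint(0, 3, size=100)  # toy labels in {0, 1, 2}

# n_neighbors (K) and metric are hyperparameters - fixed choices.
knn = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
knn.fit(X_train, y_train)  # "training" just memorizes the data
```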
- Choose the hyperparameters that work best on the data you have. (TERRIBLE IDEA)
    - Example: with k = 1, a nearest-neighbor classifier scores 100% accuracy on its own training data, but generalizes horribly to new data.
- Split data into train/test, and choose the hyperparameters that do best on the test set. (BAD IDEA)
    - Training - train the different algorithms with different hyperparameters.
    - Test - meant to be an estimate of how the method will do on unseen data; once it is used for tuning, it no longer measures that.
- Split data into train/validation/test (Better!)
    - Training - train the different algorithms with different hyperparameters.
    - Validation - compare the different algorithms' performance and tune hyperparameters for the greatest accuracy.
    - Test - test at the very last second to confirm good performance on "unseen data." The classifier with the set of hyperparameters that performed best on validation is the one used, and its accuracy on this set is the ultimate number that gets reported (a sketch follows this list).
    - "Evaluate on the test set only a single time, at the very end."
- Cross-Validation
    - Used more commonly for small datasets - not so much in deep learning, where training is too computationally expensive.
    - Split the data into folds, try each fold as the validation set, and average the results.
    - Cycle through choosing which fold will be the validation set (see the second sketch below):
        - Train on folds 1-4, validate on fold 5.
        - Train on folds 1-3 and fold 5, validate on fold 4.
        - Train on folds 1-2 and folds 4-5, validate on fold 3. → so forth
    - Test once at the very end.
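To make the train/validation/test protocol concrete, here is a minimal sketch in Python (the arrays, split ratios, and candidate K values are made up for the example):

```python
# Sketch of train/validation/test tuning for k-NN (toy data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(500, 10)            # placeholder features
y = np.random.randint(0, 2, size=500)  # placeholder labels

# Carve out the test set first, then a validation set from the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9]:                 # arbitrary candidate values
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_val, y_val)         # tune on validation only
    if acc > best_acc:
        best_k, best_acc = k, acc

# Evaluate on the test set only a single time, at the very end.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("chosen k:", best_k, "test accuracy:", final.score(X_test, y_test))
```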
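And a second sketch of k-fold cross-validation, here via scikit-learn's `cross_val_score` with 5 folds (again on made-up data). Each candidate K is trained on four folds and validated on the held-out fifth, cycling through all folds; the test set would still be kept aside and touched only once at the end.

```python
# Sketch of 5-fold cross-validation for choosing K (toy data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(500, 10)            # placeholder features
y = np.random.randint(0, 2, size=500)  # placeholder labels

for k in [1, 3, 5, 7, 9]:              # arbitrary candidate values
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)  # one accuracy per fold
    print(k, scores.mean())                    # average over the folds
```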
One caveat: if the test set (or new data in the wild) isn't drawn from the same probability distribution as the training data, performance on it won't be representative of how the classifier behaves out in the wild. Thus the underlying assumption is that all your data is independent and identically distributed (i.i.d.).