The Universal Principle of Model Evolution: Harnessing Knowledge Distillation, Model Compression, and Rule Extraction
The ML Metamorphosis
Machine learning (ML) model training typically follows a familiar pipeline: start with data collection, clean and prepare it, then move on to model fitting. But what if we could take this process further? Just as some insects undergo dramatic transformations before reaching maturity, ML models can evolve in a similar way — what I will call the ML metamorphosis. This process involves chaining different models together, resulting in a final model that achieves significantly better quality than if it had been trained directly from the start.
How It Works
- Start with some initial knowledge, Data 1.
- Train an ML model, Model A (say, a neural network), on this data.
- Generate new data, Data 2, using Model A.
- Finally, use Data 2 to fit your target model, Model B.
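In scikit-learn terms, the four steps can be written as a small helper. The function below is only a minimal sketch of the chain, assuming both models follow the usual fit/predict estimator interface and that a pool of unlabelled inputs is available for generating Data 2.

```python
import numpy as np

def ml_metamorphosis(model_a, model_b, X_labeled, y_labeled, X_unlabeled):
    """Chain Model A into Model B via generated labels (a minimal sketch)."""
    # Steps 1-2: train Model A on the initial knowledge (Data 1).
    model_a.fit(X_labeled, y_labeled)
    # Step 3: generate new data (Data 2) by letting Model A label unseen inputs.
    y_generated = model_a.predict(X_unlabeled)
    # Step 4: fit the target Model B on the original plus generated data.
    X_2 = np.vstack([X_labeled, X_unlabeled])
    y_2 = np.concatenate([y_labeled, y_generated])
    model_b.fit(X_2, y_2)
    return model_b
```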
Example: ML Metamorphosis on the MNIST Dataset
Imagine you’re tasked with training a multi-class decision tree on the MNIST dataset of handwritten digit images, but only 1,000 images are labelled. You could train the tree directly on this limited data, but the accuracy would be capped at around 0.67. Alternatively, you could use ML metamorphosis: train a neural network (Model A) on the 1,000 labelled images, let it label the remaining, unlabelled images, and then fit the decision tree (Model B) on the resulting, much larger training set.
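The sketch below reproduces this setup, assuming scikit-learn is installed (fetch_openml downloads MNIST on first use). The network architecture and hyperparameters here are illustrative assumptions, and the exact accuracies will vary with the random split.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0

# Pretend only 1,000 images are labelled; the rest form an unlabelled pool.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10_000, random_state=0)
X_labeled, X_unlabeled = X_train[:1000], X_train[1000:]
y_labeled = y_train[:1000]

# Baseline: a decision tree trained directly on the 1,000 labelled images.
tree_direct = DecisionTreeClassifier(random_state=0).fit(X_labeled, y_labeled)
print("direct tree:", accuracy_score(y_test, tree_direct.predict(X_test)))

# Metamorphosis: Model A (a neural network) labels the unlabelled pool...
model_a = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50, random_state=0)
model_a.fit(X_labeled, y_labeled)
y_generated = model_a.predict(X_unlabeled)

# ...and Model B (the decision tree) is fit on the enlarged training set.
X_2 = np.vstack([X_labeled, X_unlabeled])
y_2 = np.concatenate([y_labeled, y_generated])
tree_meta = DecisionTreeClassifier(random_state=0).fit(X_2, y_2)
print("metamorphosis tree:", accuracy_score(y_test, tree_meta.predict(X_test)))
```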
Techniques Behind ML Metamorphosis
1. Knowledge Distillation (2015)
Knowledge distillation transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). Instead of hard class labels alone, the student is trained on a transfer set with soft labels: the teacher’s full output distribution, typically softened with a temperature.
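As an illustration, the loss below is a minimal PyTorch sketch of the distillation objective from [1]: the student matches the teacher’s temperature-softened probabilities on the transfer set while still fitting the hard labels. The temperature T and the mixing weight alpha are hyperparameters chosen here arbitrarily.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-label + hard-label loss for training the student (a sketch)."""
    # Soft-label term: KL divergence between temperature-scaled distributions,
    # scaled by T^2 as in Hinton et al. [1].
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```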
2. Model Compression (2006)
Model compression approximates the feature distribution to build a large transfer set, labels it with the large Model A, and uses it to train a smaller, more efficient Model B.
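A hedged sketch of the idea follows: here the feature distribution is approximated by simply jittering unlabelled inputs with Gaussian noise (a stand-in for the more careful pseudo-data generation in the original work), the noise scale and number of copies are arbitrary assumptions, and a small tree plays the role of Model B.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def compress(model_a, X_unlabeled, noise_scale=0.05, copies=5, seed=0):
    """Train a small Model B on a synthetic transfer set labelled by Model A."""
    rng = np.random.default_rng(seed)
    # Approximate the feature distribution by perturbing real unlabelled inputs.
    X_transfer = np.vstack([
        X_unlabeled + rng.normal(scale=noise_scale, size=X_unlabeled.shape)
        for _ in range(copies)
    ])
    # The large Model A provides the labels for the transfer set.
    y_transfer = model_a.predict(X_transfer)
    # Model B: a much smaller, faster model fit to mimic Model A.
    model_b = DecisionTreeClassifier(max_depth=10, random_state=seed)
    return model_b.fit(X_transfer, y_transfer)
```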
3. Rule Extraction (1995)
Rule extraction trains an interpretable Model B to mimic the behavior of an opaque Model A so that human-readable rules can be derived from it.
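For instance, the snippet below is a sketch using scikit-learn, with the Iris data and a random forest chosen purely for illustration: the forest stands in for the opaque Model A, a shallow decision tree is fit to its predictions, and the tree’s branches are printed as rules.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# Opaque Model A: any black-box classifier would do.
model_a = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Interpretable Model B is trained on Model A's predictions, not the raw labels.
model_b = DecisionTreeClassifier(max_depth=3, random_state=0)
model_b.fit(X, model_a.predict(X))

# Human-readable rules derived from the surrogate tree.
print(export_text(model_b, feature_names=feature_names))
```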
Conclusion
ML metamorphosis isn’t always necessary, but when labelled data is scarce or the target model must be small or interpretable, chaining models can yield significantly better results than training the target model directly on the original data.
References
[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531 (2015).
[2] “Introducing Llama 3.2.” Meta AI (2024).
[3] Sanh, Victor, et al. “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter.” arXiv preprint arXiv:1910.01108 (2019).
[4] Yin, Tianwei, et al. “One-step diffusion with distribution matching distillation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.