Let's define both concepts:
Overfitting
The model learns the training data too well, including its noise and outliers. Those patterns exist only in the training data, so performance on unseen data will be poor.
2/10
Performance on Training Data: By definition, an overfitted model performs very well on the training data.
Performance on Test Data: The test data acts as unseen data, so if we have separated it properly, the overfitted model will perform poorly on it.
3/10
Performance on New Data: An overfitted model will also perform poorly on new, unseen data, for the same reason.
An overfitted model works well only on the training data, so we can spot a bad model early, at the testing stage.
4/10
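To see this in numbers, here is a minimal sketch (a toy scikit-learn setup I'm assuming just for illustration): an unconstrained decision tree on a small, noisy dataset memorizes the training set and scores noticeably worse on the held-out test set.

```python
# A minimal overfitting sketch (toy example, assumed setup).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small, noisy dataset: about 20% of the labels are flipped.
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree memorizes the training set, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # close to 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```

Constraining the model (for example, setting max_depth on the tree) usually narrows that train-test gap.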
Data Leakage
Data leakage happens when information from outside the training set contaminates the model. The source of this 'infection' is usually the test set: the model 'sees into the future' and learns information it should have no access to.
5/10
Performance on Training Data: Like an overfitted model, a model trained with leaked data will perform well.
Performance on Test Data: Here is the difference! With leakage, the model performs well at this stage too, because it has already learned aspects of the test set.
6/10
The test set is supposed to act as an unseen dataset, but here the model has had some form of access to it.
Performance on New Data: In most cases, performance on new data drops drastically. The good scores came from the leaked information, which is not available here.
7/10
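Here is a minimal sketch of what leakage can look like in code (a toy scikit-learn example, assumed for illustration): the features are pure noise, so no honest model should beat ~50% accuracy, yet selecting features on the full dataset before splitting makes the test score look great.

```python
# A minimal data-leakage sketch via feature selection (toy example, assumed setup).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))      # random features: no real signal
y = rng.integers(0, 2, size=200)      # random labels

# LEAKY: pick "informative" features using ALL rows (test rows included),
# then split. The selector has already peeked at the test labels.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("leaky test accuracy:", leaky.score(X_te, y_te))   # inflated: the selector saw the test rows

# CLEAN: split first, fit the selector on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
selector = SelectKBest(f_classif, k=20).fit(X_tr, y_tr)
clean = LogisticRegression(max_iter=1000).fit(selector.transform(X_tr), y_tr)
print("clean test accuracy:", clean.score(selector.transform(X_te), y_te))  # around 0.5, as it should be
```

The fix is always the same: split first, and fit every preprocessing or selection step on the training data only.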
If your model looks too good to be true after testing, you probably have leakage!
Data leakage is more dangerous than overfitting because it gives you falsely high performance during both the training and testing phases.
8/10
Did you like this post?
Hit that follow button and show your support.
It literally takes 1 second for you but makes me 10x happier.
Thanks 😉
9/10
If you haven't already, join our newsletter DSBoost.
We share:
• Interviews
• Podcast notes
• Learning resources
• Interesting collections of content
dsboost.dev
10/10