I often come across posts and comments in which people claim things like ‘Linear regression is all about predictions’.
Well, they are wrong, but I don’t quite blame them. Thanks to the machine learning takeover of statistical nomenclature, any prediction task is now labelled a ‘regression task’!
This is, of course, to distinguish it from classification tasks. This peculiar nomenclature, born out of poor understanding, is why logistic regression is wrongly considered a classification algorithm.
Both Adrian Olszewski and I have written extensively about why logistic regression is regression. And thanks largely to the undeterred efforts of Adrian, even scikit-learn changed its documentation to reflect the fact that logistic regression is regression, merely used for classification when an external cut-off (threshold) like 0.5 or 0.6 is applied.
You can check out all the relevant links to the ‘Logistic regression is regression’ saga in the resources section.
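To make the ‘regression used as classification’ point concrete, here is a minimal sketch in Python (the data is simulated by me purely for illustration and is not from any of the posts above): logistic regression itself returns probabilities, and a class label only appears once we impose an external cut-off.

```python
# Minimal sketch with simulated toy data (my assumption, not from the original posts).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# The regression output: estimated probabilities P(y = 1 | x).
probs = model.predict_proba(X)[:, 1]

# 'Classification' only happens here, via an externally chosen threshold.
labels = (probs >= 0.5).astype(int)

print(probs[:5], labels[:5])
```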
Ok, now let’s come back to why Linear Regression is not all about predictions.
So here are some clues.
► The goodness-of-fit measures.
Let’s take the example of the R squared value. Well, if you thought the R squared value measures the predictive power of linear regression, sorry, you are wrong.
The R squared value is a goodness-of-fit measure. It is more about retrodiction than prediction. I know you might be wondering, ‘what is retrodiction?’
In short, retrodiction is about making predictions about the past, i.e. about data you have already observed.
But does the R squared value do that?
Yes. Being a goodness-of-fit measure, R squared tests how well the model describes the data that went into generating that fit in the first place.
The term ‘goodness of fit’ itself offers a clue: we are testing how well the model fits the data.
You see, the earliest tools to gauge linear regression’s performance were goodness-of-fit measures, not MAPE or actual-vs-predicted metrics.
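To see this in code, here is a minimal sketch (again on simulated data of my own, so the numbers are illustrative only): R squared is computed on the very same data that produced the fit, with no hold-out set in sight.

```python
# Minimal sketch with simulated data (assumption, for illustration only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=200)

X = sm.add_constant(x)      # design matrix with an intercept column
fit = sm.OLS(y, X).fit()    # fit on the full sample

# R squared: how well the model describes *this* sample (retrodiction),
# not how well it will predict unseen data.
print(fit.rsquared)
```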
► Estimation of parameters.
The second biggest clue is that linear regression has its origins in the design of experiments.
Although hardly anyone thinks about linear regression this way anymore, that does not mean this aspect has ceased to exist.
Linear regression is still about sampling: drawing samples and estimating the population parameters through sample statistics.
It is because of this sampling philosophy that you still see confidence intervals in the output tables. Again, a link allaying misconceptions about confidence intervals is in the resources section.
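Here is one more small sketch (simulated data again, purely my illustration) of this estimation view: the core OLS output is parameter estimates, their standard errors, and confidence intervals; prediction appears nowhere.

```python
# Minimal sketch with simulated data (assumption, for illustration only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=200)

fit = sm.OLS(y, sm.add_constant(x)).fit()

print(fit.params)                 # point estimates of the intercept and slope
print(fit.bse)                    # standard errors of those estimates
print(fit.conf_int(alpha=0.05))   # 95% confidence intervals for the parameters
```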
So, overall, you see the goal is to estimate the parameters, not just to predict.
Prediction is just a positive side effect of having fit the model correctly. And you have fit the model correctly because you have good estimates of the parameters.
I hope it is now clear why linear regression is not only about prediction.
Resources:
- Why R squared value is not about predictive power: https://www.linkedin.com/posts/venkat-raman-analytics_linearregression-datascience-statistics-activity-7001158665948852224-RSTh?utm_source=share&utm_medium=member_desktop
- Adrian Olszewski's scikit-learn note: https://github.com/scikit-learn/scikit-learn/issues/24611
- Linear regression understanding (see comments of this post): https://www.linkedin.com/posts/venkat-raman-analytics_linearregression-datascience-statistics-activity-6977852891667603457-ZYN4?utm_source=share&utm_medium=member_desktop
- Clearing misconceptions about confidence intervals: https://www.linkedin.com/posts/venkat-raman-analytics_statistics-datascience-analytics-activity-6817705233091825664-RhN0?utm_source=share&utm_medium=member_desktop
- Logistic Regression meme: https://bit.ly/3qsHhiV
- Logistic Regression: Classification if you use Python, Regression if you use R: https://www.linkedin.com/posts/venkat-raman-analytics_machinelearning-datascience-artificialintelligence-activity-6806113669873831936-oxyA?utm_source=share&utm_medium=member_desktop