A motivation for Logistic Regression

Logistic regression is a strange regression model in which the dependent variable is categorical. The most common case is binary, meaning there are two categories. These two categories are usually true and false, but they could also be alive or dead, pass or fail, buy or do not buy, and so on. The basic premise here is the same as anywhere else: use data to create a model, which we then use to make decisions on new data. For example, we might want to decide whether to extend a loan to an applicant. We can look at data about previous loan applications and use it to judge whether it would be a good idea to give a loan to a new candidate. Well, we won't actually use logistic regression for that, because honestly we want to use something better. This is only a stepping stone in our learning journey.

Let's take a look at a log generated by scikit-learn with its sample data below. As you can see, logistic regression in scikit-learn does very well on the easy synthetic data, which is easily linearly separable. However, now imagine a data set of two concentric circles, where the outer circle is red and the inner circle is green. It is not possible to separate those two classes with a straight line. The best we can hope for is to place the line where the density of the red dots and the density of the green dots differ the most, giving a somewhat decent outcome. As you can see below, on the circles data the model got a score of 0.6, slightly over 0.5, on the training data, but 0.4, slightly under 0.5, on the test data. This means that while we did slightly better than blind guessing (like a blindfolded monkey) on our training data, the model is no good to us, because we did worse than blind guessing on our test data.

One more word about scores: as you might have guessed, scores range from 0 to 1, where 0 means we guessed nothing correctly and 1 means we guessed everything correctly.



INFO:root:Creating easy synthetic labeled data set
DEBUG:root:here is the model score for training data in synthetic-easy: 
DEBUG:root:1.0
DEBUG:root:here is the model score for test data in synthetic-easy: 
DEBUG:root:1.0
INFO:root:Creating medium synthetic labeled data set
DEBUG:root:here is the model score for training data in synthetic-medium: 
DEBUG:root:0.84
DEBUG:root:here is the model score for test data in synthetic-medium: 
DEBUG:root:0.82
INFO:root:Creating hard synthetic labeled data set
DEBUG:root:here is the model score for training data in synthetic-hard: 
DEBUG:root:0.76
DEBUG:root:here is the model score for test data in synthetic-hard: 
DEBUG:root:0.7
INFO:root:Creating two moons data set
DEBUG:root:here is the model score for training data in moons: 
DEBUG:root:0.86
DEBUG:root:here is the model score for test data in moons: 
DEBUG:root:0.84
INFO:root:Loading iris data set
DEBUG:root:here is the model score for training data in iris: 
DEBUG:root:0.946666666667
DEBUG:root:here is the model score for test data in iris: 
DEBUG:root:0.973333333333
INFO:root:Loading breast cancer data set
DEBUG:root:here is the model score for training data in breast_cancer: 
DEBUG:root:0.954225352113
DEBUG:root:here is the model score for test data in breast_cancer: 
DEBUG:root:0.950877192982
INFO:root:Loading digits data set
DEBUG:root:here is the model score for training data in digits: 
DEBUG:root:0.995545657016
DEBUG:root:here is the model score for test data in digits: 
DEBUG:root:0.948832035595
INFO:root:Creating two circles data set
DEBUG:root:here is the model score for training data in circles: 
DEBUG:root:0.6
DEBUG:root:here is the model score for test data in circles: 
DEBUG:root:0.4
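The kind of scores in the log above can be reproduced with a short scikit-learn sketch. This is a minimal illustration, not the exact script that produced the log: the data-set parameters (sample counts, noise levels, random seeds) are assumptions, so the exact numbers will differ, but the pattern holds, with near-perfect scores on a linearly separable data set and near-chance scores on the concentric circles.

```python
from sklearn.datasets import make_blobs, make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_and_score(X, y):
    # Hold out part of the data so we can compare training vs. test scores,
    # just as the log above reports both numbers for each data set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# An easy synthetic data set: two well-separated blobs, which a straight
# line can separate, so we expect scores close to 1.0.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
print("easy:", fit_and_score(X, y))

# Two concentric circles: no straight line separates the classes, so
# logistic regression ends up near chance (around 0.5).
X, y = make_circles(n_samples=100, noise=0.05, random_state=0)
print("circles:", fit_and_score(X, y))
```

Note that `model.score` for a classifier returns mean accuracy, which is the 0-to-1 fraction of correct guesses described above.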

https://en.wikipedia.org/wiki/Logistic_regression