Logisitc regression ==> simple Algorithm
Naive Bayes ==> No geometric intuition. Learnt it through Probabilistic technique
Logistic Regression can be learnt from three perspectives
- Geometric intuition
- probabilistic approach which involves dense mathematics
- Loss minimization framework
- Classes are linearly or almost linearly seperable plane
- Plane passes thru the origin, this ensures that we don’t have the intercept term ‘b’, in the equation for distance of the point from a plane
- W is a unit vector (w=1) and hence ||w|| can be ignored once again the equation for distance of the point from the plane.
If data is linearly seperable → a line/hyper plane can seperate the +ve/-ve points without any mis-classified points. If data is almost linearly seperable → then the same line/hyper plane can seperate the points but there will be some point that will be mis-classified.
Line → linear decision surface
circle/parabola → quadratic decision surface
Find the plane phi, that linearly or almost linearly or best seperates the +ve/-ve points
Finally let’s say we found/discover a plane Phi and W its normal . This W is assumed as a unit vector. Then we can say all points on the same side of W are positive points and all points on the opposite/different side of W would be negative points. The decision surface in Logistic Regression is a line/plane.
Lets consider the term
The above implies that the classifier correctly classified the point as positive
Here the classifier correctly classified the point as negative
Summary of the above four cases is that if the signed distance is greater than zero then the point is correctly classified else if the signed distance is less than zero the point is a mis-classified point.
For a classifier to be very good we need to have
- maximum number of correctly classified points
- minimum number of mis-classified points
Mathematically this means,
Sigmoid function Squashing
It is not always possible to get data as depicted in Figure 1. Let us assume we have a dataset as depicted in the figure below
- a set of positive points on the same side as our normal
- a set of negative points on the opposite side of the normal
- one outlier point on the same side of the normal
We will have the same setting, but in this case we will change where we draw the plane. Lets draw the plane in such a manner so that both the positive and negative points are on the same side of the normal W and the outlier point is on the other side of the normal W. The figure below shows the setting for this case
Our objective is to find the plane that maximizes the sum of signed distances. In our setting we need to compare plane 1 and plane 2. In this case we need to compare -90 and 1 and the maximum value is 1 which corresponds to plane 2. The classifier will choose plane 2 as the decision surface (seperating plane) or hyper-plane(in high dimension).
According to the above metrics phi-1 is the best seperator whereas the classifier is choosing phi-2 as the decision surface.
One single extreme /outlier point is changing the way the classifier model behaves.This is because sum of signed distances is being impacted by outlier points. In the real-world, data can contain many outliers and hence deciding the decision surface using the signed distance will be impacted by outliers a lot . Hence choosing the decision surface using the signed distance may not produce the correct plane .
If we don’t want outliers to impact our optimization function then we need to use SQUASHING.
The big idea here is to make small modification to the formulae for simple signed distance.
If the distance is small then we keep it as is but if the distance is large we make it a small value.
In our case for phi-1 we have small distance of +1/-1, but we also have a large or high value of 100. Instead of using this large/high value we will use a smaller value.
We will create graph to geometrically depict the meaning of the above statement. On the X-axis we will have signed distance and on the y-axis we will have f(signed distance).At the origin x=0, it may be zero or it may be some other value , for our example let’s say it’s 0.5. As x-value increases , the y-values will increase sharply or linearly till the x-value reaches lets say a value of 10. For values greater than 10 the y-values will tapper off and the maximum value will not be more than a particular value. The same behaviour will be seen for the negative value of x. From x-values of 0 to -10 there will be a linear increase in y-value and after that the y-values will tapper off and will remain constant at a maximum value.
This is something similar to hashing, where we have a function f, that can take as input the signed distance . If the input is small, the same small value is returned as output. In case the input is large then, the input value is tapered off to a smaller value and returned as output. If the input is greater than a threshold value then this tapered off value is always returned as output.
There are many functions or algorithms that have this behaviour, one such function f is the SIGMOID function
This function f should behave like this
- small value → small value
- large value → tapered off maximum value
The inference from the above graph :
- for small values for x there is a linear increase in y values
- After a threshold value of x , the value tapers off and maximum constant value is returned. Any point that is very far away on the x-axis , the y-axis value is the maximum constant value.
- At x=o(which means distance=0) which is the origin the value of sigma(0) is 0.5
Reasons why we choose Sigmoid function ?
- Nice probabilistic interpretation
Consider that the point is on the hyperplane. For this point the distance is 0 and hence W transpose x-i will be zero. In this case the point is on the plane phi. If distance is equal to 0 then p(y=1) will be 0.5. This probability of 0.5, also implies that point can be either +ve/-ve point.
Consider a point that is very far away from hyper plane. W transpose x-i is large. This point is +ve in the same direction of W and is very far away from the plane phi, then its p(y=1) is very large 0.99999 which is very close to 1
Sigmoid function squashing is when distances than tend from -infinity to + infinity is converted to a range of values between 0 to 1 using the sigmoid function, such that small distance values have linear behaviour and large distance values are tapered off.
The above equation looks a bit complicated and we would like to simplify it to some extent. Since, this is an optimization problem we would like to use the property of applying strictly increasing or decreasing monotonic functions to the equation. This applications does not change the basic nature of the equation and the argmax or argmin is also protected.
log(x) is a monotonically increasing function, we will apply to this to our equation.
Time Space Complexity at Train / Runtime
Training Logistic regression means solving the optimization problem.
Trian time complexity = O(nd) where
n = number of points in Dtrain
d = dimensionality of the points
At the end of training we will get W* . This W* is a vector of d points.
The space complexity at runtime is to keep in memory or store the d points of this W* vector and nothing else. Hence, the space complexity becomes O(d).
At runtime, the time complexity is the time taken to multiple the Xq point with W vector.The size of both depend on the dimensionality of the Xq point.
Time at Runtime = d multiplications + (d-1) additions → O(d)
The dimensionality of the data decides the performance of Logistic regression.
If the dimensionality of the data points are small say between 30 and 50 then logistic regression is good and will also be able to meet the low latency requirements if needed.
If d is large then extra effort needs to be put in to make logistic regression perform better. The first option would be to use L1 — regularizer. By doing this we would increase the sparsity within the weight vector. Sparsity within the weight vector means making the less important features weights to zero. This reduces the number of multiplications and additions that we need to perform thus by reducing the time taken by Logistic regression to decide on Yq for the given Xq. .
How to handle situations where there is a low latency requirement or if there is a constraint on the hardware resoures that are available to execute the classification algorithm.
Lets say given an Xq we need to calculate Yq under n milliseconds and also that we can perform limited number of multiplications and additions. If we cannot respond within the stipulated time limit the solution will not be accepted. Now, we start increasing our lambda which will increase sparsity and we stop this when the weight vector has those many weights that equals the number of multiplications that we can perform.
When there is a latency requirement that takes precedence over other factors like bias and variance. Is it good to have a model that cannot make it to production. As lambda increases bias will increase(underfitting) and hence the model performance will suffer. The compromise is to have a bad working model rather than not having a model at all.You dont stop here as always the model should be fine tuned to increase the model performance.
Non Linearly seperable data
Initially we cannot seperate the data but after performing feature transform of squaring the data points we can draw a plane that can seperate the positive and negative points. Logistic regression will work in the space of f1-dash and f2-dash. When we get a query point we do the same transformation and then send it to the model like depicted in the figure below
Trasformations for real valued features
Like AND, XOR,OR
Like log(x), e(x).
There is a problem with using log(x) is that most of the time in the real-world there are lots of zeroes in the data set. log(0) is not defined and hence the log transformation may not be a viable option.