In the space of classification problems in machine learning, Random Forest and Logistic Regression are two very popular algorithms that are also beginner friendly.
First, what is a classification problem?
A classification problem is simply one where you need to classify an observation into one of a set of pre-defined categories, based on the features of that observation.
E.g: Predicting if a customer will take up your next offer
Overview of the algorithms
Random Forest
An extension of a simple decision tree, the only difference being that this algorithm provides the combined result of a number of such trees, hence the word ‘Forest’.
A single decision tree looks at all the features to classify an observation. But each tree in a Random Forest model only looks at a randomly selected subset of the complete feature set to arrive at its conclusion, hence the word ‘Random’.
What improves the performance of a Random Forest model over a traditional decision tree is that, by randomly selecting subsets of features, some trees in the forest are able to isolate the more important features, increasing the overall accuracy of the result.
Logistic Regression
Logistic Regression does not predict the exact category your observation should be in; instead, it gives you the probability that each observation falls into the category ‘1’.
The probability is predicted via a simple mathematical calculation, which looks as follows:

P (probability of being ‘1’) = 1 / (1 + e^(-z)), where

z = c + a1X1 + a2X2 + … + anXn

Here X1, X2, …, Xn are the features of the observation, and a1, a2, …, an are the ‘weights’ of each feature. The higher the weight of a feature, the more prominent that feature is in making the decision.
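The calculation above can be sketched in a few lines of Python (the feature values and weights below are made up purely for illustration):

```python
import math

def logistic_probability(features, weights, intercept):
    """Probability that an observation belongs to class '1'.

    features  : feature values X1..Xn of one observation
    weights   : learned weights a1..an
    intercept : the constant term c
    """
    z = intercept + sum(a * x for a, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

# Hypothetical weights for a two-feature model
p = logistic_probability([2.0, 0.5], [0.8, -1.2], intercept=-0.4)
print(round(p, 3))  # a value between 0 and 1, roughly 0.646 here
```

Note that when z = 0 the probability is exactly 0.5, which is where the default decision threshold comes from.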
Let’s review how each of the models behave in different contexts.
Availability of the algorithms to use
If you’re using Python, both algorithms are readily available in the scikit-learn (https://scikit-learn.org/stable/) library.
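As a minimal sketch, both models expose the same fit/predict interface in scikit-learn (the toy data below is invented for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Toy data: 6 observations, 2 features, binary target
X = [[1, 2], [2, 1], [3, 4], [6, 5], [7, 8], [8, 7]]
y = [0, 0, 0, 1, 1, 1]

# Both classifiers are trained and queried the same way
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
lr = LogisticRegression().fit(X, y)

print(rf.predict([[2, 2]]), lr.predict([[2, 2]]))  # both lie near the class-0 cluster
```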
Handling Categorical features
Categorical features are those which classify each observation into one of a finite set of categories.
E.g: Gender, Country of origin
Most of these features come as ‘text’ in the raw data, but both of the above models accept only numerical input.
Random Forest – Encoding each category with a numerical value allows the model to work with categorical features.
Logistic Regression – Since Logistic Regression depends on a calculation based on ‘weights’, numerically encoding categorical variables can lead the algorithm to treat certain categories as more important than others, depending on the number assigned.
E.g: Let’s say we need to classify whether a fruit is poisonous based on a set of features, including numerical features such as ‘diameter of the stone (seed)’, ‘thickness of the skin’ and ‘time taken to fully ripen’, and a categorical feature, ‘colour of the skin’.
Target = 1 if poisonous, 0 if not
Colour of the skin ∈ {Red, Green, Yellow}
If we encode them as Red = 1, Green = 2 and Yellow = 3, the model will assume that a yellow skin makes the fruit more likely to be poisonous, solely because of the values we assigned.
To avoid this, in logistic regression (and other ‘weight’-based algorithms) we use a method called ‘one-hot encoding’. The process involves creating a new column for each category, where the column value is ‘1’ if the observation falls into that category.
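A quick sketch of one-hot encoding using pandas (the fruit colours below follow the example above; the column name is illustrative):

```python
import pandas as pd

# Hypothetical raw observations with a categorical skin-colour feature
df = pd.DataFrame({"skin_colour": ["Red", "Green", "Yellow", "Red"]})

# One new column per category; the matching column gets 1, the rest 0
encoded = pd.get_dummies(df, columns=["skin_colour"])
print(encoded)
```

Each row now carries three equal-footing indicator columns instead of one arbitrary number, so no colour looks "larger" than another to the weights.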
Ability to extrapolate
Random Forest performs well when the values of the numerical features in the test data fall within the range of the training data. However, it fails to classify correctly when the test data falls outside that range.
On the contrary, Logistic Regression can perform well even when the numerical features of the test data are outside the range of the training data, because it is built on an arithmetic function.
Flexibility in classifying the end result
The output of the Random Forest model is a classified result, 1 or 0, while the output of Logistic Regression is the probability of the observation falling into category ‘1’.
The latter therefore gives us more flexibility in deciding how to classify the output: we can change the threshold probability depending on the application (the default threshold we generally use is 0.5).
Application: Predicting if a patient has a particular disease depending on symptoms.
Context: We have enough funds to treat the patients, and the treatments have very few side effects. But if left untreated, the disease can be fatal. Therefore it is acceptable to predict erroneously that a patient has the disease, but we cannot misclassify a diseased patient as healthy.
What to do: Reduce the decision threshold of the output until the false negative rate (% of diseased patients misclassified as healthy) reaches 0.
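As a sketch, lowering the threshold with scikit-learn might look like this (the single ‘symptom score’ feature, the data and the 0.3 threshold are all illustrative, not medical guidance):

```python
from sklearn.linear_model import LogisticRegression

# Toy data: one hypothetical symptom score, 1 = has the disease
X = [[0.1], [0.4], [0.5], [0.6], [0.9], [1.2]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)

# Probability of class '1' rather than a hard prediction
probs = model.predict_proba([[0.55]])[:, 1]

# A threshold below the default 0.5 trades false positives
# for fewer false negatives
threshold = 0.3
prediction = (probs >= threshold).astype(int)
```

A borderline patient who would be classified as healthy at the default 0.5 threshold can still be flagged for treatment at 0.3.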
If your target has more than 2 classes, then
Random Forest can classify your data into each of them with just one model.
Logistic Regression – being a binary classifier, it has to be extended, typically via a ‘one-vs-rest’ scheme that trains one binary model per class; so for ‘n’ classes you train ‘n’ models.
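A sketch of the multi-class case (three invented classes on a single toy feature); scikit-learn’s OneVsRestClassifier makes the one-model-per-class strategy explicit:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 3 well-separated classes
X = [[0], [1], [2], [10], [11], [12], [20], [21], [22]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Random Forest handles all 3 classes with a single model
rf = RandomForestClassifier(random_state=0).fit(X, y)

# One-vs-rest fits one binary logistic model per class
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))  # one underlying model for each of the 3 classes
```

(For completeness: scikit-learn’s LogisticRegression can also fit a multinomial model directly, but conceptually it still carries one set of weights per class.)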
Deployment of the model
Random Forest – You will need to invoke the trained model itself from your client application to run predictions on new data. If your application is written in a different language, you will need to find a way to call your Python-based model from the app.
Logistic Regression – You can either invoke the trained model, or export the model coefficients and implement the mathematical expression directly within your client application. This makes deployment of the model notably easier and more intuitive outside Python environments.
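As a sketch, exporting the coefficients and re-implementing the prediction is just the formula from earlier (the training data and helper name below are illustrative):

```python
import math
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

c = model.intercept_[0]   # the constant term c
a = model.coef_[0]        # one weight per feature

def predict_elsewhere(x):
    """The same maths, portable to any language: P = 1 / (1 + e^-z)."""
    z = c + sum(w * xi for w, xi in zip(a, x))
    return 1 / (1 + math.exp(-z))

# The hand-rolled formula reproduces scikit-learn's own probability
print(abs(predict_elsewhere([2.5]) - model.predict_proba([[2.5]])[0, 1]) < 1e-9)
```

Once c and the weights are exported, the same two lines of arithmetic can be written in SQL, JavaScript, or any other language your client application uses.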
The ultimate question: which model performs better?
It depends entirely on your data set. The only way to know is to test, iterate and test again!