Revise Supervised Machine Learning Algorithms

Get Insights into Supervised Machine Learning Algorithms

Manthan Bhikadiya 💡
13 min read · Aug 14, 2023

Basic Terminologies

Supervised Learning: Teaching the computer using labeled examples, so it can make predictions or classifications when given new data.

Unsupervised Learning: Letting the computer find patterns on its own in unlabeled data, like grouping similar things together without being told how.

Classification: Training the computer to put things into categories, like deciding if an email is spam or not based on patterns in past emails.

Regression: Predicting a number, like estimating the price of a house based on its features, using patterns from existing data.

Linear Regression

“Linear Regression” is named for the two ideas it combines. The term “regression” comes from the statistical concept of estimating relationships between variables in order to predict a continuous (numeric) target, and “linear” indicates the specific form of that relationship: the model fits a linear relationship between the input features and the target variable.

Linear regression aims to minimize the sum of squared differences between predicted and actual values. It’s simple, interpretable, and works well when the relationship between features and target is linear.

Linear Regression Unique Traits:

  • Focuses on predicting continuous numeric values.
  • The output is a straight line (or, with multiple features, a hyperplane) that captures the relationship between features and the target.
  • Minimizes the sum of squared errors.
  • Used for tasks like predicting prices, sales, or any numeric quantity.
  1. Assumptions: Linear regression has certain assumptions, such as the assumption of linearity (relationship between variables), independence of errors (no correlation between residuals), and homoscedasticity (constant variance of residuals). Violations of these assumptions can affect the validity of the model’s results.
  2. Bias-Variance Trade-off: Linear regression can suffer from the bias-variance trade-off. A simple linear model might have high bias (underfitting), while a complex model might have high variance (overfitting). Striking the right balance is crucial for good model performance.
  3. Multicollinearity: This occurs when two or more predictor variables are highly correlated, leading to issues in interpreting the importance of individual variables. It can also cause instability in coefficient estimates.
  4. Regularization: Regularized linear regression techniques like Lasso (L1 regularization) and Ridge (L2 regularization) add penalty terms to the loss function to prevent overfitting. They shrink the coefficient estimates towards zero, potentially improving generalization (see the sketch after this list).
  5. Heteroscedasticity: This is when the residuals (the differences between predicted and actual values) have varying levels of scatter as you move along the predictor variable. It can impact the reliability of the model’s predictions.
  6. Outliers and Leverage Points: Linear regression can be sensitive to outliers, which are data points that deviate significantly from the rest of the data. Leverage points are observations that have an unusually high or low value of a predictor variable, affecting the model’s fit.
  7. Feature Scaling: Ordinary linear regression’s closed-form solution is not sensitive to feature scaling (i.e., the scale of predictor variables), but scaling can speed up convergence when the model is fit with gradient-based optimization, and it matters for regularized variants such as Ridge and Lasso.
  8. Endogeneity: Endogeneity arises when a predictor variable is correlated with the error term. This violates the assumption of exogeneity (the error term is not correlated with any predictor variable). It can lead to biased and inconsistent coefficient estimates.
  9. P-Values and Significance: P-values associated with coefficient estimates indicate whether those estimates are statistically significant. However, relying solely on p-values for variable selection can be problematic, especially in the presence of multicollinearity.
  10. Interpretability: The coefficient estimates in linear regression can be directly interpreted. For example, a coefficient of 0.5 for a predictor variable means that, on average, a one-unit increase in that variable is associated with a 0.5-unit increase in the response variable.
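To make the traits above concrete, here is a minimal scikit-learn sketch that fits ordinary least squares alongside Ridge and Lasso; the synthetic data and the alpha values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: the target depends linearly on two features plus noise (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least squares: minimizes the sum of squared residuals.
ols = LinearRegression().fit(X_train, y_train)

# Ridge (L2) and Lasso (L1) add penalty terms; the alpha values are arbitrary examples.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

for name, model in [("OLS", ols), ("Ridge", ridge), ("Lasso", lasso)]:
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(name, "coefficients:", model.coef_, "test MSE:", round(mse, 3))
```

The printed coefficients can be read exactly as point 10 describes: each one is the average change in the target for a one-unit increase in that feature, holding the others fixed.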

Logistic Regression

Logistic Regression is a statistical method used for modeling the probability of a binary outcome or event based on one or more predictor variables. Despite its name, it is actually a type of regression analysis that is used for classification tasks. It’s a fundamental algorithm in machine learning and is particularly useful when dealing with problems where the dependent variable (the outcome we’re trying to predict) is categorical and has only two possible outcomes, typically represented as 0 and 1, or “negative” and “positive.”

The term “logistic” in Logistic Regression is derived from the mathematical function it uses to model the probability of the binary outcome. This function is called the logistic function (also known as the sigmoid function). The sigmoid function takes any input and maps it to a value between 0 and 1, which can be interpreted as the probability of the event occurring.
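As a quick illustration of that mapping, here is a minimal NumPy sketch of the sigmoid function; the input values are arbitrary examples.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# In logistic regression, z is the linear combination of features and weights:
# z = w0 + w1*x1 + ... + wn*xn, and sigmoid(z) is read as P(y = 1 | x).
print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approximately [0.018, 0.5, 0.982]
```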

  1. Maximum Likelihood Estimation: The coefficients in Logistic Regression are typically estimated using the Maximum Likelihood Estimation (MLE) technique. MLE finds the parameter values that maximize the likelihood of the observed data given the model.
  2. Non-linear Decision Boundaries: Although it’s called “Logistic Regression,” the decision boundaries created by the model can be non-linear. This is achieved through feature transformations or by introducing interaction terms.
  3. Multinomial Logistic Regression: Logistic Regression can be extended to handle more than two classes. In this case, it’s known as Multinomial Logistic Regression. It models the probabilities of each class and uses the class with the highest probability as the prediction.
  4. Regularization Techniques: Just like linear regression, Logistic Regression can also benefit from regularization techniques like L1 (Lasso) and L2 (Ridge) regularization. These techniques help prevent overfitting and can improve the model’s generalization. Logistic Regression can also be combined with both L1 and L2 regularization, which is known as Elastic Net regularization (illustrated in the sketch after this list).
  5. Imbalanced Data: Logistic Regression can be sensitive to class imbalance, where one class has significantly more examples than the other. Techniques like oversampling, undersampling, or using different class weights can help address this.
  6. Assumption of Linearity: While Logistic Regression doesn’t require a linear relationship between predictors and the log-odds of the outcome, violating this assumption can affect the model’s performance. Techniques like feature engineering or using polynomial terms can address non-linearity.
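A minimal scikit-learn sketch tying several of the points above together (regularization, Elastic Net, and class weighting for imbalance); the synthetic data and the specific penalty, solver, and class-weight choices are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic, mildly imbalanced binary data (illustrative).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# L2-regularized logistic regression; class_weight="balanced" upweights the minority
# class to counter the imbalance. C is the inverse of the regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Elastic Net (mixed L1/L2) requires the "saga" solver in scikit-learn.
enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, solver="saga", max_iter=5000)
enet.fit(X_train, y_train)
```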

Naive Bayes

Naive Bayes is a simple and efficient classification algorithm. It predicts the probability of an event based on observed features, assuming they’re independent. It’s often used in text categorization and spam filtering. Despite its basic assumptions, Naive Bayes can provide effective results and works well when you have limited data or need a quick solution.

  1. Assumption of Independence: Naive Bayes relies on the assumption of feature independence, which is why it’s called “naive.” It assumes that the presence of one feature is independent of the presence of other features, given the class label. While this assumption might not always hold true in real-world data, Naive Bayes can still perform surprisingly well in practice, especially when the features are only weakly correlated.
  2. Zero Probability Problem: One challenge with Naive Bayes is the potential for zero probabilities. If a particular combination of features in the test data has never been seen in the training data, the conditional probability for that combination becomes zero. This can cause the entire product of probabilities in the Bayes’ formula to be zero, leading to inaccurate predictions. To mitigate this issue, techniques like Laplace smoothing or additive smoothing are often used to assign small probabilities to unseen combinations.
  3. Continuous and Categorical Data: Naive Bayes can handle both continuous and categorical data. For continuous features, a common approach is to assume a certain distribution (like Gaussian) and estimate the parameters from the training data. For categorical features, probabilities are directly calculated based on the frequency of occurrences in the training set.
  4. Multinomial Naive Bayes: While the classic Naive Bayes algorithm is often associated with text classification (using the bag-of-words model), there’s also a variant called Multinomial Naive Bayes. It’s specifically designed for discrete data, like text, where the feature values represent counts (e.g., word frequencies); see the sketch after this list.
  5. Feature Engineering: Feature engineering can significantly impact Naive Bayes’ performance. Choosing relevant and discriminative features is important, as the model’s performance heavily depends on the features’ predictive power.
  6. Imbalanced Classes: Naive Bayes can struggle with imbalanced class distributions. If one class has significantly more examples than the other, the model might become biased towards the majority class. Techniques like oversampling, undersampling, or using more advanced versions of Naive Bayes (e.g., weighted Naive Bayes) can help mitigate this issue.
  7. Text Classification: Naive Bayes is widely used in text classification tasks, such as spam detection or sentiment analysis. Despite its simplicity, it can often outperform more complex algorithms due to its effectiveness in handling high-dimensional and sparse data.
  8. Online Learning: Naive Bayes is well-suited for online learning scenarios where new data arrives incrementally. Since the model updates probabilities based on new data without requiring access to the entire training dataset, it can adapt to changing conditions efficiently.
  9. Ensemble Methods: While Naive Bayes is generally not used directly in ensemble methods like random forests or boosting, it can serve as a base classifier in these ensembles, contributing to a diverse set of classifiers.
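As a rough sketch of points 2 and 4 above, here is Multinomial Naive Bayes on a tiny bag-of-words example; the toy messages and the alpha value are illustrative assumptions, and a real spam filter would be trained on far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny toy corpus (illustrative).
texts = ["win a free prize now", "free money offer", "meeting at noon tomorrow",
         "project update attached", "claim your free prize", "lunch tomorrow?"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

# alpha=1.0 is Laplace (additive) smoothing, which avoids zero probabilities for unseen words.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["free prize tomorrow"]))    # likely predicted as spam
print(model.predict_proba(["project meeting"]))  # class probabilities
```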

K-Nearest Neighbours

K-Nearest Neighbors (KNN) is a simple yet powerful machine learning algorithm used for classification and regression tasks. In KNN, the idea is to make predictions based on the “neighbors” of a data point.

  1. Neighbor Selection: Given a new data point (an instance with features), KNN identifies the K closest data points from the training dataset based on a distance metric, often using Euclidean distance.
  2. Voting (Classification) or Averaging (Regression): For classification tasks, KNN takes a majority vote among the K neighbors to determine the class label of the new data point. For regression tasks, it calculates the average (or weighted average) of the target values of the K neighbors to predict a numeric value.
  3. Hyperparameter K: The value of K is a crucial hyperparameter in KNN. It determines the number of neighbors to consider when making predictions. A small K can make the model sensitive to noise, while a large K can make the model overly biased.
  4. Distance Weighting: In some cases, you can assign different weights to the neighbors based on their distance. Closer neighbors might have more influence on the prediction, which can be particularly useful when you want to prioritize nearby instances.
  5. Feature Scaling: Feature scaling is important in KNN because it’s sensitive to the scale of the features. Features with larger scales can dominate the distance calculation, so it’s common to normalize or standardize the data before applying KNN.
  6. Computational Cost: One drawback of KNN is its computational cost during prediction. To make predictions, it needs to compare the new instance with all training instances to find the nearest neighbors. This can be time-consuming, especially with large datasets.
  7. Curse of Dimensionality: KNN can struggle when dealing with high-dimensional data because the concept of distance becomes less meaningful as the number of dimensions increases. As a result, the nearest neighbors might not be as informative.

It’s particularly useful when you don’t have prior knowledge about the underlying data distribution and want a non-parametric method. However, its performance can vary depending on the choice of K, distance metric, and the nature of the data.
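A minimal scikit-learn sketch of the points above (feature scaling, the choice of K, and distance weighting); the dataset, the candidate values of K, and the cross-validation setup are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Scale features first, because KNN's distance computation is scale-sensitive.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Search over K and over uniform vs. distance weighting (closer neighbours count more).
param_grid = {"knn__n_neighbors": [3, 5, 7, 11], "knn__weights": ["uniform", "distance"]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```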

Support Vector Machines

The Support Vector Machine (SVM) is a versatile machine learning algorithm used for both regression and classification tasks. It aims to find a hyperplane (a decision boundary) that best separates the data points of different classes, or that best fits the data when predicting target values in the case of regression. The “support vectors” are crucial elements in SVM that play a central role in defining this hyperplane.

In classification, SVM seeks to find a hyperplane that maximizes the margin between the classes. The margin is the distance between the hyperplane and the nearest data points of each class, known as support vectors. These support vectors are the data points that are closest to the decision boundary and influence the position and orientation of the hyperplane.

SVM aims to achieve a balance between maximizing the margin and minimizing the classification error. Depending on the type of SVM (linear, polynomial, radial basis function, etc.), the algorithm finds the hyperplane that best separates the classes while taking into account the characteristics of the data and the chosen kernel function. The choice of kernel determines how SVM transforms the input data into a higher-dimensional space to find a suitable decision boundary.
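A brief sketch of a soft-margin SVM classifier with an RBF kernel in scikit-learn; the toy dataset and the C and gamma values are illustrative assumptions (C trades margin width against classification error).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Non-linearly separable toy data (illustrative).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space;
# C balances a wide margin against misclassified training points.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

print("support vectors per class:", clf.named_steps["svc"].n_support_)
print("test accuracy:", clf.score(X_test, y_test))
```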

In SVM regression, the hyperplane is chosen to have a margin of tolerance around the actual target values. The data points that fall within this margin are considered support vectors. The algorithm’s goal is to find the hyperplane that minimizes the sum of the margin violations and the difference between predicted and actual target values.
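And a corresponding regression sketch: epsilon defines the margin of tolerance around the targets, and points outside that tube become support vectors. The toy data, C, and epsilon values below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# 1-D toy regression data (illustrative).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon is the width of the tolerance tube; errors inside it are not penalized.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)
reg.fit(X, y)

print("number of support vectors:", len(reg.support_))
print("prediction at x=2.5:", reg.predict([[2.5]]))
```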

Decision Tree

Decision trees are fundamental machine learning algorithms used for both classification and regression tasks. They work by recursively partitioning the data into subsets based on the values of input features, ultimately leading to decision rules that can be used to make predictions or classify new instances.

Decision trees have several unique characteristics that make them stand out as a machine learning algorithm:

  1. Interpretability: Decision trees provide a clear and intuitive representation of how decisions are being made. The tree structure is easy to understand and can be visualized graphically, making it accessible to both technical and non-technical audiences.
  2. Non-linearity: Decision trees can capture non-linear relationships in the data. They can handle complex decision boundaries that are not easily achievable using linear models.
  3. Feature Importance: Decision trees naturally rank the importance of features by their placement in the tree. Features that appear closer to the root are more influential in making decisions (see the sketch after this list).
  4. Handles Missing Values: Decision trees can handle missing values without requiring imputation. They simply direct data down different branches based on available information.
  5. Ensemble Methods: Decision trees serve as building blocks for powerful ensemble methods like Random Forests and Gradient Boosting. These methods combine multiple decision trees to improve accuracy and generalization.
  6. Mixed Data Types: Decision trees can work with both categorical and numerical features without requiring explicit feature scaling.
  7. Outliers: Decision trees are robust to outliers since they partition data into subsets based on simple rules.
  8. No Assumptions: Decision trees do not make assumptions about the distribution of data or the relationship between features, making them versatile across various types of datasets.
  9. Feature Interaction: Decision trees naturally capture feature interactions, allowing them to identify complex patterns where the effect of one feature depends on the presence of another.
  10. Quick Learning: Decision trees tend to require less data preprocessing compared to other algorithms. They can work well with raw or slightly preprocessed data.
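A short sketch illustrating the interpretability and feature-importance points from the list above; the dataset and the depth limit are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# The learned rules can be printed directly, which is what makes trees easy to explain.
print(export_text(tree, feature_names=list(data.feature_names)))

# Features used near the root tend to receive higher importance scores.
for name, score in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {score:.3f}")
```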

The structure of a decision tree is similar for both regression and classification tasks, but the way it makes decisions and assigns outcomes differs.

Decision Tree Structure in Classification:

  1. Root Node: The top node of the tree; it represents the entire dataset. It chooses the feature that best splits the data based on a certain criterion (e.g., Gini impurity or entropy). The selected feature becomes the root’s decision rule.
  2. Internal Nodes: These nodes represent decisions based on features. Each internal node tests a feature’s value, leading to different branches based on the feature’s possible values. These nodes continue to split the data into subsets.
  3. Leaf Nodes: These are the endpoints of the branches, representing the final decision or class assignment. Each leaf node contains a class label that the instance belongs to, based on the majority class in that subset.

Decision Tree Structure in Regression:

  1. Root Node: Similar to classification, the root node starts by choosing a feature that best splits the data based on a criterion (e.g., mean squared error). The selected feature and threshold become the root’s decision rule.
  2. Internal Nodes: These nodes also represent decisions based on features, but instead of class labels, they predict numeric values. Similar to classification, each internal node tests a feature’s value and leads to different branches based on the feature’s possible values.
  3. Leaf Nodes: In regression, the leaf nodes contain predicted numeric values. These values are usually computed as the average (or weighted average) of the target values in that subset.
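A side-by-side sketch of the two structures above: a classification tree splitting on Gini impurity and a regression tree splitting on squared error, with the regression leaves holding averaged target values. The synthetic data and the depth limits are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))

# Classification: leaf nodes hold the majority class of their subset.
y_class = (X[:, 0] * X[:, 1] > 0).astype(int)
clf = DecisionTreeClassifier(criterion="gini", max_depth=4).fit(X, y_class)

# Regression: leaf nodes hold the mean target value of their subset.
y_reg = X[:, 0] ** 2 + rng.normal(scale=0.2, size=300)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=4).fit(X, y_reg)

print("classification accuracy:", clf.score(X, y_class))
print("regression R^2:", reg.score(X, y_reg))
```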

Random Forest

Random Forest is an ensemble machine learning algorithm that combines multiple decision trees to create a more robust and accurate model for both classification and regression tasks. The algorithm constructs a “forest” of decision trees and aggregates their predictions to make more reliable and robust predictions.

Workflow of the Random Forest Algorithm:

  1. Data Collection and Preprocessing:
    Gather the dataset and preprocess it by handling missing values, encoding categorical features, and scaling numerical features.
  2. Bootstrapped Sampling (Random Sampling with Replacement):
    For each decision tree in the forest, create a random subset of the training data by selecting samples with replacement. This process is called bootstrapped sampling.
    The size of each subset is typically the same as the original dataset, but some samples may appear more than once, while others may not appear at all.
  3. Random Feature Selection:
    At each node of each decision tree, a random subset of features is selected to split the node.
    This helps to decorrelate the trees and reduces overfitting, ensuring that no single feature dominates the decision-making process.
  4. Construction of Decision Trees:
    For each bootstrapped dataset, construct a decision tree using a specific criterion (such as Gini impurity for classification or mean squared error for regression).
    Recursively split the nodes based on the selected features, aiming to create binary splits that maximize information gain or minimize impurity (classification), or minimize variance (regression).
  5. Voting (Classification) or Averaging (Regression):
    For classification tasks, each tree “votes” for a class, and the class with the most votes becomes the final prediction.
    For regression tasks, each tree’s prediction is averaged to obtain the final regression prediction.
  6. Ensemble Aggregation:
    The individual decisions of multiple trees are combined to make a final prediction, reducing the impact of individual errors.
    This aggregation helps in improving accuracy, reducing overfitting, and increasing model stability.
  7. Model Evaluation:
    Evaluate the performance of the Random Forest model on a separate validation or test dataset using appropriate metrics (accuracy, F1-score, RMSE, etc.).
    Adjust hyperparameters (number of trees, maximum depth, etc.) through techniques like cross-validation to optimize model performance (a minimal end-to-end sketch follows these steps).
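Here is a minimal end-to-end sketch of the workflow above in scikit-learn; the dataset, the hyperparameter grid, and the scoring metric are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# bootstrap=True draws each tree's training sample with replacement; max_features
# controls the random feature subset considered at each split.
rf = RandomForestClassifier(bootstrap=True, random_state=0)
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "max_features": ["sqrt", 0.5],
}
grid = GridSearchCV(rf, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test F1:", grid.score(X_test, y_test))
```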

AdaBoost

AdaBoost, short for Adaptive Boosting, is an ensemble learning algorithm used for classification tasks. It combines the predictions of multiple “weak” classifiers (typically decision trees with limited depth) to create a strong and more accurate classifier. AdaBoost focuses on learning from the mistakes of previous classifiers and assigning higher weights to misclassified instances, allowing subsequent weak classifiers to focus on those instances.

Workflow of the AdaBoost Algorithm:

  1. Data Collection and Preprocessing:
    Gather the dataset and preprocess it by handling missing values, encoding categorical features, and scaling numerical features.
  2. Initialize Weights:
    Assign equal weights to all training instances. These weights represent the importance of each instance in the learning process.
  3. Loop through Iterations (T):
    For each iteration, train a weak classifier (often a decision tree) on the training data using the current weights.
    The weak classifier’s performance is evaluated using a weighted error rate, which considers the instance weights. It is calculated as the sum of the weights of misclassified instances.
    Compute the weight of the weak classifier in the final ensemble based on its performance (accurate classifiers receive higher weight).
    Adjust the instance weights:
    Increase the weights of misclassified instances, making them more likely to be correctly classified in the next iteration.
    Decrease the weights of correctly classified instances to focus less on them.
    Normalize the instance weights to sum to one, ensuring they remain probabilities.
  4. Aggregate Predictions:
    Combine the predictions of all weak classifiers based on their weights. The final prediction is made by summing the weighted predictions of each weak classifier.
  5. Model Evaluation:
    Evaluate the performance of the AdaBoost model on a separate validation or test dataset using appropriate metrics (accuracy, F1-score, etc.).

AdaBoost’s strength lies in its adaptability to difficult datasets by focusing on previously misclassified instances. It improves the overall accuracy of the ensemble by giving more emphasis to instances that are harder to classify correctly. However, AdaBoost is sensitive to noisy data and outliers, which can adversely affect its performance.
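A short sketch of the workflow above using scikit-learn’s AdaBoostClassifier with decision stumps (depth-1 trees) as the weak learners; the synthetic dataset and hyperparameter values are illustrative assumptions, and the `estimator` argument assumes a recent scikit-learn release.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round fits a depth-1 tree on reweighted data; learning_rate scales how
# strongly each weak learner's vote counts in the final ensemble.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X_train, y_train)

print("test accuracy:", ada.score(X_test, y_test))
```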

