Regression algorithms
Regression algorithms are machine learning algorithms used to predict numerical values from input data. They fit a mathematical model to the data to capture the relationship between the input features and the target variable, so that the model can make accurate predictions on new, unseen data.
There are many different types of regression algorithms, including the following (a short, runnable scikit-learn sketch covering these models follows the list):
1. Linear regression: Linear regression is a simple and widely used algorithm. It assumes a linear relationship between the independent variables and the target variable and estimates the coefficients of the linear equation that best fits the data. For a single feature the equation has the form y = mx + c, where y is the target variable, x is the input feature, m is the slope, and c is the intercept.
Example: predicting housing prices based on features like square footage and number of bedrooms, or estimating sales based on advertising expenditure.
2. Logistic regression: Logistic regression is a popular algorithm for binary classification problems, where the target variable has two possible outcomes (e.g., yes/no, true/false, 0/1). Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It models the relationship between the input features and the binary target variable using the logistic function, also known as the sigmoid function, σ(z) = 1 / (1 + e^(−z)), which maps any real value to a probability between 0 and 1.
Example: predicting whether a customer will churn (i.e., stop doing business with a company) based on their demographic information and purchase history.
3. Polynomial regression: Polynomial regression is an extension of linear regression in which the relationship between the variables is modeled with a polynomial equation. Adding polynomial terms such as x² or x³ to the linear equation gives the model the flexibility to capture nonlinear relationships between the input features and the target variable, which makes polynomial regression useful when the data exhibits curvilinear patterns.
Example: predicting the yield of a crop based on factors such as temperature, humidity, and rainfall.
4. Ridge regression: Ridge regression is a regularization technique that addresses overfitting in linear regression. It adds a penalty term (the L2 norm of the coefficients) to the loss function to control the complexity of the model. This penalty keeps the coefficients from becoming too large, reducing the model's sensitivity to the training data. Ridge regression is particularly useful for high-dimensional data or when multicollinearity (high correlation) exists among the input features.
Example: predicting the price of a stock based on financial indicators such as earnings per share and price-to-earnings ratio.
5. Lasso regression: Lasso regression, like ridge regression, is a regularization technique used to combat overfitting. It also adds a penalty term to the loss function, but uses the L1 norm of the coefficients instead. The L1 penalty can drive some coefficients exactly to zero, so lasso effectively performs automatic feature selection. This makes it useful for datasets with many features, or when the goal is to identify the most influential variables.
Example: predicting the likelihood of a customer purchasing a product based on their browsing and purchase history on a website.
6. Elastic net regression: Elastic net regression combines the ridge and lasso regularization techniques. Its penalty term is a linear combination of the L1 (lasso) and L2 (ridge) norms of the coefficients. This hybrid approach allows for feature selection while also providing stability and reducing the impact of multicollinearity. Elastic net is useful when there are many correlated features and the goal is both to select the relevant ones and to mitigate multicollinearity.
Example: predicting the demand for a product based on factors such as price, advertising spend, and competitor activity.
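To make the models above concrete, here is a minimal sketch of how each one can be fitted with scikit-learn. The data is synthetic, and the hyperparameter values (degree, alpha, l1_ratio) are illustrative assumptions rather than tuned choices:
import numpy as np
from sklearn.linear_model import (LinearRegression, LogisticRegression,
                                  Ridge, Lasso, ElasticNet)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y = 3x + 2 plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, 100)

# 1. Linear regression: estimates m and c in y = mx + c
lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)  # approximately [3.0] and 2.0

# 2. Logistic regression: expects a binary target
y_bin = (y > y.mean()).astype(int)
logit = LogisticRegression().fit(X, y_bin)
print(logit.predict_proba(X[:3]))  # class probabilities from the sigmoid

# 3. Polynomial regression: linear regression on polynomial features
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)

# 4-6. Regularized variants: alpha sets the penalty strength, and
# l1_ratio mixes the L1 and L2 penalties for elastic net
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)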
Why do we use Regression Analysis?
Regression analysis is a statistical method used to examine the relationship between a dependent variable and one or more independent variables. It is used for a variety of purposes, including:
- Prediction: Regression analysis can be used to predict the values of the dependent variable based on the values of the independent variables. For example, if we want to predict the sales of a product based on advertising expenditure and the size of the market, we can use regression analysis to determine the relationship between these variables and predict sales from them (see the short sketch after this list).
- Hypothesis testing: Regression analysis can be used to test hypotheses about the relationship between the dependent and independent variables. For example, we can test whether there is a significant relationship between smoking and lung cancer by using regression analysis.
- Controlling for variables: Regression analysis can be used to control for other variables that may affect the relationship between the dependent and independent variables. For example, when examining the relationship between income and health, we may want to control for variables such as age, gender, and education.
- Forecasting: Regression analysis can be used to forecast future trends based on historical data. For example, we can use regression analysis to forecast the demand for a product based on past sales data and other relevant variables.
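As a small illustration of the prediction use case above, the following sketch fits a linear regression that predicts sales from advertising expenditure. The numbers are made up for illustration only:
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (in $1000s) and resulting sales
ad_spend = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
sales = np.array([25.0, 44.0, 68.0, 81.0, 105.0])

model = LinearRegression().fit(ad_spend, sales)

# Predicted sales for $35k of advertising spend
print(model.predict([[35.0]]))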
Overall, regression analysis is a useful tool for analyzing and understanding the relationship between variables and for making predictions and informed decisions based on that relationship.
How to Open the Jupyter Notebook
All the code is available in my GitHub repository: https://github.com/arunp77/Machine-Learning/
1. Open Google Colab (https://colab.research.google.com/) and sign in with your Google account. Create a new notebook and mount Google Drive (so the notebook can be saved to your Drive) using:
from google.colab import drive
drive.mount('/content/drive')  # mounts Drive at /content/drive
Follow the instructions provided to authorize access to your Google Drive.
Fetch the notebook from GitHub: in a new code cell in your Colab notebook, use the following code to clone the repository and copy the notebook to your Drive, from where it can be opened in Colab:
!pip install gitpython
import git
import shutil

# Clone the repository into the Colab workspace
git.Repo.clone_from('https://github.com/arunp77/Machine-Learning.git', '/content/Machine-Learning')

# Copy the notebook into the mounted Drive, then open it in Colab
# via File > Open notebook > Google Drive
shutil.copy('/content/Machine-Learning/ML-Fundamental/0.2-Regression.ipynb',
            '/content/drive/MyDrive/0.2-Regression.ipynb')
2. Clone the repo into a designated directory (suppose it is cloned into the Documents directory):
Path:
- Windows: C:\Users\name\Documents\
- macOS: /Users/name/Documents/
- Linux: /home/name/Documents/
git clone https://github.com/arunp77/Machine-Learning.git
Go to the cloned folder using ‘cd Documents/Machine-Learning/ML-Fundamental/’ and then open the file ‘0.2-Regression.ipynb’ in a code editor such as VS Code, or launch Jupyter Notebook from Anaconda and load ‘0.2-Regression.ipynb’.
NOTE: In my next blog post, I will write about linear regression (both simple and multiple linear regression) and a small project.
References
- My GitHub repository (Jupyter notebook): https://github.com/arunp77/Machine-Learning/blob/main/ML-Fundamental/0.2-Regression.ipynb (open it in Google Colab, or download it and open it locally in Jupyter Notebook)
- All machine learning code is available in my GitHub repository: https://github.com/arunp77/Machine-Learning
- For the regression models (simple linear, multiple linear, polynomial, ridge, lasso, and elastic net regression), see: https://github.com/arunp77/Machine-Learning/tree/main/Projects-ML/Reg-models