Least Squares Curve Fitting: Original Data, Fitted Curve, and Residuals

Dr. Arun Kumar Pandey (Ph.D.)
Jul 26, 2023

The least squares method is a widely used approach in statistics and numerical analysis to find the best-fitting parameters of a mathematical model to a given set of data points. It is particularly useful when dealing with models that are linear in their parameters. The method aims to minimize the sum of the squared differences between the observed data points and the corresponding values predicted by the model.

Suppose we have a set of data points (xᵢ, yᵢ), where xᵢ is the input variable and yᵢ is the corresponding output variable. We want to find the parameters (β) of a model (f(xᵢ, β)) that best fits the data. The model can be linear or nonlinear in the parameters, but we will focus on the linear case.

For a linear model, the general form is:

f(xᵢ, β) = β0 + β1 * φ1(xᵢ) + β2 * φ2(xᵢ) + … + βn * φn(xᵢ)

where:

  • β0, β1, …, βn are the unknown parameters we want to estimate.
  • φ1(xᵢ), φ2(xᵢ), …, φn(xᵢ) are functions of the input variable xᵢ. These functions can be simple polynomials of xᵢ or any other relevant basis functions.

The goal of the least squares method is to find the values of the parameters (β0, β1, …, βn) that minimize the sum of the squared differences between the observed data points (yᵢ) and the predicted values from the model (f(xᵢ, β)):

minimize: Σ(yᵢ - f(xᵢ, β))²

To find the optimal parameters (β), we define the objective function (also known as the residual sum of squares, RSS):

RSS(β) = Σ(yᵢ - f(xᵢ, β))²

Now, we need to find the values of β that minimize the RSS. This is achieved by taking the partial derivative of the RSS with respect to each parameter βj and setting it to zero. Solving these equations simultaneously gives us the optimal values of β.
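Explicitly, for each parameter βj this condition reads (with φ0(xᵢ) = 1 corresponding to the intercept β0):

∂RSS/∂βj = -2 * Σ φj(xᵢ) * (yᵢ - f(xᵢ, β)) = 0

which gives one equation per parameter, i.e., n + 1 equations in the n + 1 unknowns.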

The normal equations are the set of equations that arise from taking these partial derivatives and setting them to zero. In matrix form, the normal equations for the linear least squares problem can be written as:

X^T * X * β = X^T * y

where:

  • X is the design matrix, whose entry in row i and column j is φj(xᵢ), i.e., the j-th basis function evaluated at the i-th data point.
  • X^T is the transpose of X.
  • y is the vector of observed output variables.
  • β is the vector of unknown parameters we want to estimate.

Solving the normal equations gives us the optimal values for β:

β = (X^T * X)^(-1) * X^T * y

Once we have the optimal values of β, we can use them to make predictions for new data points using the model f(xᵢ, β).
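
To make the matrix formulation concrete, here is a minimal NumPy sketch that fits a quadratic polynomial by solving the normal equations directly (the basis functions, true coefficients, and noise level below are illustrative choices, not values from the discussion above):

import numpy as np

# Illustrative synthetic data following a quadratic trend with noise
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 30)
y = 1.0 + 0.5 * x - 0.3 * x**2 + 0.2 * rng.normal(size=x.size)

# Design matrix X with polynomial basis functions: 1, x, x^2
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal equations: (X^T * X) * beta = X^T * y
# (np.linalg.solve is numerically safer than forming the explicit inverse)
beta = np.linalg.solve(X.T @ X, X.T @ y)
print("Estimated parameters:", beta)

# Residual sum of squares (RSS) of the fitted model
rss = np.sum((y - X @ beta)**2)
print("RSS:", rss)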

It is important to note that the least squares method assumes that the errors in the data are independent, normally distributed, and have constant variance. If these assumptions are violated (e.g., the errors are heteroscedastic or follow a heavy-tailed distribution), alternatives such as weighted least squares or robust regression may be more appropriate. For many practical cases, however, ordinary least squares provides reliable and efficient parameter estimates.
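
For instance, when per-point uncertainties are known, scipy.optimize.curve_fit can perform a weighted fit through its sigma argument. A minimal sketch with an illustrative linear model and made-up noise levels (none of these values come from the examples below):

import numpy as np
from scipy.optimize import curve_fit

def linear_model(x, m, c):
    return m * x + c

# Illustrative data with heteroscedastic noise: scatter grows with x
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
sigma = 0.1 + 0.3 * x  # assumed per-point standard deviations
y = linear_model(x, 1.5, 0.7) + sigma * rng.normal(size=x.size)

# Weighted fit: points with smaller sigma carry more weight
popt_w, pcov_w = curve_fit(linear_model, x, y, sigma=sigma, absolute_sigma=True)
print("Weighted fit (m, c):", popt_w)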

Example: 1

Fitting an exponential function y = a * exp(-k*x), where a and k are unknown.

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# Define the model function
def model_func(x, a, k):
    return a * np.exp(-k * x)

# Generate synthetic data points
np.random.seed(0)
x_data = np.linspace(0, 5, 50)
y_true = 2.5 * np.exp(-1.3 * x_data)
y_noise = 0.2 * np.random.normal(size=x_data.size)
y_data = y_true + y_noise

# Fit the model to the data
popt, pcov = curve_fit(model_func, x_data, y_data)

# Generate points for the fitted curve
x_fit = np.linspace(0, 5, 100)
y_fit = model_func(x_fit, *popt)

# Plot the original data and the fitted curve
plt.scatter(x_data, y_data, label='Data')
plt.plot(x_fit, y_fit, 'r-', label='Fitted Curve')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

Printing the values of a and k:

# Print the values of 'a' and 'k'
a_fit, k_fit = popt
print("Fitted value of 'a':", a_fit)
print("Fitted value of 'k':", k_fit)
Fitted value of 'a': 2.7911924412250135
Fitted value of 'k': 1.3047603469851417
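
Beyond the best-fit values, curve_fit also returns the covariance matrix pcov. The square roots of its diagonal give approximate one-standard-deviation uncertainties for the fitted parameters; a minimal sketch continuing the example above:

# Approximate one-sigma parameter uncertainties from the covariance matrix
perr = np.sqrt(np.diag(pcov))
print("Uncertainty in 'a':", perr[0])
print("Uncertainty in 'k':", perr[1])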

In curve fitting, it is essential to assess the goodness of fit to determine how well the chosen model fits the data. There are several ways to evaluate the quality of the exponential fit to the data using Python. Some common methods include:

  1. Visual Inspection: Plot the original data points along with the fitted curve. Visually inspect whether the fitted curve passes close to the data points and captures the overall trend. This is a qualitative assessment.
  2. R-squared (Coefficient of Determination): R-squared measures the proportion of the variance in the dependent variable (y) that is predictable from the independent variable (x) through the fitted model. A higher R-squared value (closer to 1) indicates a better fit. However, R-squared can be misleading for nonlinear models.
  3. Residual Analysis: Calculate the residuals, which are the differences between the observed data points and the corresponding values predicted by the fitted curve. Plot the residuals to check for any patterns or systematic deviations from zero.
residual_i = y_data[i] - y_pred[i]

The Python code to compute the residuals, plot them, and calculate R-squared is as follows:

# Calculate the residuals
residuals = y_data - model_func(x_data, *popt)

# Plot the residuals
plt.scatter(x_data, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('x')
plt.ylabel('Residuals')
plt.show()

# Calculate R-squared
ss_total = np.sum((y_data - np.mean(y_data))**2)
ss_residual = np.sum(residuals**2)
r_squared = 1 - (ss_residual / ss_total)
print(f"R-squared: {r_squared}")
R-squared: 0.9155784119990455

Example: 2

Another example: y = a * exp(-b*x) + c

def func(x, a, b, c):
    return a * np.exp(-b * x) + c

xdata1 = np.linspace(0, 4, 50)
y = func(xdata1, 2.5, 1.3, 0.5)
rng = np.random.default_rng() # random number generator
y_noise = 0.2 * rng.normal(size=xdata1.size)
ydata1 = y + y_noise
popt, pcov = curve_fit(func, xdata1, ydata1)
plt.plot(xdata1, ydata1, 'b-', label='data')
plt.plot(xdata1, func(xdata1, *popt), 'r-',
         label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.legend()
plt.show()
# Calculate the residuals
residuals1 = ydata1 - func(xdata1, *popt)

# Plot the residuals
plt.scatter(xdata1, residuals1)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('x')
plt.ylabel('Residuals')
plt.show()

# Calculate R-squared
ss_total1 = np.sum((ydata1 - np.mean(ydata1))**2)
ss_residual1 = np.sum(residuals1**2)
r_squared1 = 1 - (ss_residual1 / ss_total1)
print(f"R-squared: {r_squared1}")
# Print the values of 'a', 'b', and 'c'
afit, bfit, cfit = popt
print("Fitted value of 'a':", afit)
print("Fitted value of 'b':", bfit)
print("Fitted value of 'c':", cfit)
Fitted value of 'a': 2.5417360082249534
Fitted value of 'b': 1.1357480435780296
Fitted value of 'c': 0.4488680282275794
