Principal component analysis (PCA)

Dr. Arun Kumar Pandey (Ph.D.)
7 min read · Feb 1, 2024



Introduction

Imagine you’re working on a big data project, and the dataset contains numerous features. As you begin your analysis, you may find that many of the features are correlated, leaving you uncertain about which ones to use. Running a model, such as a regression, on the entire dataset may then result in poor accuracy, leaving you in a challenging position.

In response, you might consider strategic methods for identifying important variables. Techniques like Ridge, Lasso, and Elastic Net, known as regularization methods, help prevent overfitting in machine learning models by adding penalty terms to the loss function, and they operate on the existing features. Principal Component Analysis (PCA) takes a different approach: it is a dimensionality reduction technique that transforms the original features into a new set of uncorrelated features called principal components. PCA proves beneficial when dealing with a large number of features, providing a way to reduce them while retaining most of the variability in the data.

In summary, while Ridge, Lasso, and Elastic Net focus on regularization, PCA focuses on reducing dimensionality, which is beneficial in situations with a high number of features or multicollinearity. Statistical techniques such as factor analysis and PCA help overcome the difficulty of choosing important features (see Reference-4 for a book-length treatment).

Defining PCA

Principal component analysis (PCA) is a statistical procedure that is commonly used to reduce the dimensionality of large data sets. It does this by transforming the data into a new coordinate system where the new variables are linear combinations of the original variables. The new variables are chosen so that they capture the most variance in the data.
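As a quick illustration of this definition, here is a minimal sketch on random toy data (not any particular dataset): the fitted components are simply coefficient vectors for linear combinations of the original variables, ordered by how much variance they explain.

# Minimal sketch of the definition, using random toy data
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))            # 300 samples, 3 original variables

pca = PCA().fit(X)
print(pca.components_)                   # each row = coefficients of one linear combination
print(pca.explained_variance_ratio_)     # variance captured by each component, in decreasing order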

Why is PCA useful?

PCA is useful for several reasons:

  • Dimensionality reduction: PCA can be used to reduce the dimensionality of large data sets, which can make them easier to analyze and visualize.
  • Data visualization: PCA can be used to visualize high-dimensional data in a way that is easy to understand.
  • Linear Transformation: PCA performs a linear transformation of data, seeking directions of maximum variance.
  • Feature extraction: PCA can be used to extract the most important features from a data set. This can be useful for tasks like classification and clustering by reducing noise and highlighting underlying structures. Principal components are ranked by the variance they explain, allowing for effective feature selection.
  • Data Compression: PCA can compress data while preserving most of the original information.

Applications of PCA

PCA has a wide range of applications in various fields, including:

  • Machine learning: PCA is a common preprocessing step in machine learning algorithms. It can be used to reduce the dimensionality of training data, which can improve the performance of the algorithm.
  • Image analysis: PCA can be used to analyze and compress images. For example, it can be used to reduce the number of pixels in an image without losing much information.
  • Finance: PCA can be used to analyze financial data, such as stock prices and returns. It can be used to identify patterns in the data and to make predictions about future prices.
  • Chemistry: PCA can be used to analyze chemical data, such as spectra and molecular structures. It can be used to identify new compounds and to understand the relationships between different compounds.

Limitations of PCA

PCA is a powerful tool, but it also has some limitations:

  • PCA assumes the data is approximately Gaussian (normally distributed) and that its structure is linear. If the data departs strongly from these assumptions, PCA may not capture its most important features.
  • PCA is not scale-invariant: its results are sensitive to the scale of the variables. If the variables are measured on different scales, the components will be dominated by the variables with the largest variance, so the data should be standardized first (see the sketch below).
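To see the scale sensitivity concretely, here is a minimal sketch with made-up two-feature data, comparing PCA on raw and standardized inputs. Standardizing with StandardScaler keeps the large-scale feature from dominating the components.

# Minimal sketch: the same data gives very different components before/after scaling
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 200),       # feature on a small scale
    rng.normal(0, 1000, 200),    # feature on a much larger scale
])

pca_raw = PCA().fit(X)
pca_std = PCA().fit(StandardScaler().fit_transform(X))

print("Raw data EVR:     ", pca_raw.explained_variance_ratio_)   # dominated by the large-scale feature
print("Standardized EVR: ", pca_std.explained_variance_ratio_)   # roughly balanced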

Practical example

Let’s consider a scenario: you have a dataset of dimension n = 1000 (rows) × p = 40 (columns). These 40 columns represent candidate features, but not all of them are equally useful. There are p×(p−1)/2 = 780 pairwise scatter plots you could generate to look for relationships between the features, so inspecting them all is practically impossible. In this case, a correlation matrix of the features gives you a clearer picture of which ones matter for your model.

One possibility is to select a subset of the features that captures most of the variance. This can be done by looking at the explained variance ratio (EVR) of each principal component. After calculating the EVR, you can plot the cumulative sum of the explained variance alongside the scree plot of the individual ratios; this visualizes how much variance in the data is retained as you include more principal components.

Next, you can set a threshold for the cumulative explained variance that you find acceptable. For example, you might decide to retain 95% of the variance. The number of principal components at which the cumulative curve crosses this threshold tells you how many components to keep.
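As a sketch of this threshold idea (using random placeholder data standing in for the n = 1000 × p = 40 dataset, so the actual counts will differ on real data), you can find the first component count whose cumulative EVR reaches 95%, or let scikit-learn pick it directly:

# Threshold-based selection of the number of components (random placeholder data)
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(1000, 40)                          # stand-in for your 1000 x 40 dataset
pca = PCA().fit(X)

cumulative_evr = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative_evr >= 0.95)) + 1    # first count reaching 95%
print("Components needed for 95% of the variance:", n_keep)

# Equivalently, let scikit-learn choose the count from the threshold:
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_)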

Once you determine the number of principal components to retain, you can use them to transform your original dataset into a reduced-dimensional space. This new dataset contains only the selected principal components, effectively reducing the number of features while retaining most of the information.

It’s important to note that PCA assumes that the features are centered (have a mean of zero) and have similar scales. Therefore, it’s a good practice to standardize or normalize the data before applying PCA.
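Putting the last two points together, a minimal end-to-end sketch (again on placeholder data) standardizes the features and then projects them onto the components that retain 95% of the variance:

# Standardize, then project onto the retained components (placeholder data)
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.randn(1000, 40)

pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)     # centered, scaled, then projected

print(X.shape, "->", X_reduced.shape)     # original vs. reduced dimensionality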

In summary, PCA is a powerful tool for dimensionality reduction, particularly when dealing with a large number of features. It helps in identifying and retaining the most important information, making your data more manageable for further analysis or model training.

How PCA works

  • PCA is based on the idea that many real-world data sets are high-dimensional, but that most of the information in the data is contained in a relatively small number of dimensions. This means that we can often reduce the dimensionality of the data without losing much information.
  • To do this, PCA first calculates the covariance matrix of the data. The covariance matrix is a square matrix that shows how each pair of variables is correlated with each other.
  • Next, PCA calculates the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors are the directions in which the data varies the most, and the eigenvalues are the magnitudes of the variances along those directions.
  • The principal components are then formed by taking linear combinations of the original variables, where the coefficients are the corresponding eigenvectors. The first principal component is the direction of greatest variance, the second principal component is the direction of second-greatest variance orthogonal to the first, and so on. A NumPy sketch of these steps follows below.
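The list above maps almost line-for-line onto NumPy operations. Here is a minimal sketch of those steps on toy data: center the data, form the covariance matrix, eigendecompose it, and project onto the top directions of variance.

# Manual PCA on toy data: center, covariance, eigendecomposition, projection
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # 200 samples, 5 features

X_centered = X - X.mean(axis=0)                   # PCA works on centered data
cov = np.cov(X_centered, rowvar=False)            # 5 x 5 covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: covariance matrix is symmetric
order = np.argsort(eigenvalues)[::-1]             # sort by decreasing variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

explained_variance_ratio = eigenvalues / eigenvalues.sum()
scores = X_centered @ eigenvectors[:, :2]         # project onto the first 2 components

print(explained_variance_ratio)
print(scores.shape)                               # (200, 2)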

Process of doing the PCA

Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in data analysis and machine learning. It helps uncover the underlying structure in high-dimensional datasets by transforming the data into a new coordinate system whose axes are aligned with the directions of maximum variance. In practice, the process looks like this:

  • Standardize (or at least center) the data.
  • Compute the covariance matrix of the standardized data.
  • Compute the eigenvalues and eigenvectors of the covariance matrix.
  • Sort the components by eigenvalue and decide how many to keep, e.g. using the cumulative explained variance.
  • Project the data onto the retained components.

Example (multivariable)

Let’s now consider a multicolumn dataset and then do the PCA analysis:

# Step 1: Generate Data and Create a DataFrame
import numpy as np
import pandas as pd
import seaborn as sns

np.random.seed(42)

# Generate random data with 10 variables and some variability
data = np.random.randn(100, 10) * 5  # 100 samples, 10 variables
columns = ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6', 'Var7', 'Var8', 'Var9', 'Var10']
df = pd.DataFrame(data, columns=columns)

# Step 2: Basic Exploratory Data Analysis (EDA)
# You can explore basic statistics, correlations, etc.
print("Basic Statistics:")
print(df.describe())

Output: the summary-statistics table from df.describe() (not shown here).

# Step 3: PCA analysis (explained variance)
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Fit PCA on the DataFrame (all 10 components are retained by default)
pca = PCA()
pca.fit(df)

cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)
individual_variance = pca.explained_variance_ratio_

print("Cumulative Explained Variance:")
print(cumulative_explained_variance)

print("\nIndividual Variance Ratio:")
print(individual_variance)

Output:

Cumulative Explained Variance:
[0.15598434 0.28624759 0.40740977 0.51957787 0.62005566 0.71457744
0.80326639 0.88455306 0.95248047 1. ]

Individual Variance Ratio:
[0.15598434 0.13026324 0.12116218 0.1121681 0.10047779 0.09452177
0.08868896 0.08128667 0.06792741 0.04751953]

The following code plots the individual and cumulative explained variance:

# Convert cumulative explained variance to percentage
cumulative_explained_variance_percentage = cumulative_explained_variance * 100

# Plot the explained variance ratio with percentage values
plt.figure(figsize=(10, 5))

# Individual explained variance
plt.bar(range(1, 11), individual_variance * 100, alpha=0.7, align='center', label='Individual Variance Ratio')

# Cumulative explained variance with percentage values
plt.plot(range(1, 11), cumulative_explained_variance_percentage, marker='o', linestyle='-', color='r', label='Cumulative Explained Variance Ratio')

# Add percentage values to the plot
for i, percentage in enumerate(cumulative_explained_variance_percentage):
    plt.text(i + 1, percentage + 1, f'{percentage:.2f}%', ha='center', va='bottom', fontsize=8, color='blue')

plt.title('Explained Variance by Principal Components')
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained (%)')
plt.legend()
plt.grid(True)
plt.show()

Description:

We can see that around 80% of the variance is explained by the first 7 principal components. The individual variance ratio for each component indicates its contribution to the overall variance.
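If roughly 80% retained variance is acceptable for your use case, a short follow-on sketch (reusing df and the PCA import from the example above) projects the 10-column dataset onto those first 7 components:

# Project the example data onto the first 7 principal components
df_reduced = PCA(n_components=7).fit_transform(df)
print(df_reduced.shape)    # (100, 7)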

For more examples and descriptions, please check my portfolio: https://arunp77.github.io/pca-analysis.html
