Unveiling Insights: Python Data Analysis of E-Commerce Sales Dataset

May 17, 2023

Introduction: In today’s digital age, E-commerce has revolutionized the way we shop and conduct business. With the rise of online platforms, enormous amounts of data are generated daily. This treasure trove of information holds valuable insights that can drive strategic decision-making. In this blog, we delve into a Python data analysis of an E-commerce sales dataset, unearthing meaningful patterns and trends that can inform business strategies and optimize performance.

Understanding the Dataset: To begin our analysis, let’s gain some insights into the E-commerce sales dataset we’ll be working with. The dataset comprises various attributes, including customer information, product details, transactional data, and sales records. With Python as our tool of choice, we can leverage powerful libraries like Pandas and NumPy to explore, clean, and analyze this data efficiently.

Data Preprocessing: Before diving into the analysis, it’s crucial to preprocess the dataset. This step involves handling missing values, removing duplicates, and ensuring consistency in data formats. By employing Python’s Pandas library, we can perform these tasks seamlessly, ensuring the integrity of our analysis.
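
The preprocessing steps above might look roughly like the following sketch. The small `raw` frame is made-up sample data that borrows a few column names from the Amazon sales frame listed later in this post (`Order ID`, `Date`, `Qty`, `Amount`, `Sales Channel `, `Unnamed: 22`). The helper name `preprocess`, the `%m-%d-%y` date format, and the choice to drop rows without an amount are assumptions for illustration, not necessarily what the real dataset needs.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names mirror the real dataset.
raw = pd.DataFrame({
    "Order ID": ["A1", "A1", "A2", "A3"],
    "Date": ["04-30-22", "04-30-22", "05-01-22", "05-02-22"],
    "Qty": [1, 1, 2, 1],
    "Amount": [399.0, 399.0, np.nan, 650.0],
    "Sales Channel ": ["Amazon.in"] * 4,  # note the trailing space in the name
    "Unnamed: 22": [np.nan] * 4,          # empty leftover column
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df (illustrative steps only)."""
    out = df.copy()
    # Consistent column names: strip stray whitespace
    out.columns = out.columns.str.strip()
    # Drop columns that contain no data at all
    out = out.dropna(axis="columns", how="all")
    # Remove exact duplicate rows
    out = out.drop_duplicates()
    # Consistent formats: parse dates; unparseable values become NaT
    out["Date"] = pd.to_datetime(out["Date"], format="%m-%d-%y", errors="coerce")
    # Handle missing values: here we simply drop rows without an amount
    out = out.dropna(subset=["Amount"])
    return out.reset_index(drop=True)


clean = preprocess(raw)
print(clean)
```

Whether missing amounts should be dropped, filled, or kept depends on the question being asked, so that step in particular is worth revisiting for a real analysis.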

Exploratory Data Analysis (EDA): With the dataset prepared, we can now unleash the power of Python to uncover fascinating insights through EDA. Let’s explore a few key aspects:

Sales Performance Analysis: By aggregating sales data, we can determine the top-selling products, identify peak sales periods, and evaluate the performance of different product categories. Python’s data visualization libraries such as Matplotlib and Seaborn enable us to create informative charts and graphs to better understand sales patterns.
Customer Segmentation: Using techniques like clustering and RFM (Recency, Frequency, Monetary) analysis, we can segment customers based on their purchasing behavior. This allows us to identify high-value customers, understand their preferences, and tailor marketing strategies accordingly.
Geographic Analysis: Analyzing sales data by geographical regions can reveal lucrative markets and highlight areas for potential expansion. Python’s geospatial libraries, such as GeoPandas and Folium, can help create interactive maps to visualize sales patterns geographically.
Seasonal Trends and Forecasting: We can identify seasonal trends and patterns by examining historical sales data. Python offers tools such as the Prophet library and the ARIMA models in statsmodels for time series analysis and sales forecasting, helping businesses plan inventory, marketing campaigns, and resource allocation more effectively.
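
As one concrete illustration of the segmentation idea above, here is a minimal RFM sketch in Pandas. The rows are made up, and `customer_id` is a hypothetical column: RFM needs some customer identifier, which the column listings in this post do not show. The `snapshot` reference date (one day after the last order) is also just one possible convention.

```python
import pandas as pd

# Made-up orders. `customer_id` is a hypothetical column added for illustration.
orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3", "c3", "c3"],
    "Date": pd.to_datetime([
        "2022-04-01", "2022-06-20", "2022-04-15",
        "2022-05-05", "2022-06-01", "2022-06-28",
    ]),
    "Amount": [500.0, 700.0, 300.0, 250.0, 400.0, 900.0],
})

# Reference date: the day after the last order in the data
snapshot = orders["Date"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("customer_id").agg(
    recency=("Date", lambda d: (snapshot - d.max()).days),  # days since last order
    frequency=("Date", "count"),                             # number of orders
    monetary=("Amount", "sum"),                              # total spend
)

print(rfm.sort_values("monetary", ascending=False))
```

From here, each of the three columns could be binned into scores or passed to a clustering algorithm to form the actual segments.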

Insights and Recommendations: Upon completing our data analysis, we uncover valuable insights that can drive actionable recommendations for the E-commerce business:

Product Optimization: Identify underperforming products and focus on improving their sales through marketing campaigns, product enhancements, or pricing strategies.
Targeted Marketing: Tailor marketing efforts by leveraging customer segmentation insights. Create personalized campaigns to target high-value customers and improve customer retention.
Geographical Expansion: Identify regions with high sales potential and consider expanding operations or targeting marketing efforts in those areas.
Inventory Planning: Utilize sales forecasting models to optimize inventory management, ensuring sufficient stock levels during peak demand periods while minimizing excess inventory costs.

Conclusion: Python’s data analysis capabilities empower businesses to extract meaningful insights from complex E-commerce sales datasets. By leveraging Python libraries and techniques, we can uncover patterns, segment customers, identify sales trends, and make informed decisions to drive growth and success. The analysis we conducted is just a glimpse into the vast possibilities that data-driven approaches offer in the realm of E-commerce. So, let’s embrace the power of Python and unlock the potential hidden within our data for a competitive edge in the dynamic world of online retail.

Data analysis and visualization

I have two data frames: `df_amz` and `df_sale`.

1. Dataframe-1: df_amz

df_amz.columns

Output: Index(['index', 'Order ID', 'Date', 'Status', 'Fulfilment', 'Sales Channel ',
       'ship-service-level', 'Style', 'SKU', 'Category', 'Size', 'ASIN',
       'Courier Status', 'Qty', 'currency', 'Amount', 'ship-city',
       'ship-state', 'ship-postal-code', 'ship-country', 'promotion-ids',
       'B2B', 'fulfilled-by', 'Unnamed: 22'],
      dtype='object')

2. Dataframe-2: df_sale

df_sale.columns

Output: Index(['SKU', 'Design No.', 'Stock', 'Category', 'Size', 'Color'], dtype='object')

Kinds of visualization

Some common chart types and the situations they suit:

1. Bar Chart: Visualize categorical data by plotting bars of different heights. Useful for comparing data across different categories or groups.

2. Pie Chart: Represent data as slices of a pie, showing the proportion or percentage distribution of different categories.

3. Line Plot: Display the relationship between two continuous variables by plotting data points connected by lines. Suitable for visualizing trends or patterns over time.

4. Scatter Plot: Plot individual data points in a Cartesian coordinate system to show the relationship between two continuous variables. Useful for identifying correlations or clusters in the data.

5. Histogram: Display the distribution of a single numeric variable by dividing the data into bins and showing the frequency or count of observations in each bin.

6. Box Plot: Visualize the distribution of a continuous variable through quartiles, outliers, and other statistical measures. Helps in identifying skewness, outliers, and variability in the data.

7. Heatmap: Display a matrix of data as a grid of colored squares, where the colors represent the values. Useful for visualizing correlations or patterns in a tabular dataset.

8. Area Chart: Plot one or more variables over time as filled areas. When the areas are stacked, the chart shows the contribution of each variable to the total.

9. Violin Plot: Combine the features of a box plot and a kernel density plot to visualize the distribution and density of a variable.

10. Bubble Chart: Represent data points as bubbles on a scatter plot, where the size or color of the bubble represents a third variable.

11. TreeMap: Display hierarchical data as nested rectangles, where the size of each rectangle represents a value.

12. Radar Chart: Display multivariate data on a two-dimensional chart with multiple axes, showing the values of each variable relative to a central point.
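
To make two of these chart types concrete, here is a small Matplotlib sketch that draws a histogram and a box plot of one numeric column. The `sample` frame holds randomly generated amounts rather than real sales data, and the helper name `plot_distribution` is an assumption for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up order amounts, used only to have something to plot.
rng = np.random.default_rng(seed=0)
sample = pd.DataFrame({"Amount": rng.gamma(shape=4.0, scale=150.0, size=500)})


def plot_distribution(df, column="Amount"):
    """Draw a histogram and a box plot of one numeric column side by side."""
    fig, axs = plt.subplots(1, 2, figsize=(12, 5))

    # Histogram: number of observations falling into each bin
    axs[0].hist(df[column].dropna(), bins=30)
    axs[0].set_xlabel(column)
    axs[0].set_ylabel("Count")
    axs[0].set_title(f"Histogram of {column}")

    # Box plot: median, quartiles, and outliers of the same column
    axs[1].boxplot(df[column].dropna())
    axs[1].set_ylabel(column)
    axs[1].set_title(f"Box plot of {column}")

    fig.tight_layout()
    return fig, axs


fig, axs = plot_distribution(sample)
```

Calling `plt.show()` afterwards displays the figure, and `fig.savefig(...)` writes it to a file instead.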

Below, we plot some of these chart types, analyze the sales data, and draw conclusions from the results.

Market view

Compare how many order rows fall into each product category and each size using bar charts (this counts rows; it does not sum Qty or Amount):

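import matplotlib.pyplot as plt

# df_amzcopy is assumed to be a working copy of df_amz (e.g. df_amz.copy())

# Creating subplots with 1 row and 2 columns
fig, axs = plt.subplots(1, 2, figsize=(12, 5))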

# Bar Chart of 'Category' in df_amzcopy
category_counts = df_amzcopy['Category'].value_counts()
axs[0].bar(category_counts.index, category_counts.values)
axs[0].set_xlabel('Category')
axs[0].set_ylabel('Count')
axs[0].set_title('Products sold on Amazon vs Category')
axs[0].tick_params(axis='x', rotation=90)

# Bar Chart of 'Size' in df_amzcopy
size_counts = df_amzcopy['Size'].value_counts()
axs[1].bar(size_counts.index, size_counts.values)
axs[1].set_xlabel('Size')
axs[1].set_ylabel('Count')
axs[1].set_title('Products sold on Amazon vs Size')
axs[1].tick_params(axis='x', rotation=90)

# Adjust the spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()
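
The bar charts above count rows per category and size. To chart units or revenue instead, one option is to sum `Qty` or `Amount` per `Category` before plotting. The sketch below uses made-up rows; only the column names come from `df_amz`, and whether cancelled orders should first be filtered out via `Status` is left open.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; only the column names come from df_amz.
sample = pd.DataFrame({
    "Category": ["Set", "kurta", "Set", "Top", "kurta", "Set"],
    "Qty": [1, 2, 1, 1, 1, 3],
    "Amount": [800.0, 900.0, 750.0, 400.0, 450.0, 2400.0],
})

# Total units and revenue per category, largest revenue first
by_category = (
    sample.groupby("Category")[["Qty", "Amount"]]
    .sum()
    .sort_values("Amount", ascending=False)
)

# Bar chart of total revenue per category
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(by_category.index, by_category["Amount"])
ax.set_xlabel("Category")
ax.set_ylabel("Total Amount")
ax.set_title("Revenue by Category (sample data)")
ax.tick_params(axis="x", rotation=90)
fig.tight_layout()
```

With the real data, replacing `sample` with `df_amzcopy` should give the revenue view, provided `Amount` is numeric and missing values have been handled.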