Pythonic Discoveries: How to Master the Art of Exploratory Data Analysis with Python
A glance at the title of this blog might prompt the question: why should data be explored at all?
“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.” (Jim Barksdale)
The objectivity, accuracy, consistency, and evidence-based nature of data make it preferable to personal opinion whenever a strategic decision is to be made. Embracing a data-driven culture is not optional for businesses seeking sustainable growth and long-term success. By leveraging the power of Exploratory Data Analysis, businesses can harness the potential of their data to drive growth, optimize processes, and provide superior products and services to their customers.
Python, a programming language with powerful libraries such as Pandas, NumPy, and Matplotlib, provides a versatile and efficient environment for conducting Exploratory Data Analysis. This article gives a step-by-step report of how Exploratory Data Analysis in Python was used to uncover valuable insights and hidden trends in a given dataset.
Introduction
The dataset considered for this exploration is historical data from a supermarket company, combining three months of sales data from three of the company's branches. A data dictionary describing the elements of the dataset is given below, and the dataset itself is available at https://www.kaggle.com/aungpyaeap/supermarket-sales
The process in which the Exploratory Data Analysis was carried out is subdivided into 5 tasks highlighted below:
Task 1: Initial Data Exploration
Task 2: Univariate Analysis
Task 3: Bivariate Analysis
Task 4: Dealing with duplicate rows and missing values
Task 5: Correlation Analysis
Initial Data Exploration
Firstly, the different libraries required for the analysis were imported into the Python Jupyter notebook.
These libraries play vital roles in the project. Pandas and NumPy handled data manipulation and numerical computations, while Matplotlib and Seaborn were crucial for creating various types of data visualizations. Calmap specializes in calendar heatmaps, making it useful for time-based data visualization. Together, these libraries provided a powerful toolkit for analysis, visualization, and interpretation, making it easier to gain insights and reach informed decisions from the dataset.
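The import step can be sketched as follows. Calmap is an optional third-party package, so it is guarded here in case it is not installed in a given environment:

```python
import pandas as pd              # data manipulation
import numpy as np               # numerical computation
import matplotlib.pyplot as plt  # base plotting
import seaborn as sns            # statistical visualization

try:
    import calmap                # calendar heatmaps (optional extra)
except ImportError:
    calmap = None                # not essential for the core analysis
```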
Secondly, the flat CSV file was read into a dataframe, the data type of the date column was converted from an object to a datetime, and the date column was set as the index of the dataframe. Setting the date column as the index allows for efficient time-based operations, improved data retrieval, and simplified data manipulation and visualization, making it a preferred approach in many time-series analysis scenarios.
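The read-and-index step can be sketched as follows. Since the full CSV lives on Kaggle, a small inline sample with an assumed column layout stands in for the file here:

```python
import io
import pandas as pd

# Inline stand-in for the Kaggle CSV; the column names are assumed to match
# the supermarket-sales dataset. In practice this would simply be
# pd.read_csv('supermarket_sales.csv').
csv_data = io.StringIO(
    "Date,Branch,Payment,Total,Rating\n"
    "1/5/2019,A,Ewallet,548.97,9.1\n"
    "3/8/2019,C,Cash,80.22,9.6\n"
    "2/3/2019,A,Credit card,340.53,7.4\n"
)

df = pd.read_csv(csv_data)
df['Date'] = pd.to_datetime(df['Date'])  # object -> datetime64[ns]
df = df.set_index('Date')                # enables time-based slicing/resampling

print(df.index.dtype)  # datetime64[ns]
```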
Univariate Analysis
Univariate analysis is particularly useful for gaining an initial understanding of any given dataset and for identifying potential patterns or trends within its individual variables. It is a crucial step in the exploratory data analysis (EDA) process, which aims to uncover insights and relationships in the data before moving on to more complex analyses. Three questions were asked and answered at this phase of the analysis. The questions are:
What does the distribution of the customer rating look like?
Do aggregate sales numbers differ much between branches?
Do the different payment methods vary much in how often they are used?
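The three questions above can be answered numerically along the lines below, using a tiny synthetic frame as a stand-in for the roughly 1,000-row dataset (the column names 'Branch', 'Payment', 'Total', and 'Rating' are assumed to match the Kaggle data):

```python
import pandas as pd

# Tiny synthetic stand-in for the full dataset
df = pd.DataFrame({
    'Branch': ['A', 'B', 'C', 'A', 'B'],
    'Payment': ['Cash', 'Ewallet', 'Cash', 'Credit card', 'Ewallet'],
    'Total': [100.0, 250.0, 80.0, 120.0, 90.0],
    'Rating': [7.0, 8.5, 6.0, 9.1, 7.7],
})

# 1) Distribution of customer ratings
#    (df['Rating'].plot.hist() would draw the histogram)
print(df['Rating'].describe())

# 2) Aggregate sales per branch
branch_sales = df.groupby('Branch')['Total'].sum()
print(branch_sales)

# 3) How often each payment method is used
payment_counts = df['Payment'].value_counts()
print(payment_counts)
```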
Bivariate Analysis
Unlike univariate analysis, which focuses on analysing one variable at a time, bivariate analysis involves studying the interactions and dependencies between two variables simultaneously. By exploring the relationship between two variables, data analysts can gain deeper insights into how they are related and how changes in one variable impact the other. Four questions were asked and answered at this phase of the analysis. The questions are:
Is there a significant relationship between gross income and customer ratings?
Is there a significant relationship between gross income and gender?
Is there a significant relationship between the different branches and their gross income?
Is there a noticeable time trend in gross income?
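The four bivariate questions can be approached roughly as follows. This is a hedged sketch on a small synthetic frame; the column names ('gross income', 'Rating', 'Gender', 'Branch') are assumed to match the Kaggle dataset, and the values are illustrative only:

```python
import pandas as pd

# Synthetic frame with a datetime index, mimicking the dataset's structure
idx = pd.to_datetime(['2019-01-05', '2019-01-12', '2019-02-03', '2019-02-20'])
df = pd.DataFrame({
    'gross income': [26.14, 3.82, 16.22, 30.20],
    'Rating': [9.1, 9.6, 7.4, 8.4],
    'Gender': ['Female', 'Female', 'Male', 'Male'],
    'Branch': ['A', 'C', 'A', 'B'],
}, index=idx)

# Gross income vs. customer rating: Pearson correlation quantifies the link
# (sns.regplot(x='Rating', y='gross income', data=df) would show it visually)
income_rating_corr = df['gross income'].corr(df['Rating'])
print(income_rating_corr)

# Gross income broken down by gender and by branch
income_by_gender = df.groupby('Gender')['gross income'].mean()
income_by_branch = df.groupby('Branch')['gross income'].sum()
print(income_by_gender)
print(income_by_branch)

# Time trend: monthly totals, made easy by the datetime index
monthly_income = df.groupby(df.index.to_period('M'))['gross income'].sum()
print(monthly_income)
```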
Dealing with duplicate rows and missing values
By effectively dealing with duplicate rows and missing values, the quality and reliability of the data can be ensured, paving the way for more reliable and accurate exploratory data analysis. During this task, the duplicated rows in the dataset were removed first.
A heatmap visualizing the missing values was then created with the aid of the Seaborn library.
Two steps were taken to ensure that the missing values in each column of the dataset were handled. Initially, a line of code was written to replace the missing values with the mean value of the respective column.
However, since replacing a missing value with the mean only works for quantitative columns, the missing values in the categorical columns were replaced with the mode of the respective columns.
Once the missing values were replaced, the heatmap was re-visualised. The entirely black plot confirms that all missing values were successfully replaced.
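The whole task can be sketched as below on a synthetic frame containing one fully duplicated row and two missing cells (the assumption being that the real data mixes a numeric column like 'Total' with a categorical one like 'Payment'):

```python
import numpy as np
import pandas as pd

# Synthetic frame: row 1 duplicates row 0; rows 2 and 3 each miss one value
df = pd.DataFrame({
    'Total': [100.0, 100.0, np.nan, 80.0],
    'Payment': ['Cash', 'Cash', 'Ewallet', None],
})

df = df.drop_duplicates()        # remove duplicated rows first
print(df.isnull().sum())         # sns.heatmap(df.isnull()) would visualize this

# Quantitative columns: replace missing values with the column mean
df['Total'] = df['Total'].fillna(df['Total'].mean())

# Categorical columns: replace missing values with the column mode
df['Payment'] = df['Payment'].fillna(df['Payment'].mode()[0])

print(df.isnull().sum().sum())   # 0 -> every missing value handled
```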
Correlation Analysis
Correlation analysis provides valuable insights into the relationships between variables in a dataset. It helps identify potential predictors, understand the strength and direction of associations, and inform further analysis and modelling decisions. For this project, a correlation matrix between all the variables of the table was created. This correlation matrix was also represented on a heat map.
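The correlation step can be sketched as follows; the numeric column names are assumed from the dataset's data dictionary, and the values are illustrative only:

```python
import pandas as pd

# Illustrative numeric subset of the dataset
df = pd.DataFrame({
    'Unit price': [74.69, 15.28, 46.33, 58.22],
    'Quantity': [7, 5, 7, 8],
    'gross income': [26.14, 3.82, 16.22, 23.29],
    'Rating': [9.1, 9.6, 7.4, 8.4],
})

corr = df.corr(numeric_only=True)   # pairwise Pearson correlation matrix
print(corr.round(2))
# sns.heatmap(corr, annot=True, cmap='coolwarm') would render it as a heat map
```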
Conclusion
Mastering exploratory data analysis (EDA) for business is a crucial skill for data-driven decision-making and for gaining actionable insights from business data. Exploratory data analysis involves a systematic approach to exploring and understanding the characteristics of data, uncovering patterns, and identifying trends to make informed business decisions. It forms the foundation for successful data analysis and business intelligence, leading to improved operational efficiency, customer satisfaction, and competitive advantage.
Thank you
For access to the Python code used for this project, please do not hesitate to visit my GitHub page: https://github.com/thebolujames/Exploratory-Data-Analysis-in-Python