Pythonic Discoveries: How to Master the Art of Exploratory Data Analysis with Python
A glance at the title of this blog might prompt the question: why should data be explored at all?
“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.” (Jim Barksdale)
The objectivity, accuracy, consistency, and evidence-based nature of data make it preferable to personal opinion whenever a strategic decision is to be made. Embracing a data-driven culture is not optional for businesses seeking sustainable growth and long-term success. By leveraging the power of Exploratory Data Analysis, businesses can harness the potential of their data to drive growth, optimize processes, and provide superior products and services to their customers.
Python, a programming language with powerful libraries such as Pandas, NumPy, and Matplotlib, provides a versatile and efficient environment for conducting Exploratory Data Analysis. This article gives a step-by-step report of how Exploratory Data Analysis in Python was used to uncover valuable insights and hidden trends in a given dataset.
Introduction
The dataset considered for this exploration is historical data from a supermarket company, combining three months of sales data from three of the company's branches. A data dictionary describing the elements of the dataset is given below, and the dataset itself is available at https://www.kaggle.com/aungpyaeap/supermarket-sales
The process in which the Exploratory Data Analysis was carried out is subdivided into 5 tasks highlighted below:
Task 1: Initial Data Exploration
Task 2: Univariate Analysis
Task 3: Bivariate Analysis
Task 4: Dealing with duplicate rows and missing values
Task 5: Correlation Analysis
Initial Data Exploration
Firstly, the different libraries required for the analysis were imported into the Python Jupyter notebook.
These libraries play vital roles in the project. Pandas and NumPy handled data manipulation and numerical computations, while Matplotlib and Seaborn were crucial for creating various types of data visualizations. Calmap specializes in calendar heatmaps, making it useful for time-based data visualization. Together, these libraries provided a powerful toolkit for analysis, visualization, and interpretation, making it easier to gain insights and reach informed decisions from the dataset.
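The import step can be sketched as follows. Calmap is an optional third-party package, so it is guarded here in case it is not installed in a given environment:

```python
import pandas as pd              # data manipulation
import numpy as np               # numerical computation
import matplotlib.pyplot as plt  # base plotting
import seaborn as sns            # statistical visualization

try:
    import calmap                # calendar heatmaps (optional extra)
except ImportError:
    calmap = None                # not essential for the core analysis
```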
Secondly, the flat CSV file was read into a dataframe, the data type of the date column was converted from an object to a datetime, and the date column was set as the index of the dataframe. Setting the date column as the index allows for efficient time-based operations, improved data retrieval, and simplified data manipulation and visualization, making it a preferred approach in many time-series analysis scenarios.
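The read-and-index step can be sketched as follows. Since the full CSV lives on Kaggle, a small inline sample with an assumed column layout stands in for the file here:

```python
import io
import pandas as pd

# Inline stand-in for the Kaggle CSV; the column names are assumed to match
# the supermarket-sales dataset. In practice this would simply be
# pd.read_csv('supermarket_sales.csv').
csv_data = io.StringIO(
    "Date,Branch,Payment,Total,Rating\n"
    "1/5/2019,A,Ewallet,548.97,9.1\n"
    "3/8/2019,C,Cash,80.22,9.6\n"
    "2/3/2019,A,Credit card,340.53,7.4\n"
)

df = pd.read_csv(csv_data)
df['Date'] = pd.to_datetime(df['Date'])  # object -> datetime64[ns]
df = df.set_index('Date')                # enables time-based slicing/resampling

print(df.index.dtype)  # datetime64[ns]
```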
Univariate Analysis
Univariate analysis is particularly useful for gaining an initial understanding of any given dataset and for identifying potential patterns or trends within its individual variables. It is a crucial step in the exploratory data analysis (EDA) process, which aims to uncover insights and relationships in the data before moving on to more complex analyses. Three questions were asked and answered at this phase of the analysis. The questions are:
What does the distribution of the customer rating look like?
Do aggregate sales numbers differ much between branches?
Do the different payment methods vary much in how often they are used?
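The three questions above can be answered numerically along the lines below, using a tiny synthetic frame as a stand-in for the roughly 1,000-row dataset (the column names 'Branch', 'Payment', 'Total', and 'Rating' are assumed to match the Kaggle data):

```python
import pandas as pd

# Tiny synthetic stand-in for the full dataset
df = pd.DataFrame({
    'Branch': ['A', 'B', 'C', 'A', 'B'],
    'Payment': ['Cash', 'Ewallet', 'Cash', 'Credit card', 'Ewallet'],
    'Total': [100.0, 250.0, 80.0, 120.0, 90.0],
    'Rating': [7.0, 8.5, 6.0, 9.1, 7.7],
})

# 1) Distribution of customer ratings
#    (df['Rating'].plot.hist() would draw the histogram)
print(df['Rating'].describe())

# 2) Aggregate sales per branch
branch_sales = df.groupby('Branch')['Total'].sum()
print(branch_sales)

# 3) How often each payment method is used
payment_counts = df['Payment'].value_counts()
print(payment_counts)
```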
Bivariate Analysis
Unlike univariate analysis, which focuses on analysing one variable at a time, bivariate analysis involves studying the interactions and dependencies between two variables simultaneously. By exploring the relationship between two variables, data analysts can gain deeper insights into how they are related and how changes in one variable impact the other. Four questions were asked and answered at this phase of the analysis. The questions are:
Is there a significant relationship between gross income and customer ratings?
Is there a significant relationship between gross income and gender?
Is there a significant relationship between the different branches and their gross income?
Is there a noticeable time trend in gross income?
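The four bivariate questions can be approached roughly as follows. This is a hedged sketch on a small synthetic frame; the column names ('gross income', 'Rating', 'Gender', 'Branch') are assumed to match the Kaggle dataset, and the values are illustrative only:

```python
import pandas as pd

# Synthetic frame with a datetime index, mimicking the dataset's structure
idx = pd.to_datetime(['2019-01-05', '2019-01-12', '2019-02-03', '2019-02-20'])
df = pd.DataFrame({
    'gross income': [26.14, 3.82, 16.22, 30.20],
    'Rating': [9.1, 9.6, 7.4, 8.4],
    'Gender': ['Female', 'Female', 'Male', 'Male'],
    'Branch': ['A', 'C', 'A', 'B'],
}, index=idx)

# Gross income vs. customer rating: Pearson correlation quantifies the link
# (sns.regplot(x='Rating', y='gross income', data=df) would show it visually)
income_rating_corr = df['gross income'].corr(df['Rating'])
print(income_rating_corr)

# Gross income broken down by gender and by branch
income_by_gender = df.groupby('Gender')['gross income'].mean()
income_by_branch = df.groupby('Branch')['gross income'].sum()
print(income_by_gender)
print(income_by_branch)

# Time trend: monthly totals, made easy by the datetime index
monthly_income = df.groupby(df.index.to_period('M'))['gross income'].sum()
print(monthly_income)
```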
Dealing with duplicate rows and missing values
By effectively dealing with duplicate rows and missing values, the quality and reliability of the data can be ensured, paving the way for more reliable and accurate exploratory data analysis. During this task, the duplicated rows in the dataset were removed first.
A heatmap visualizing the missing values was then created with the aid of the Seaborn library.
Two steps were taken to ensure that the missing values in each column of the dataset were handled. Initially, a line of code was written to replace the missing values with the mean value of the respective column.
However, since replacing a missing value with the mean only works for quantitative columns, the missing values in the categorical columns were replaced with the mode of the respective columns.
Once the missing values were replaced, the heatmap was re-visualised. The entirely black plot confirms that all missing values were successfully replaced.
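The whole task can be sketched as below on a synthetic frame containing one fully duplicated row and two missing cells (the assumption being that the real data mixes a numeric column like 'Total' with a categorical one like 'Payment'):

```python
import numpy as np
import pandas as pd

# Synthetic frame: row 1 duplicates row 0; rows 2 and 3 each miss one value
df = pd.DataFrame({
    'Total': [100.0, 100.0, np.nan, 80.0],
    'Payment': ['Cash', 'Cash', 'Ewallet', None],
})

df = df.drop_duplicates()        # remove duplicated rows first
print(df.isnull().sum())         # sns.heatmap(df.isnull()) would visualize this

# Quantitative columns: replace missing values with the column mean
df['Total'] = df['Total'].fillna(df['Total'].mean())

# Categorical columns: replace missing values with the column mode
df['Payment'] = df['Payment'].fillna(df['Payment'].mode()[0])

print(df.isnull().sum().sum())   # 0 -> every missing value handled
```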
Correlation Analysis
Correlation analysis provides valuable insights into the relationships between variables in a dataset. It helps identify potential predictors, understand the strength and direction of associations, and inform further analysis and modelling decisions. For this project, a correlation matrix between all the variables of the table was created. This correlation matrix was also represented on a heat map.
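The correlation step can be sketched as follows; the numeric column names are assumed from the dataset's data dictionary, and the values are illustrative only:

```python
import pandas as pd

# Illustrative numeric subset of the dataset
df = pd.DataFrame({
    'Unit price': [74.69, 15.28, 46.33, 58.22],
    'Quantity': [7, 5, 7, 8],
    'gross income': [26.14, 3.82, 16.22, 23.29],
    'Rating': [9.1, 9.6, 7.4, 8.4],
})

corr = df.corr(numeric_only=True)   # pairwise Pearson correlation matrix
print(corr.round(2))
# sns.heatmap(corr, annot=True, cmap='coolwarm') would render it as a heat map
```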
Conclusion
Mastering exploratory data analysis (EDA) for business is a crucial skill for data-driven decision-making and for gaining actionable insights from business data. Exploratory data analysis involves a systematic approach to exploring and understanding the characteristics of data, uncovering patterns, and identifying trends to make informed business decisions. It forms the foundation for successful data analysis and business intelligence, leading to improved operational efficiency, customer satisfaction, and competitive advantage.
Thank you
For access to the Python code used for this project, please do not hesitate to visit my GitHub page: https://github.com/thebolujames/Exploratory-Data-Analysis-in-Python