Exploratory Data Analysis (EDA) is an approach for data analysis that employs a variety of techniques (mostly graphical) to
- maximize insight into a data set;
- uncover underlying structure;
- extract important variables;
- detect outliers and anomalies;
- test underlying assumptions;
- develop parsimonious models; and
- determine optimal factor settings.
For a machine learning engineer, Exploratory Data Analysis is one of the first steps in any machine learning project.
Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
Types of exploratory data analysis
There are four primary types of EDA:
- Univariate non-graphical: This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of the univariate analysis is to describe the data and find patterns that exist within it.
- Univariate graphical: Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. Common types of univariate graphics include:
- Stem-and-leaf plots, which show all data values and the shape of the distribution.
- Histograms, which are bar plots where each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
- Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
- Multivariate non-graphical: Multivariate data involves more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
- Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot (bar chart), with each group representing one level of one variable and each bar within a group representing a level of the other variable.
Other common types of multivariate graphics include:
- Scatter plot, which plots data points on a horizontal and a vertical axis to show how two variables relate to each other.
- Multivariate chart, which is a graphical representation of the relationships between factors and response.
- Run chart, which is a line graph of data plotted over time.
- Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
- Heat map, which is a graphical representation of data where values are depicted by color.
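To make the multivariate cases above concrete, here is a minimal sketch of a cross-tabulation (non-graphical) followed by a grouped bar plot (graphical). The column names and values are made up for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Made-up sample; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "smoking_status": ["smokes", "never smoked", "smokes",
                       "never smoked", "never smoked", "smokes"],
})

# Multivariate non-graphical: cross-tabulate two categorical variables.
counts = pd.crosstab(df["gender"], df["smoking_status"])
print(counts)

# Multivariate graphical: one group per gender, one bar per smoking status.
ax = counts.plot(kind="bar")
ax.set_ylabel("count")
# In a script, call matplotlib.pyplot.show() to display the figure.
```

Each row of the cross-tabulation becomes one group of bars, so the table and the plot carry the same information in two forms.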
EDA explained using the stroke-prediction-dataset from Kaggle
Let’s work through an example dataset to see how EDA applies in practice. I have taken the stroke-prediction-dataset, which is available on Kaggle. This dataset is used to predict whether a patient is likely to get a stroke based on input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.
The Python libraries used for EDA in this walkthrough are pandas, Seaborn, and matplotlib.
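A typical import block for this kind of analysis might look like the sketch below. The aliases (pd, plt, sns) are the common community conventions, not something the dataset requires.

```python
import pandas as pd               # loading and summarizing tabular data
import matplotlib.pyplot as plt   # base plotting library
import seaborn as sns             # statistical plots built on top of matplotlib

# Printing versions helps when comparing results across environments.
print("pandas", pd.__version__)
print("seaborn", sns.__version__)
```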
Step 1: Load the dataset
The first step in any data science project is to import the data. Often, you’ll work with data in Comma Separated Value (CSV) files and run into problems at the very start of the workflow.
We use the read_csv() function from pandas to load the CSV file. Loading CSV files with pandas has become standard practice for working data scientists today.
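As a minimal sketch of this step, the snippet below reads a tiny in-memory CSV so it runs without downloading anything. The rows are made up and the column names are my assumption of what the Kaggle file roughly looks like; with the real download you would pass the file path to read_csv() instead.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV: made-up rows with assumed column names.
csv_text = """id,gender,age,hypertension,avg_glucose_level,bmi,smoking_status,stroke
1,Male,67,0,228.69,36.6,formerly smoked,1
2,Female,61,0,202.21,,never smoked,1
3,Male,80,1,105.92,32.5,never smoked,0
4,Female,49,0,171.23,34.4,smokes,0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
# With the real file this would be something like:
# df = pd.read_csv("path/to/stroke-data.csv")  # hypothetical path

print(df.head())
```

Note that the empty bmi field in the second row is read as NaN, which is one of the early problems worth checking for.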
Step 2: Basic Information about the data
The shape attribute returns a tuple representing the dimensionality of the DataFrame.
The columns attribute holds the column labels of the DataFrame.
The info() method prints information about a DataFrame, including the index dtype and columns, non-null values, and memory usage.
The describe() method generates descriptive statistics. These include statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
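The four inspection calls described in this step can be sketched as follows. The DataFrame is a small made-up sample with assumed column names, used only so the snippet is self-contained.

```python
import pandas as pd

# Made-up sample with one missing bmi value; column names are assumptions.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "age": [67, 61, 80, 49],
    "bmi": [36.6, None, 32.5, 34.4],
    "stroke": [1, 1, 0, 0],
})

print(df.shape)           # (number of rows, number of columns)
print(list(df.columns))   # column labels
df.info()                 # dtypes, non-null counts, memory usage
summary = df.describe()   # numeric columns only by default; NaN is excluded
print(summary)
```

The count row of describe() is a quick way to spot missing values: here bmi reports 3 while the other numeric columns report 4.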
Now we have a good glimpse of the data. Let’s now explore it with graphs. Python has a visualization library, Seaborn, which is built on top of matplotlib. It provides attractive statistical graphs for both univariate and multivariate analysis. Let's start by plotting graphs and understanding the distribution of the data.
A distribution in statistics is a function that shows the possible values for a variable and how often they occur. The distribution of an event consists not only of the input values that can be observed but is made up of all possible values.
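One simple way to see an empirical distribution is to bin a variable and count how often values fall into each bin, which is exactly what a histogram draws. The ages and bin edges below are made up for illustration.

```python
import pandas as pd

# Made-up ages; the bin edges are an arbitrary choice for this sketch.
ages = pd.Series([45, 52, 52, 61, 61, 61, 70, 70, 79, 80], name="age")
bins = [40, 50, 60, 70, 80, 90]

# Frequency per bin: intervals are closed on the left, e.g. [40, 50).
binned = pd.cut(ages, bins=bins, right=False)
freq = binned.value_counts().sort_index()
prop = freq / freq.sum()   # proportions sum to 1
print(pd.DataFrame({"count": freq, "proportion": prop}))

# The same bins drawn as a histogram.
ax = ages.plot.hist(bins=bins, edgecolor="black")
ax.set_xlabel("age")
```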
There are different ways to understand data. The most commonly used methods for plotting the data are as follows:
- Bar Graphs
- Line Graphs
- Pie Graphs
- Correlation Matrix
- Pair Plot
- Scatter Plot
- Box Plots
- Multivariate Analysis
Let's look at one example for each type using Python.
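Starting with the simplest chart types in the list, here is a rough sketch of a bar graph and a pie graph for one categorical column. The smoking_status values are made up, and I am using the plotting methods built into pandas rather than any particular approach tied to the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up categorical values for illustration.
smoking = pd.Series(
    ["never smoked", "smokes", "never smoked",
     "formerly smoked", "never smoked", "smokes"],
    name="smoking_status",
)
counts = smoking.value_counts()   # frequency of each category

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(9, 4))
counts.plot.bar(ax=ax_bar, title="Bar graph")
counts.plot.pie(ax=ax_pie, autopct="%1.0f%%", title="Pie graph")
ax_pie.set_ylabel("")             # drop the default axis label on the pie
fig.tight_layout()
# In a script, call plt.show() to display the figure.
```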
The correlation function uses the Pearson correlation coefficient, which yields a number between -1 and 1. A strong negative relationship is indicated by a coefficient close to -1, and a strong positive relationship is indicated by a coefficient close to 1.
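A correlation matrix can be sketched with DataFrame.corr() and drawn as a Seaborn heat map. The data is made up, and risk_score is a hypothetical column that I constructed to fall linearly as age rises, so its correlation with age is exactly -1.

```python
import pandas as pd
import seaborn as sns

# Made-up numeric data; risk_score is a hypothetical column for illustration.
df = pd.DataFrame({
    "age": [30, 40, 50, 60, 70],
    "avg_glucose_level": [80.0, 95.0, 90.0, 120.0, 130.0],
    "risk_score": [7, 6, 5, 4, 3],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))

# Color encodes the coefficient; fixing the range to [-1, 1] keeps the
# color scale comparable between datasets.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

The diagonal is always 1 because every variable is perfectly correlated with itself.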
You can understand the relationships between attributes by looking at how each pair of attributes is distributed together. A pair plot uses a built-in function to create a matrix of scatter plots of all attributes against all attributes.
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary, i.e., minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It can also reveal outliers.
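The numbers behind a box plot can be computed directly. The sketch below uses made-up glucose values and flags outliers with the common 1.5 × IQR rule, which is also the default whisker rule in matplotlib; other conventions exist.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values with one obvious extreme reading.
glucose = pd.Series([70, 80, 85, 90, 95, 100, 105, 110, 120, 270],
                    name="avg_glucose_level")

# Five-number summary: min, Q1, median, Q3, max.
five_num = glucose.quantile([0, 0.25, 0.5, 0.75, 1.0])
q1, q3 = five_num.loc[0.25], five_num.loc[0.75]
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = glucose[(glucose < lower) | (glucose > upper)]
print(five_num)
print("outliers:", outliers.tolist())

fig, ax = plt.subplots()
ax.boxplot(glucose.to_numpy())
ax.set_ylabel(glucose.name)
```

With this rule the whiskers stop at the most extreme points inside the fences, so the reading of 270 is drawn as a separate point rather than as the maximum.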
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.
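A scatter plot of two numeric columns can be sketched with the plotting method built into pandas. The values and column names below are made up for illustration.

```python
import pandas as pd

# Made-up numeric data; each row becomes one dot.
df = pd.DataFrame({
    "age": [30, 42, 55, 61, 67, 73, 80],
    "avg_glucose_level": [82.1, 95.4, 110.0, 105.3, 150.2, 171.0, 140.8],
})

# x position comes from age, y position from avg_glucose_level.
ax = df.plot.scatter(x="age", y="avg_glucose_level")
ax.set_title("Glucose level vs. age")
```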
Multivariate methods look at two or more variables at a time to explore relationships. Usually, our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables.
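As a rough non-graphical illustration of the difference, the sketch below first summarizes one numeric variable by one categorical variable (bivariate), then by two categorical variables at once (three variables). The data and column names are made up.

```python
import pandas as pd

# Made-up sample with assumed column names.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "hypertension": [0, 1, 0, 1, 1, 0],
    "avg_glucose_level": [90.0, 140.0, 85.0, 150.0, 160.0, 95.0],
})

# Bivariate: mean glucose level for each hypertension group.
bivariate = df.groupby("hypertension")["avg_glucose_level"].mean()
print(bivariate)

# Three variables: mean glucose level by gender and hypertension.
trivariate = df.pivot_table(values="avg_glucose_level", index="gender",
                            columns="hypertension", aggfunc="mean")
print(trivariate)
```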
This was just a small introduction to Exploratory Data Analysis. After reading this I’d recommend you pick up any dataset of your choice and perform EDA all by yourself.
Thanks a lot if you’ve made it this far. If you have any suggestions to make this blog better, please do mention them in the comments.