Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to data analysis that employs a variety of techniques (mostly graphical) to

  1. uncover underlying structure;
  2. extract important variables;
  3. detect outliers and anomalies;
  4. test underlying assumptions;
  5. develop parsimonious models; and
  6. determine optimal factor settings.

Types of exploratory data analysis

EDA is commonly divided into four types: univariate non-graphical, univariate graphical, multivariate non-graphical, and multivariate graphical. The two graphical types are described here:

  • Univariate graphical: Non-graphical summaries alone don’t provide a full picture of a single variable, so graphical methods are used alongside them. Common types of univariate graphics include:
      1. Histograms, bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
      2. Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
  • Multivariate graphical: Multivariate graphics display relationships between two or more variables. The most commonly used is a grouped bar plot (bar chart), with each group representing one level of one variable and each bar within a group representing a level of the other variable. Other common types of multivariate graphics include:
      1. Multivariate chart, which is a graphical representation of the relationships between factors and a response.
      2. Run chart, which is a line graph of data plotted over time.
      3. Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
      4. Heat map, which is a graphical representation of data where values are depicted by color.
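
To make the two univariate graphics concrete, here is a minimal sketch of the numbers behind them: the per-bin counts a histogram draws and the five-number summary a box plot draws. The ages are made up for illustration, and the helper names `histogram_counts` and `five_number_summary` are hypothetical, not part of any library.

```python
import statistics
from collections import Counter

# Hypothetical ages, made up for illustration only.
ages = [23, 35, 41, 47, 52, 58, 61, 64, 67, 70, 74, 79, 82]


def histogram_counts(values, bin_width):
    """Count how many values fall in each [start, start + bin_width) bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


bins = histogram_counts(ages, bin_width=20)
summary = five_number_summary(ages)

# A text histogram: one '#' per value in the bin.
for start, count in bins.items():
    print(f"{start:>3}-{start + 19:<3} {'#' * count}")
print(summary)
```

Quartile conventions differ between tools, so the Q1 and Q3 values from a plotting library may not match `statistics.quantiles` exactly on small samples.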

EDA explained using the stroke-prediction-dataset from Kaggle

Let’s work through an example dataset to see how EDA is done in practice. I use the stroke-prediction-dataset, which is available on Kaggle. This dataset is used to predict whether a patient is likely to get a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.

Libraries Used
Loading Dataset
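
A minimal sketch of the loading step with pandas follows. The column layout and the file name `healthcare-dataset-stroke-data.csv` are assumptions about the Kaggle download, and the rows are made up so that the snippet runs without the real file.

```python
import io

import pandas as pd

# Made-up rows in the assumed column layout; not real patient data.
SAMPLE_CSV = """id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
1,Male,54,0,1,Yes,Private,Urban,140.5,29.1,formerly smoked,1
2,Female,62,0,0,Yes,Self-employed,Rural,181.3,,never smoked,1
3,Female,45,0,0,Yes,Private,Urban,99.8,31.0,smokes,0
4,Male,7,0,0,No,children,Rural,88.2,17.5,Unknown,0
"""

# With the real file, pass its path instead of the StringIO object, e.g.
# pd.read_csv("healthcare-dataset-stroke-data.csv")  (assumed file name).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())
```

Empty fields are read as `NaN`, which is why the second row shows a missing `bmi`.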
shape and columns of the dataset
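
A first look at size and structure might be sketched as below. The small DataFrame is a made-up stand-in for the stroke data, and its column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the stroke data; values are made up for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [54, 62, 45, 7],
    "bmi": [29.1, None, 31.0, 17.5],
    "stroke": [1, 1, 0, 0],
})

n_rows, n_cols = df.shape        # (number of rows, number of columns)
column_names = list(df.columns)  # column labels in their stored order
missing = df.isna().sum()        # count of missing values per column

print(f"{n_rows} rows x {n_cols} columns")
print(column_names)
print(df.dtypes)
print(missing)
```

Checking `dtypes` and missing counts at this stage shows which columns are categorical and which need cleaning before plotting.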
Data Description
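
Summary statistics are the non-graphical side of EDA. A minimal sketch, again on made-up values with assumed column names:

```python
import pandas as pd

# Toy stand-in for the stroke data; values are made up for illustration.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [20, 40, 60, 80],
    "avg_glucose_level": [80.0, 100.0, 120.0, 220.0],
    "stroke": [0, 0, 0, 1],
})

# count, mean, std, min, quartiles, and max for every numeric column.
numeric_summary = df.describe()

# Frequency of each category, and the share of each outcome class.
gender_counts = df["gender"].value_counts()
stroke_share = df["stroke"].value_counts(normalize=True)

print(numeric_summary)
print(gender_counts)
print(stroke_share)
```

The outcome share is worth checking early: a heavily imbalanced target affects how later plots and models should be read.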

The following plots are used in this analysis:

  1. Line Graphs
  2. Pie Graphs
  3. Correlation Matrix
  4. Pair Plot
  5. Scatter Plot
  6. Box Plots
  7. Multivariate Analysis

Age Distribution using Bar Graph
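
One way to draw an age distribution as a bar graph is to group ages into bands and plot the count per band. This is a sketch with matplotlib on made-up ages; the 20-year band width and the output file name are arbitrary choices, not taken from the original analysis.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages, made up for illustration only.
ages = pd.Series([3, 14, 25, 31, 38, 45, 49, 52, 58, 61, 67, 72, 79, 81], name="age")

# Group ages into 20-year bands; right=False makes each band [start, end).
bands = pd.cut(ages, bins=[0, 20, 40, 60, 80, 100], right=False)
band_counts = bands.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar([str(band) for band in band_counts.index], band_counts.values)
ax.set_xlabel("Age band (years)")
ax.set_ylabel("Number of patients")
ax.set_title("Age distribution")
fig.savefig("age_distribution.png")

print(band_counts)
```

Narrower bands show more detail but become noisy on small samples, so it helps to try a few widths.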
Gender Distribution using Pie Chart
Correlation Matrix
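
A correlation matrix holds the pairwise Pearson correlation of the numeric columns. The sketch below computes it with pandas on made-up values; the column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the stroke data; values are made up for illustration.
df = pd.DataFrame({
    "age": [30, 40, 50, 60, 70],
    "avg_glucose_level": [85.0, 90.0, 110.0, 130.0, 160.0],
    "hypertension": [0, 0, 0, 1, 1],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes("number").corr()

print(corr.round(2))
```

The resulting table can be passed to a heat map function such as seaborn's `heatmap` to colour each cell by its value. Text columns like `gender` are left out here and would need encoding before they can be included.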
Pair plot Graph
Box Plots
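
Box plots are most useful side by side, for example age split by outcome. This is a sketch with matplotlib on made-up values; the `age` and `stroke` column names and the output file name are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the stroke data; values are made up for illustration.
df = pd.DataFrame({
    "age": [22, 35, 41, 48, 55, 60, 58, 66, 71, 79],
    "stroke": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# One array of ages per outcome class, in a fixed label order.
labels = sorted(df["stroke"].unique())
groups = [df.loc[df["stroke"] == label, "age"].to_numpy() for label in labels]
medians = df.groupby("stroke")["age"].median()

fig, ax = plt.subplots()
ax.boxplot(groups)  # one box per group, drawn at x = 1, 2, ...
ax.set_xticklabels([f"stroke = {label}" for label in labels])
ax.set_ylabel("Age (years)")
ax.set_title("Age by stroke outcome")
fig.savefig("age_by_stroke_boxplot.png")

print(medians)
```

Comparing the median lines and box heights between groups gives a quick read on whether a variable separates the two outcomes.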
Scatter Plots
Multivariate Analysis

