Data analysis with plotly

Chaithanya Kumar
7 min read · Apr 8, 2021

Data science is not just importing libraries and calling a model.fit method.

That part comes much later. Data analysis comes first, because it tells us what we are up against. As Andrew Ng argues in his video Model-centric to Data-centric AI, we should start focussing on data rather than models.

This video is a must watch for beginners and professionals alike.

There are many tools that facilitate data analysis, and Plotly is my favourite among them.

Plotly and its extension Dash are downloaded 5 million times a month.

It is so widely used that there is a community AMI for Plotly on AWS, and the specification for the machine is also listed on the website.

The data used in this article is available in this GitHub repo.

Let us get started with the data analysis. I chose a rather simple dataset to begin with: attrition data. Attrition is a concern for any company and is unavoidable, but predicting who will leave the company (and when) can help plan operations without major delays.

Here is a quick peek at the data we are dealing with. Below are the columns and the data types that pandas inferred while reading the CSV file.

Age                          int64
Attrition                   object
BusinessTravel              object
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64

In this blog I want to focus on the plotting part of the EDA stack, so I will skip explaining some of the pre-processing steps used here. I will still give the outline, and you can refer to the notebook mentioned at the end of the article for the actual code.

1. EDA
  • Check the data types
  • Type conversion
  • Cardinality of data
  • Null value handling
  • Separate numeric and categorical columns
  • Understand the relationship among the variables

2. Model building

3. Deployment

I will concentrate on the "understand the relationship among the variables" part of EDA and try to address the other parts in separate articles.

Univariate Analysis

What is the attrition rate in the company?

import numpy as np
import plotly.graph_objects as go
from plotly.offline import iplot

# hr_data is the pandas DataFrame read from the attrition CSV
temp = hr_data.Attrition.value_counts()
# share of each class, rounded to two decimals
share = np.round(temp.astype(float) / temp.values.sum(), 2)
trace = go.Bar(x=temp.index,
               y=share,
               text=share,
               textposition='auto',
               name='Attrition')
data = [trace]
layout = go.Layout(
    autosize=False,
    width=600,
    height=400,
    title="Attrition Distribution"
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp
[Figure: Attrition Distribution]

The attrition rate in the company is 16%.

What is the gender distribution in the company?

temp = hr_data.Gender.value_counts()
# share of each gender, rounded to two decimals
share = np.round(temp.astype(float) / temp.values.sum(), 2)
data = [go.Bar(
    x=temp.index,
    y=share,
    text=share,
    textposition='auto',
)]
layout = go.Layout(
    autosize=False,
    width=600,
    height=400,
    title="Gender Distribution",
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp
[Figure: Gender Distribution]

I want to point out a general framework for creating bar charts in plotly.

Steps to create a bar chart with counts for a categorical variable:

- create an object and store the counts (optional)
- create a bar object
  - pass the x values
  - pass the y values
  - optional:
    - text to be displayed
    - text position
    - color of the bar
    - name of the bar (trace in plotly terminology)
- create a layout object
  - title: font and size of title
  - x axis: font and size of xaxis text
  - y axis: font and size of yaxis text
- create a figure object
  - add data
  - add layout
- plot the figure object
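
The steps above can be sketched without the dataset. The snippet below is a minimal sketch, not the notebook's code: it counts a small made-up list of values with the standard library and builds the figure as a plain dictionary with `data` and `layout` keys, which is the structure a Plotly figure is made of. The helper name `bar_figure_spec` and the sample values are assumptions for illustration only.

```python
from collections import Counter


def bar_figure_spec(values, title):
    """Build a Plotly-style bar figure as a plain dict (hypothetical helper)."""
    # Step 1: store the counts, most frequent first (similar to value_counts)
    counts = Counter(values).most_common()
    total = sum(count for _, count in counts)
    labels = [str(label) for label, _ in counts]
    shares = [round(count / total, 2) for _, count in counts]

    # Step 2: the bar trace with x values, y values and the optional settings
    trace = {
        "type": "bar",
        "x": labels,
        "y": shares,
        "text": shares,          # text displayed on each bar
        "textposition": "auto",  # let Plotly decide where the text goes
        "name": title,           # name of the trace
    }

    # Step 3: the layout object
    layout = {
        "title": {"text": title},
        "autosize": False,
        "width": 600,
        "height": 400,
    }

    # Step 4: a figure is the data plus the layout
    return {"data": [trace], "layout": layout}


# Made-up sample values, not the real attrition column
sample = ["No", "No", "Yes", "No", "No"]
fig_spec = bar_figure_spec(sample, "Attrition Distribution")
```

If Plotly is installed, `go.Figure(fig_spec)` turns this dictionary into a figure object that can be passed to `iplot`, which is the final "plot the figure object" step.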

Want to get fancy with the layout? Here is a function that could help.

def generate_layout_bar(col_name):
    layout_bar = go.Layout(
        autosize=False,  # use False if you are specifying the height and width
        width=800,  # width of the figure in pixels
        height=600,  # height of the figure in pixels
        # title text and font are set together; title=dict(...) replaces
        # the older, deprecated titlefont argument
        title=dict(
            text="Distribution of {} column".format(col_name),
            font=dict(
                family='Courier New, monospace',  # font family
                size=14,  # size of the font
                color='black'  # color of the font
            )
        ),
        # granular control on the axes objects
        xaxis=dict(
            tickfont=dict(
                family='Courier New, monospace',  # font family
                size=14,  # size of ticks displayed on the x axis
                color='black'  # color of the font
            )
        ),
        yaxis=dict(
            # range=[0,100],
            title=dict(
                text='Percentage',
                font=dict(size=14, color='black')
            ),
            tickfont=dict(
                family='Courier New, monospace',  # font family
                size=14,  # size of ticks displayed on the y axis
                color='black'  # color of the font
            )
        ),
        # default font, used for the text displayed on the bars
        font=dict(
            family='Courier New, monospace',  # font family
            color="white",  # color of the font
            size=12  # size of the font displayed on the bar
        )
    )
    return layout_bar

Let us define a function to create a bar chart in plotly

def plot_bar(col_name):
    # create a table with value counts
    temp = hr_data[col_name].value_counts()
    # percentage share of each category; rounding after multiplying by 100
    # avoids labels such as 70.95000000000002%
    pct = np.round(temp.values.astype(float) / temp.values.sum() * 100, 2)
    # creating a Bar chart object of plotly
    data = [go.Bar(
        x=temp.index.astype(str),  # x axis values
        y=pct,  # y axis values
        # text on the bar, formatted to show the '%' symbol with the number
        text=['{:.2f}%'.format(p) for p in pct],
        textposition='auto',  # where on the bar the text should appear
        marker=dict(color='#0047AB'),  # bar color: Cobalt Blue
    )]

    layout_bar = generate_layout_bar(col_name=col_name)
    fig = go.Figure(data=data, layout=layout_bar)
    iplot(fig)

How many people travel? (BusinessTravel)

plot_bar('BusinessTravel')
[Figure: BusinessTravel]

You can use this function to look at the other categorical columns. Refer to the ipynb notebook mentioned at the end of the article for more.

Age Distribution in the company.

data = [go.Histogram(
    x=hr_data.Age,
    marker=dict(
        color='#CC0E1D',  # Lava (#CC0E1D)
        # color='rgb(200,0,0)'
    ))]
layout = go.Layout(title="Histogram of Age")
fig = go.Figure(data=data, layout=layout)
iplot(fig)
[Figure: Age]

Bivariate analysis

Does one gender travel farther than the other? (Gender and Distance from home)

We can use a box plot to compare the median distance travelled by each gender.

# boxpoints='all' plots every observation, not only the outliers
# jitter (0 to 1) sets how widely those points are spread sideways
trace1 = go.Box(y=hr_data.DistanceFromHome[hr_data.Gender == 'Male'],
                name='Male',
                boxpoints='all', jitter=1)
trace2 = go.Box(y=hr_data.DistanceFromHome[hr_data.Gender == 'Female'],
                name='Female',
                boxpoints='all', jitter=1)
data = [trace1, trace2]
layout = go.Layout(width=1000,
                   height=500,
                   title='Distance from home and Gender')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

What if we bin the distance attribute and check the distribution? Check the ipynb notebook at the end of the article.
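
Binning itself can be sketched in a few lines. This is a minimal sketch under stated assumptions: the bin edges and labels are arbitrary choices, the distances are made up, and the notebook may bin differently. With pandas, `pd.cut` does the same job on the real `DistanceFromHome` column.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bin upper bounds (inclusive) and matching labels
EDGES = [5, 10, 20]
LABELS = ["0-5", "6-10", "11-20", "21+"]


def bin_distance(distance):
    """Return the label of the bin that a distance falls into."""
    # bisect_left gives the index of the first edge that is >= distance,
    # so a distance equal to an edge stays in the lower bin
    return LABELS[bisect_left(EDGES, distance)]


# Made-up distances, not taken from the dataset
distances = [1, 2, 5, 7, 10, 12, 25, 29]

# Number of employees per distance bin
binned = Counter(bin_distance(d) for d in distances)
```

The binned counts form a categorical variable, so they can be plotted with the same bar chart recipe used for the other categorical columns.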

Are married employees staying far from the office? (Marital status and Distance from home)

# one box per marital status; the trace name is the label shown under each box
tracediv = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Divorced'],
                  name='Divorced')
tracemarried = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Married'],
                      name='Married')
tracesin = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Single'],
                  name='Single')
data = [tracediv, tracemarried, tracesin]
layout = go.Layout(width=800,
                   height=500,
                   title='Distance from home and Marital Status')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Based on the median, that does not seem to be the case. Marital status does not appear to affect the distance travelled.
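
The median comparison read off the box plot can also be checked numerically. Below is a minimal sketch on made-up records; the values are illustrative only and are not results from the dataset. On the real data, `hr_data.groupby('MaritalStatus')['DistanceFromHome'].median()` gives the same kind of summary.

```python
from statistics import median

# Made-up (marital status, distance from home) pairs for illustration
records = [
    ("Single", 2), ("Single", 9), ("Single", 7),
    ("Married", 8), ("Married", 3), ("Married", 7),
    ("Divorced", 7), ("Divorced", 1), ("Divorced", 10),
]

# Group the distances by marital status
groups = {}
for status, distance in records:
    groups.setdefault(status, []).append(distance)

# Median distance per group; similar medians suggest no clear relationship
medians = {status: median(values) for status, values in groups.items()}
```

Comparing medians rather than means keeps a few employees with very long commutes from dominating the comparison.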

Think of any other relationships you want to understand from the data and plot them to understand the data.

I will write about how you can use clustering to get more insights in another blog. Please let me know in the comments if that would help.

For more plots, code and explanation, you can refer to the ipynb file in the GitHub repo.

Edit 1: You can take a look at a much deeper analysis on the dataset here https://github.com/SCK22/Visualizations/blob/master/PythonVisualizationActivity/HR_data_plotly_Visualization.ipynb

I had to remove the plots from output as the file was too heavy with the plots displayed.

Let me know in the comments if you would like to see more posts on visualization tools or other analysis topics.

Did you enjoy the article? Please let me know in the comments. You can also connect with me on LinkedIn and we can have a chat about this.

You can also visit my GitHub for some code snippets that might be handy in EDA and model building.

Connect with me on slack.
