Data analysis with plotly

7 min read · Apr 8, 2021

Data science is not just importing libraries and calling a model.fit method.

That part comes much later. Data analysis comes first, because it tells us what we are up against. As Andrew Ng argues in his video Model-centric to Data-centric AI, we should start focusing on the data rather than the models.

This video is a must-watch for beginners and professionals alike.

There are many tools that facilitate data analysis, and Plotly is my favorite among them.

Plotly and its extension Dash are downloaded 5 million times a month.

Plotly is so widely used that there is a community AMI for it on AWS, and the machine specification is also listed on the website.

Plotly: The front end for ML and data science models

Plotly creates & stewards the leading data viz & UI tools for ML, data science, engineering, and the sciences. Language…

plotly.com

The data used in this article is available in this GitHub repo.

SCK22/data

Age Attrition BusinessTravel Department DistanceFromHome Education EducationField EmployeeNumber…

github.com

Let us get started with the data analysis. I chose a rather simple dataset to begin with: attrition data. Attrition is a concern for any company and is unavoidable, but predicting who will leave the company (and when) can help plan operations without major delays.

Here is a quick peek at the data we are dealing with. Below are the columns and the data types that pandas inferred while reading the csv file mentioned above.

Age                          int64
Attrition                   object
BusinessTravel              object
Department                  object
DistanceFromHome             int64
Education                    int64
EducationField              object
EmployeeNumber               int64
EnvironmentSatisfaction      int64
Gender                      object
JobInvolvement               int64
JobLevel                     int64
JobRole                     object
JobSatisfaction              int64
MaritalStatus               object
MonthlyIncome                int64
NumCompaniesWorked           int64
Over18                      object
OverTime                    object
PercentSalaryHike            int64
PerformanceRating            int64
RelationshipSatisfaction     int64
StandardHours                int64
StockOptionLevel             int64
TotalWorkingYears            int64
TrainingTimesLastYear        int64
WorkLifeBalance              int64
YearsAtCompany               int64
YearsInCurrentRole           int64
YearsSinceLastPromotion      int64
YearsWithCurrManager         int64
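
A listing like the one above is what the `dtypes` attribute of a pandas DataFrame prints. As a minimal sketch, assuming the file is read with `pd.read_csv`, the snippet below reproduces the idea on two made-up rows that borrow a few of the real column names. The name `hr_sample` and the inline rows are illustrative only, not the actual data.

```python
import io

import pandas as pd

# Two made-up rows that reuse a few column names from the dataset.
# With the real file you would pass its path to pd.read_csv instead.
csv_text = """Age,Attrition,Department,MonthlyIncome
41,Yes,Sales,5993
49,No,Research & Development,5130
"""

hr_sample = pd.read_csv(io.StringIO(csv_text))

# One line per column: the column name and the type pandas inferred
print(hr_sample.dtypes)
```

Columns that hold only whole numbers are inferred as integers, while columns that hold text are kept as strings.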

In this blog I want to focus on the plotting part of the EDA stack, so I will skip explaining some of the pre-processing steps used here. I will still give the outline; you can refer to the notebook mentioned at the end of the article for the actual code.

1. EDA
    - Check the data types
    - Type conversion
    - Cardinality of data
    - Null value handling
    - Separate numeric and categorical columns
    - Understand the relationship among the variables

2. Model building

3. Deployment
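
For orientation, here is one way the pre-processing steps of the outline could look in pandas. This is a sketch on made-up rows, not the code from the notebook. The choices it makes are assumptions for illustration: treating `Education` as a category, dropping single-valued columns, and filling gaps with the median.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names are borrowed from the dataset
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "Education": [2, 1, 2, 4],
    "MonthlyIncome": [5993.0, 5130.0, np.nan, 2909.0],
    "Over18": ["Y", "Y", "Y", "Y"],
})

# Check the data types
print(df.dtypes)

# Type conversion: Education is an integer code, so treat it as a category
df["Education"] = df["Education"].astype("category")

# Cardinality: number of distinct values per column.
# A column with a single value (Over18 here) carries no information.
cardinality = df.nunique()
constant_cols = cardinality[cardinality == 1].index.tolist()
df = df.drop(columns=constant_cols)

# Null value handling: fill numeric gaps with the column median
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].median())

# Separate numeric and categorical columns
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in df.columns if c not in num_cols]
print(num_cols, cat_cols)
```

Keeping the numeric and categorical column names in two lists makes it easy to loop over them later, for example to draw one bar chart per categorical column.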

I will concentrate on the "understand the relationship among the variables" part of EDA and try to address the other parts in separate articles.

Univariate Analysis

What is the attrition rate in the company?

import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot

# hr_data is the DataFrame read from the csv file
temp = hr_data.Attrition.value_counts()
# share of each class, rounded to two decimals
share = np.round(temp.astype(float) / temp.values.sum(), 2)
trace = go.Bar(x=temp.index,
               y=share,
               text=share,
               textposition='auto',
               name='Attrition')
data = [trace]
layout = go.Layout(
    autosize=False,
    width=600,
    height=400,
    title="Attrition Distribution"
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp

The attrition rate in the company is 16%.

What is the gender distribution in the company?
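
As a side note, pandas can compute these shares directly: `value_counts(normalize=True)` returns proportions instead of counts. The sketch below uses a made-up ten-row series, so the numbers are not those of the real dataset.

```python
import pandas as pd

# Made-up labels: 8 employees stayed, 2 left
attrition = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Attrition")

# normalize=True divides each count by the total number of rows
shares = attrition.value_counts(normalize=True).round(2)
print(shares)
```

The resulting series can be passed to `go.Bar` in the same way as the manually computed shares.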

temp = hr_data.Gender.value_counts()
# share of each gender, rounded to two decimals
share = np.round(temp.astype(float) / temp.values.sum(), 2)
data = [go.Bar(
    x=temp.index,
    y=share,
    text=share,
    textposition='auto',
)]
layout = go.Layout(
    autosize=False,
    width=600,
    height=400,
    title="Gender Distribution",
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp

Here is a general framework for creating bar charts in plotly.

- Steps to create a bar chart with counts for a categorical variable
    - create an object and store the counts (optional)
    - create a bar object
        - pass the x values
        - pass the y values
        - optional :
            - text to be displayed
            - text position
            - color of the bar
            - name of the bar (trace in plotly terminology)
    - create a layout object
        - title - font and size of title
        - x axis - font and size of xaxis text
        - y axis - font and size of yaxis text
    - create a figure object:
        - add data
        - add layout
    - plot the figure object

Want to get fancy with the layout? Here is a function that could help.

def generate_layout_bar(col_name):
    layout_bar = go.Layout(
        autosize=False, # use False when you specify the height and width yourself
        width=800, # width of the figure in pixels
        height=600, # height of the figure in pixels
        # title text plus more granular control on the title font
        # (title.font replaces the deprecated titlefont attribute)
        title=dict(
            text="Distribution of {} column".format(col_name), # title of the figure
            font=dict(
                family='Courier New, monospace', # font family
                size=14, # size of the font
                color='black' # color of the font
            )
        ),
        # granular control on the axes objects
        xaxis=dict(
            tickfont=dict(
                family='Courier New, monospace', # font family
                size=14, # size of ticks displayed on the x axis
                color='black' # color of the font
            )
        ),
        yaxis=dict(
            # range=[0, 100],
            title=dict(
                text='Percentage',
                font=dict(
                    size=14,
                    color='black'
                )
            ),
            tickfont=dict(
                family='Courier New, monospace', # font family
                size=14, # size of ticks displayed on the y axis
                color='black' # color of the font
            )
        ),
        # default font, used for the text displayed on the bars
        font=dict(
            family='Courier New, monospace', # font family
            color="white", # color of the font
            size=12 # size of the font
        )
    )
    return layout_bar

Let us define a function to create a bar chart in plotly

def plot_bar(col_name):
    # create a table with value counts
    temp = hr_data[col_name].value_counts()
    # percentage per category; round after multiplying by 100
    # so the labels do not pick up floating point noise
    pct = np.round(temp.values.astype(float) / temp.values.sum() * 100, 2)
    # creating a Bar chart object of plotly
    data = [go.Bar(
        x=temp.index.astype(str), # x axis values
        y=pct, # y axis values
        # text to be displayed on the bar, formatted so that
        # the '%' symbol appears along with the number
        text=['{:.2f}%'.format(i) for i in pct],
        textposition='auto', # where on the bar the text should appear
        marker=dict(color='#0047AB'), # color of the bar (Cobalt Blue)
    )]

    layout_bar = generate_layout_bar(col_name=col_name)
    fig = go.Figure(data=data, layout=layout_bar)
    return iplot(fig)

How many people travel? (BusinessTravel)

plot_bar('BusinessTravel')

You can use this function to look at the other categorical columns; refer to the ipynb notebook mentioned at the end of the article.

Age Distribution in the company.

data = [go.Histogram(
    x=hr_data.Age,
    marker=dict(
        color='#CC0E1D', # Lava (#CC0E1D)
        # color='rgb(200,0,0)'
    ))]
layout = go.Layout(title="Histogram of Age")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Bivariate analysis

Does one gender travel farther than the other? (Gender and Distance from home)

We can use a box plot to compare the median distance travelled by each gender.

# boxpoints specifies which points to plot ('all' shows every observation)
# jitter (0 to 1) controls how widely the points are spread sideways
trace1 = go.Box(y=hr_data.DistanceFromHome[hr_data.Gender == 'Male'],
                name='Male',
                boxpoints='all', jitter=1)
trace2 = go.Box(y=hr_data.DistanceFromHome[hr_data.Gender == 'Female'],
                name='Female',
                boxpoints='all', jitter=1)
data = [trace1, trace2]
layout = go.Layout(width=1000,
                   height=500,
                   title='Distance from home and Gender')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

What if we bin the distance attribute and check the distribution? Check the ipynb notebook at the end of the article.

Do married employees live farther from the office? (Marital status and Distance from home)
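
As a rough idea of what binning could look like, `pd.cut` assigns each value to an interval. The bin edges, the labels and the distances below are assumptions chosen for illustration; the notebook may bin differently.

```python
import pandas as pd

# Made-up distances
distance = pd.Series([1, 2, 4, 7, 9, 10, 15, 23, 29], name="DistanceFromHome")

# Assumed bin edges; each interval includes its right edge,
# so (0, 5] covers the values 1 to 5
bins = [0, 5, 10, 20, 30]
labels = ["1-5", "6-10", "11-20", "21-30"]

distance_bin = pd.cut(distance, bins=bins, labels=labels)

# Count rows per bin, keeping the bins in their natural order
bin_counts = distance_bin.value_counts().sort_index()
print(bin_counts)
```

The binned column behaves like any other categorical column, so its counts can be plotted as a bar chart.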

tracediv = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Divorced'], name='Divorced')
tracemarried = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Married'], name='Married')
tracesin = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Single'], name='Single')
data = [tracediv, tracemarried, tracesin]
layout = go.Layout(width=800,
                   height=500,
                   title='Distance from home and Marital Status')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Based on the median, that does not seem to be the case. Marital status does not appear to affect the distance travelled.

Think of any other relationships you want to explore and plot them to understand the data better.

I will write about how you can use clustering to get more insights in another blog. Please let me know in the comments if that would help.
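
The medians read off a box plot can also be checked numerically with a `groupby`. The rows below are made up, so the values say nothing about the real dataset. The sketch only shows the call.

```python
import pandas as pd

# Made-up rows; only the column names are borrowed from the dataset
sample = pd.DataFrame({
    "MaritalStatus": ["Single", "Married", "Divorced", "Married", "Single", "Divorced"],
    "DistanceFromHome": [2, 8, 7, 10, 4, 9],
})

# Median distance per marital status: the same statistic
# that the middle line of each box shows
medians = sample.groupby("MaritalStatus")["DistanceFromHome"].median()
print(medians)
```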

For more plots, code and explanations, you can refer to the ipynb file in the GitHub repo.

SCK22/Visualizations

Visualizations on data from various sources, including data.gov.in - SCK22/Visualizations

github.com

Edit 1: You can take a look at a much deeper analysis on the dataset here https://github.com/SCK22/Visualizations/blob/master/PythonVisualizationActivity/HR_data_plotly_Visualization.ipynb

I had to remove the plots from output as the file was too heavy with the plots displayed.

Let me know if you would like to see how to use visualization tools or other analysis related posts in the comments.

Did you enjoy the article? Please let me know in the comments. You can also connect with me on LinkedIn and we can have a chat about this.

You can also visit my GitHub for some code snippets that might be handy in EDA and model building.

Connect with me on Slack.