Introduction

For my first blog post, I thought it would be fun to present an abridged version of an analysis of a synthetic dataset from Kaggle. The dataset contains information on about 15,000 employees of a company: their satisfaction level, number of projects, seniority, and other metrics of their employment, along with a binary variable indicating whether they left the company (view the full RMarkdown document here). The purpose of this analysis is to visualize multivariate relationships in the data that may explain why employees leave, and to use different modeling techniques to predict as accurately as possible whether an employee will leave. For this post, I will stick to the graphs I liked the most and briefly discuss the results of my modeling. After viewing the structure of the data, I chose to convert several categorical variables that were coded as numerics into factors, to help with visualization and modeling.

set.seed(1234)  # for reproducibility
library(dplyr)  # provides glimpse()
# Adjust this path to wherever HR_comma_sep.csv was downloaded
hr <- read.csv("C:/Users/Evan/Downloads/HR_comma_sep.csv")
glimpse(hr)
## Observations: 14,999
## Variables: 10
## $ satisfaction_level    <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10...
## $ last_evaluation       <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77...
## $ number_project        <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2...
## $ average_montly_hours  <int> 157, 262, 272, 223, 159, 153, 247, 259, ...
## $ time_spend_company    <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3...
## $ Work_accident         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ left                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sales                 <fctr> sales, sales, sales, sales, sales, sale...
## $ salary                <fctr> low, medium, medium, low, low, low, low...
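
The factor conversion mentioned in the introduction is not shown in this abridged post. Here is a minimal base R sketch of what such a step might look like. It uses a small made-up data frame (`hr_toy`) in place of `hr`, and both the choice of columns and the "no"/"yes" labels are assumptions for illustration rather than the exact code from the full document.

```r
# Toy stand-in for the real `hr` data frame (values are made up)
hr_toy <- data.frame(
  satisfaction_level    = c(0.38, 0.80, 0.11, 0.72),
  Work_accident         = c(0L, 0L, 1L, 0L),
  left                  = c(1L, 1L, 0L, 0L),
  promotion_last_5years = c(0L, 1L, 0L, 0L)
)

# Columns assumed to be categorical flags; the post does not list them
flag_cols <- c("Work_accident", "left", "promotion_last_5years")

# Recode each 0/1 column as a two-level factor
hr_toy[flag_cols] <- lapply(
  hr_toy[flag_cols],
  factor,
  levels = c(0, 1),
  labels = c("no", "yes")
)

str(hr_toy)
```

With the flags stored as factors, ggplot2 maps them to discrete colors instead of a continuous gradient, which makes plots colored by a 0/1 variable easier to read.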

Visualization

Histograms of each variable show that most employees have 3-4 projects and have been with the company for less than three years. Few earn high salaries, which corresponds to the large number of sales jobs in the dataset. There are two large groups of employees, one working ~150 hours a month and one working ~250 hours a month, and overall employees display a wide range of satisfaction levels, though slightly more positive than negative. I hypothesized that average hours worked per month, satisfaction level, and seniority play the biggest role in employee churn, so I investigated these relationships further with more plots. For these plots, I subset the employees who left in order to focus on their specific characteristics, and a few notable clusters start to emerge in each plot. I enjoy using the Plotly package to create interactive graphs, and the ease of its integration with ggplot2 cannot be overstated.
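
To illustrate how a histogram brings out two clusters like the ones in monthly hours, here is a small base R sketch. The data are simulated with `rnorm()` to mimic peaks near 150 and 250 hours. They are not the Kaggle values, and the bin width and the 15% threshold are arbitrary choices for the example.

```r
set.seed(1234)

# Simulated stand-in for the monthly-hours column: two clusters near 150 and 250.
# These values are generated for illustration and are not the Kaggle data.
hours_toy <- c(rnorm(500, mean = 150, sd = 15),
               rnorm(500, mean = 250, sd = 15))

# Fixed 25-hour bins make the two peaks easy to compare
h <- hist(hours_toy,
          breaks = seq(50, 350, by = 25),
          main = "Average Monthly Hours (simulated)",
          xlab = "Hours per Month")

# Midpoints of the bins that each hold at least 15% of the observations
peaks <- h$mids[h$counts / sum(h$counts) >= 0.15]
peaks
```

On the simulated data, the bins that pass the threshold sit on either side of 150 and 250, which is the bimodal shape described above.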

left <- subset(hr, left == 1)
names(left) <- c("satisfac", "eval", "proj", "hours", "years","accident","left","promote", "job", "salary")
library(plotly)
library(ggplot2)
# Satisfaction vs. hours worked, colored by number of projects
ggplotly(ggplot(left, aes(x = hours, y = satisfac)) +
  geom_jitter(aes(color = proj)) +
  labs(title = "Employee Satisfaction vs. Hours Worked",
       x = "Hours per Month", y = "Satisfaction Level"))
# Same axes, colored by years at the company
ggplotly(ggplot(left, aes(x = hours, y = satisfac)) +
  geom_jitter(aes(color = years)) +
  labs(title = "Employee Satisfaction vs. Hours Worked",
       x = "Hours per Month", y = "Satisfaction Level"))

We can see many of those who left belong to three distinct categories:
Those who work ~150 hours a month, have been with the company for 3 years, have only 2 projects, and were moderately unsatisfied with their jobs (group 1, in the middle left);
Those who work 250-300 hours a month, have been with the company for 4 or 5 years, have 6-7 projects, and were extremely unsatisfied with their jobs (group 2, in the bottom right);
and those who work 175-275 hours a month, have been with the company for 4-6 years, have 4-5 projects, and were very satisfied with their jobs (group 3, in the top center).
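
The three groups above can be turned into rough labeling rules, which makes it possible to count how many leavers fall into each one. The sketch below is only an illustration: `leavers_toy` is a made-up stand-in for the renamed `left` data frame, `label_group()` is a hypothetical helper, and the numeric cut-offs for satisfaction are my reading of "moderately", "extremely", and "very" rather than values taken from the analysis.

```r
# Made-up rows standing in for the renamed `left` data frame
leavers_toy <- data.frame(
  hours    = c(150, 280, 230, 200),
  satisfac = c(0.40, 0.10, 0.85, 0.55),
  proj     = c(2, 6, 5, 3)
)

# Hypothetical helper: cut-offs are eyeballed assumptions, not fitted values
label_group <- function(hours, satisfac, proj) {
  ifelse(proj == 2 & hours < 175 & satisfac >= 0.3 & satisfac <= 0.5,
         "group 1",
  ifelse(proj >= 6 & hours >= 250 & satisfac < 0.2,
         "group 2",
  ifelse(proj %in% 4:5 & hours >= 175 & hours <= 275 & satisfac >= 0.7,
         "group 3",
         "other")))
}

leavers_toy$group <- with(leavers_toy, label_group(hours, satisfac, proj))

# Count of leavers per group
table(leavers_toy$group)
```

Anyone who does not match a rule lands in "other", so the size of that bucket gives a quick sense of how much of the churn the three clusters actually cover.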
The plots also indicate that none of the company's more senior employees (7-10 years) left.
These groups persist in the next plot, which shows that group 1 scored relatively low on their last evaluation while groups 2 and 3 both scored quite high. It also shows how few of those who left had been promoted, which may point to a lack of upward mobility in the company as a reason for leaving. Group 1 also contains the highest ratio of promotions, which seems odd considering their lower evaluation scores and seemingly lighter responsibilities, given their number of projects and hours worked per month. It could be that HR is misjudging which employees are most suitable for promotion, which would also lead to lower satisfaction levels and more churn overall.

ggplotly(ggplot(left, aes(x = eval, y = satisfac)) +
  geom_jitter(aes(col = promote)) +
  labs(title = "Employee Satisfaction from Last Evaluation",
       x = "Last Evaluation Score", y = "Satisfaction Level"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
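
A promotion ratio per group is easier to compare as a number than by eye on a scatter plot. As a rough sketch, assuming each leaver already carries a group label and a 0/1 promotion flag, `tapply()` gives the share promoted within each group. The data frame `promo_toy` and its values are invented for the example and do not reflect the real counts.

```r
# Invented example: group labels and promotion flags for ten leavers
promo_toy <- data.frame(
  group   = c(rep("group 1", 4), rep("group 2", 4), rep("group 3", 2)),
  promote = c(1, 0, 0, 0,  0, 0, 0, 0,  0, 0)
)

# Mean of a 0/1 flag is the share of promoted employees in each group
promo_rate <- tapply(promo_toy$promote, promo_toy$group, mean)
promo_rate
```

Because the flag is 0/1, the group mean is the promotion ratio, so the same call works on the real data once the promotion column is numeric.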

Returning to the full dataset, another plot that highlights those who left confirms that the three groups fall in years 3, 4, and 5 of seniority.

ggplotly(ggplot(hr, aes(x = time_spend_company, y = average_montly_hours, col = left)) +
  geom_jitter() +
  labs(title = "Time Spent at Company vs. Average Hours Worked",
       x = "Years at Company", y = "Monthly Hours Worked"))
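
The same pattern can be checked with a cross-tabulation of tenure against the `left` flag. This is a minimal sketch on a made-up data frame (`tenure_toy`), using the original column names `time_spend_company` and `left`. The rows are invented, so the counts only demonstrate the technique.

```r
# Made-up rows using the original column names from the dataset
tenure_toy <- data.frame(
  time_spend_company = c(2, 3, 3, 4, 5, 5, 7, 8, 10, 3),
  left               = c(0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Counts of stayers (0) and leavers (1) at each tenure level
tenure_by_left <- table(years = tenure_toy$time_spend_company,
                        left  = tenure_toy$left)
tenure_by_left

# Share of leavers within each tenure level (each row sums to 1)
leave_share <- prop.table(tenure_by_left, margin = 1)
round(leave_share, 2)
```

Rows with a zero in the `1` column correspond to tenure levels where nobody left, which is how the absence of senior leavers would show up in the real data.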