# Project Summary
# 
# Frito-Lay contacted DDSAnalytics concerning its specialization in talent management solutions. The team at Frito-Lay is especially interested in DDSAnalytics' claimed ability to harness data science to predict attrition. DDSAnalytics has been tasked with conducting an analysis of existing employee data.
# 
# Specifically, DDSAnalytics will analyze factors that lead to attrition.


# Some project deliverables:
# 
# Top three factors that lead to attrition.
# 
# Identification of job-role-specific trends hidden in the data
# 
# Robust visualization of said trends
# 
# Model for predicting attrition, alongside a .csv file containing predictions for attrition
# 
# Model for salary prediction, alongside a .csv file containing predictions for monthly salary
# 
# I am confident in the validity of this analysis and hope that Frito-Lay determines an appropriate course of action based on the work in this R Markdown file.


#Executive Summary

# In this file we will examine a few questions of interest: 
#   
# 1) Top three factors that contribute to attrition.
# 
# 2) Identification of job-role-specific trends in the data.
# 
# 3) Robust visualization of said trends
# 
# 4) A model for prediction of attrition
# 
# 5) A model for prediction of salaries

# General Conclusions for QOI1:
# 
# Attrition is a tricky variable of interest. When the two groups are plotted, there is visual evidence that those who stay with their jobs look very similar to those who leave. Luckily, using principal component analysis (PCA) I was able to derive some important trends separating those who stayed from those who left.
# 
# PCA is a technique for dimensionality reduction, and it worked well with this dataset, as 38 variables were present in the initial dataset (CaseStudy2-data.txt). The initial impression of the data was that it was quite noisy: it contained many similar variables attempting to measure the same thing, and a few variables that did not yield any results significant for the attrition analysis or for any of the other QOIs.

#General Conclusions for QOI2: 

# From the components and the bi-plot, there are some interesting takeaways about job roles.
# 
# -Job roles tend to cluster around each other, indicating they each have specific attributes that lend themselves to that job role (maybe not that profound, but certainly confirmed after looking at the plot)
# 
# -Managers and research directors tend to be more heavily associated with years of experience.
# 
# -Sales representatives and sales directors are associated with attrition in terms of the absolute magnitude of their loadings.
# 
# -Interestingly, number of companies worked lands squarely in the middle of the pack and doesn't seem to have much bearing on job role.
# 
# -Years with current manager doesn't seem to point toward any specific job role, indicating that should one wish to pivot roles, they should develop attributes more in line with the intended role (a decision that could be made based on this chart)


#General Conclusions for QOI3:

#Throughout this file, you will find many visualizations of the data.  Each plot has commentary and adequate visuals to understand the purpose of the chart.

#General Conclusions for QOI4:

# Classification of Attrition
# 
# Since attrition seemed hard to spot visually, this presented an interesting classification problem: could a classifier pick up structure in the data that is unobservable to the human eye? Furthermore, the data were quite imbalanced. Only 140 of the 870 employees were labeled "Yes" for attrition, meaning that 730 were labeled "No." This Yes-to-No ratio is about 0.19, so simply guessing "No" for every employee would already be fairly accurate (roughly 84%).
# 
# The solution was to oversample the minority class to restore balance between the No's and Yes's.
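# A minimal base-R sketch of random oversampling (the analysis itself used the
# imbalance package; this version, with a made-up `toy` data frame, just
# illustrates the idea of duplicating minority-class rows until the classes balance):

```r
# Duplicate randomly sampled minority-class rows until both classes are equal in size.
oversample_minority <- function(df, class_col = "Attrition") {
  counts   <- table(df[[class_col]])
  minority <- names(which.min(counts))
  deficit  <- max(counts) - min(counts)
  minority_rows <- which(df[[class_col]] == minority)
  extra <- df[sample(minority_rows, deficit, replace = TRUE), , drop = FALSE]
  rbind(df, extra)
}

# Example with the same 730/140 split as the primary data:
set.seed(1)
toy <- data.frame(Attrition = factor(rep(c("No", "Yes"), c(730, 140))),
                  Age = sample(18:60, 870, replace = TRUE))
balanced <- oversample_minority(toy)
table(balanced$Attrition)  # both classes now have 730 rows
```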
# 
# For Bayesian classification the preferred data type is a factor, so all variables, including the outcome variable "Attrition," were converted to factors, making this a binary classification model. The Bayesian classifier came in at 81% accuracy, 86% sensitivity, and 76% specificity.
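# To make the factor-based approach concrete, here is a tiny hand-rolled (not
# e1071) naive Bayes on one made-up factor predictor: the posterior is
# proportional to the class prior times the class-conditional probability,
# both estimated from contingency tables.

```r
# Toy data: does OverTime predict Attrition?
toy <- data.frame(
  OverTime  = factor(c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No")),
  Attrition = factor(c("Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"))
)
prior <- prop.table(table(toy$Attrition))                   # P(Attrition)
cond  <- prop.table(table(toy$OverTime, toy$Attrition), 2)  # P(OverTime | Attrition)

predict_nb <- function(overtime) {
  posterior <- prior * cond[overtime, ]  # Bayes' rule, up to a normalizing constant
  names(which.max(posterior))
}
predict_nb("Yes")  # predicts "Yes" (attrition) for an overtime worker
```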
# 
# An additional model was used for classification: regularized discriminant analysis (RDA). RDA works with nominal labels and numerical attributes; discriminant analysis determines which variables discriminate between two or more naturally occurring groups, and it can serve either a descriptive or a predictive objective. I felt these characteristics were well suited to this data. After analysis I felt the RDA model was better, and chose to submit its predictions for attrition. RDA resulted in 85% accuracy, 87% sensitivity, and 83% specificity, outperforming the Bayesian classifier on all metrics.

#General Conclusions for QOI5:

# Finally, we discuss salary predictions. A multiple linear regression model was used to achieve the desired results. Initially a model was fit using all predictors; however, this yielded an R^2 value that was impossibly high. A variable-importance plot was then generated using the caret package, which visually showed the importance of each variable in the model. The following variables were included in the final model: JobLevel, TotalWorkingYears, YearsWithCurrManager, and DistanceFromHome. The linear model was generated and each of the variables was significant at the p < 0.05 level. Additionally, the model itself was significant with p < 0.05.
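# The structure of the final salary model can be sketched as follows. The data
# here are simulated (the coefficients are made up for illustration), but the
# formula mirrors the variables retained in the final model.

```r
set.seed(1)
n <- 200
sim <- data.frame(
  JobLevel             = sample(1:5,  n, replace = TRUE),
  TotalWorkingYears    = sample(0:40, n, replace = TRUE),
  YearsWithCurrManager = sample(0:17, n, replace = TRUE),
  DistanceFromHome     = sample(1:29, n, replace = TRUE)
)
# Simulated income with made-up effect sizes plus noise
sim$MonthlyIncome <- 1000 + 2500 * sim$JobLevel + 120 * sim$TotalWorkingYears +
  50 * sim$YearsWithCurrManager - 10 * sim$DistanceFromHome + rnorm(n, sd = 500)

fit <- lm(MonthlyIncome ~ JobLevel + TotalWorkingYears +
            YearsWithCurrManager + DistanceFromHome, data = sim)
summary(fit)$coefficients[, "Pr(>|t|)"]  # per-predictor p-values
```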

#In conclusion, we looked at factors that contribute to attrition, examined job-role-specific trends, predicted attrition using RDA, and predicted monthly income using a linear model. I hope these results can be used in a meaningful way. Thanks, Frito-Lay.
#############################################
#                                           #
#               Libraries                   #
#                                           #
#############################################

#This section contains libraries necessary for the functionality of this analysis

#Missingness maps
library(naniar)

#For reading .xlsx
library(readxl)

#General data manipulation
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ stringr 1.4.0
## ✓ tidyr   1.1.2     ✓ forcats 0.5.0
## ✓ readr   1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
#Plot generations
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
#Plotting PCA 
library(ggfortify)
library(pca3d)
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#For oversampling
library(imbalance)

#For ML
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(e1071)
#############################################
#                                           #
#           Read in data                    #
#                                           #
#############################################

#This section adds the files into the R project space so that the analysis can be performed. 

#Read in primary data ("CaseStudy2-data.txt")
primaryData <- read.table("CaseStudy2-data.txt",sep = ",",header = TRUE)
#Peek data
#head(primaryData)

###############Read in other datasets###############

#Case study 2 classification example dataset ("Case2PredictionsClassifyEXAMPLE.txt")
predictionsClassify <- read.table("Case2PredictionsClassifyEXAMPLE.txt",sep = ",",header = TRUE)

#Case study 2 predictions regression example dataset ("Case2PredictionsRegressEXAMPLE.txt")
predictionsRegression <- read.table("Case2PredictionsRegressEXAMPLE.txt",sep = ",",header = TRUE)

#Case study 2 compset no attrition dataset ("CaseStudy2CompSet No Attrition.txt")
compsetNoAttrition <- read.table("NoAttrition.txt",sep = ",",header = TRUE)

#Case study 2 compset no salary dataset ("CaseStudy2CompSet No Salary.xlsx")
compsetNoSalary <- read_excel("CaseStudy2CompSet No Salary.xlsx")
#############################################
#                                           #
#       EDA and Data Preparation            #
#                Part 1                     #
#############################################

#EDA is necessary to visually examine the data and begin to understand the structure
#of this dataset.  First we will start with missingness.

#Examine primary dataset for missingness
primaryDatMiss <- vis_miss(primaryData)
primaryDatMiss

#Examine Case study 2 classification example dataset
predictionClassiftMiss <- vis_miss(predictionsClassify)
predictionClassiftMiss

#Case study 2 predictions regression example dataset
predictionsRegressionMiss <- vis_miss(predictionsRegression)
predictionsRegressionMiss

#Case study 2 compset no attrition dataset 
compsetNoAttritionMiss <- vis_miss(compsetNoAttrition)
compsetNoAttritionMiss

#Case study 2 compset no salary dataset
compsetNoSalaryMiss <- vis_miss(compsetNoSalary)
compsetNoSalaryMiss

#############################################
#                                           #
#       EDA and Data Preparation            #
#                Part 2                     #
#                                           #
#############################################

#EDA is necessary to visually examine the data and begin to understand the structure of this dataset

#Convert all CHR variables to factors to examine pairs data
primaryData_factor <- primaryData %>%
  mutate_if(sapply(primaryData, is.character), as.factor)

#Convert all factor variables to ints for the covariance computation
primaryData_int <- primaryData_factor %>%
  mutate_if(sapply(primaryData_factor, is.factor), as.integer)
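# Worth noting: as.integer() on a factor returns the underlying level codes
# (alphabetical by default), which is what makes covariance computations over
# originally-character columns possible. A quick illustration:

```r
x <- factor(c("No", "Yes", "No", "Yes"))
as.integer(x)  # level codes: No = 1, Yes = 2, so this returns 1 2 1 2
```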


#Look at the dataset through a wide lens
describe(primaryData_factor)
##                          vars   n     mean      sd  median  trimmed     mad
## ID                          1 870   435.50  251.29   435.5   435.50  322.47
## Age                         2 870    36.83    8.93    35.0    36.39    8.90
## Attrition*                  3 870     1.16    0.37     1.0     1.08    0.00
## BusinessTravel*             4 870     2.60    0.68     3.0     2.75    0.00
## DailyRate                   5 870   815.23  401.12   817.5   817.85  514.46
## Department*                 6 870     2.27    0.53     2.0     2.27    0.00
## DistanceFromHome            7 870     9.34    8.14     7.0     8.28    7.41
## Education                   8 870     2.90    1.02     3.0     2.96    1.48
## EducationField*             9 870     3.24    1.32     3.0     3.09    1.48
## EmployeeCount              10 870     1.00    0.00     1.0     1.00    0.00
## EmployeeNumber             11 870  1029.83  604.79  1039.0  1031.45  807.28
## EnvironmentSatisfaction    12 870     2.70    1.10     3.0     2.75    1.48
## Gender*                    13 870     1.59    0.49     2.0     1.62    0.00
## HourlyRate                 14 870    65.61   20.13    66.0    65.63   25.20
## JobInvolvement             15 870     2.72    0.70     3.0     2.73    0.00
## JobLevel                   16 870     2.04    1.09     2.0     1.87    1.48
## JobRole*                   17 870     5.53    2.46     6.0     5.70    2.97
## JobSatisfaction            18 870     2.71    1.11     3.0     2.76    1.48
## MaritalStatus*             19 870     2.09    0.72     2.0     2.11    1.48
## MonthlyIncome              20 870  6390.26 4597.70  4945.5  5582.97 3304.72
## MonthlyRate                21 870 14325.62 7108.38 14074.5 14285.47 9273.66
## NumCompaniesWorked         22 870     2.73    2.52     2.0     2.40    1.48
## Over18*                    23 870     1.00    0.00     1.0     1.00    0.00
## OverTime*                  24 870     1.29    0.45     1.0     1.24    0.00
## PercentSalaryHike          25 870    15.20    3.68    14.0    14.78    2.97
## PerformanceRating          26 870     3.15    0.36     3.0     3.06    0.00
## RelationshipSatisfaction   27 870     2.71    1.10     3.0     2.76    1.48
## StandardHours              28 870    80.00    0.00    80.0    80.00    0.00
## StockOptionLevel           29 870     0.78    0.86     1.0     0.65    1.48
## TotalWorkingYears          30 870    11.05    7.51    10.0    10.19    5.93
## TrainingTimesLastYear      31 870     2.83    1.27     3.0     2.76    1.48
## WorkLifeBalance            32 870     2.78    0.71     3.0     2.80    0.00
## YearsAtCompany             33 870     6.96    6.02     5.0     6.00    4.45
## YearsInCurrentRole         34 870     4.20    3.64     3.0     3.83    4.45
## YearsSinceLastPromotion    35 870     2.17    3.19     1.0     1.47    1.48
## YearsWithCurrManager       36 870     4.14    3.57     3.0     3.81    4.45
##                           min   max range  skew kurtosis     se
## ID                          1   870   869  0.00    -1.20   8.52
## Age                        18    60    42  0.42    -0.31   0.30
## Attrition*                  1     2     1  1.84     1.40   0.01
## BusinessTravel*             1     3     2 -1.43     0.62   0.02
## DailyRate                 103  1499  1396 -0.03    -1.18  13.60
## Department*                 1     3     2  0.16    -0.48   0.02
## DistanceFromHome            1    29    28  0.91    -0.35   0.28
## Education                   1     5     4 -0.27    -0.63   0.03
## EducationField*             1     6     5  0.55    -0.69   0.04
## EmployeeCount               1     1     0   NaN      NaN   0.00
## EmployeeNumber              1  2064  2063 -0.03    -1.24  20.50
## EnvironmentSatisfaction     1     4     3 -0.28    -1.24   0.04
## Gender*                     1     2     1 -0.38    -1.86   0.02
## HourlyRate                 30   100    70  0.01    -1.21   0.68
## JobInvolvement              1     4     3 -0.48     0.26   0.02
## JobLevel                    1     5     4  1.03     0.44   0.04
## JobRole*                    1     9     8 -0.40    -1.16   0.08
## JobSatisfaction             1     4     3 -0.30    -1.27   0.04
## MaritalStatus*              1     3     2 -0.14    -1.08   0.02
## MonthlyIncome            1081 19999 18918  1.39     1.14 155.88
## MonthlyRate              2094 26997 24903  0.04    -1.21 241.00
## NumCompaniesWorked          0     9     9  1.00    -0.07   0.09
## Over18*                     1     1     0   NaN      NaN   0.00
## OverTime*                   1     2     1  0.93    -1.14   0.02
## PercentSalaryHike          11    25    14  0.83    -0.28   0.12
## PerformanceRating           3     4     1  1.94     1.76   0.01
## RelationshipSatisfaction    1     4     3 -0.30    -1.24   0.04
## StandardHours              80    80     0   NaN      NaN   0.00
## StockOptionLevel            0     3     3  1.03     0.50   0.03
## TotalWorkingYears           0    40    40  1.13     1.09   0.25
## TrainingTimesLastYear       0     6     6  0.52     0.48   0.04
## WorkLifeBalance             1     4     3 -0.57     0.46   0.02
## YearsAtCompany              0    40    40  1.62     3.36   0.20
## YearsInCurrentRole          0    18    18  0.87     0.31   0.12
## YearsSinceLastPromotion     0    15    15  1.99     3.71   0.11
## YearsWithCurrManager        0    17    17  0.73    -0.15   0.12
####Look at some of the variables as histograms####

#Histogram of age
hist(primaryData_factor$Age)

#Histogram of working years
hist(primaryData_factor$TotalWorkingYears)

#Histogram of Monthly Income
hist(primaryData_factor$MonthlyIncome)

####Look at some of the variables as plots####

#Look at relationship between monthly income and working years
plot(primaryData_factor$TotalWorkingYears,primaryData_factor$MonthlyIncome)

#Look at relationship between monthly income and attrition
plot(primaryData_factor$Attrition,primaryData_factor$MonthlyIncome)

#Look at relationship between distance from home and attrition
plot(primaryData_factor$Attrition,primaryData_factor$DistanceFromHome)

#Look at relationship between years since last promotion and attrition
plot(primaryData_factor$Attrition,primaryData_factor$YearsSinceLastPromotion)

#Look at relationship between job satisfaction and attrition
plot(primaryData_factor$Attrition,primaryData_factor$JobSatisfaction)

#Look at relationship between environment satisfaction and attrition
plot(primaryData_factor$Attrition,primaryData_factor$EnvironmentSatisfaction)

#Look at relationship between marital status and attrition
plot(primaryData_factor$Attrition~primaryData_factor$MaritalStatus)

#Look at Attrition levels
plot(primaryData_factor$Attrition)

plot(primaryData_factor$Attrition,primaryData_factor$DistanceFromHome)

#############################################
#                                           #
#            Identify factors               #
#              that lead to                 #
#                attrition                  #
#                                           #
#                                           #
#############################################

#First method will be to use PCA (principal component analysis)

#Let's look at the covariance matrix of the attrition dataset

#A bit of data preparation
pcaDat <- primaryData_int[,c(2:4,6:9,12:13,15:20,22:36)]
#pcaDat

PCAdat_factor <- primaryData_int[,c(2:4,6:9,12:13,15:20,22:36)]
#PCAdat_factor[,c(2,3,4,7,9,12,14,17,18)] <- as.integer(PCAdat_factor[,c(2,3,4,7,9,12,14,17,18)])
PCAdat_factor$Attrition <- as.factor(PCAdat_factor$Attrition)
PCAdat_factor$Attrition <- factor(PCAdat_factor$Attrition,levels = c(1,2),labels = c("No","Yes"))


S <- cov(pcaDat[,c(1,3:30)])
#S

#Examine total variance (the trace of S, i.e. the sum of the eigenvalues of S)
sum(diag(S))
## [1] 21139122
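# As a sanity check of the identity used above (total variance = trace of S =
# sum of the eigenvalues of S), here it is verified on a small simulated matrix:

```r
set.seed(42)
X <- matrix(rnorm(100 * 3), ncol = 3)
S_toy <- cov(X)
# The trace and the eigenvalue sum agree up to floating-point tolerance
all.equal(sum(diag(S_toy)), sum(eigen(S_toy)$values))
```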
#Compute the eigenvalues and corresponding eigenvectors of S
s.eigen <- eigen(S)
s.eigen
## eigen() decomposition
## $values
##  [1] 2.113888e+07 7.202456e+01 6.631988e+01 4.778148e+01 1.354538e+01
##  [6] 1.123709e+01 6.217654e+00 5.647596e+00 4.918368e+00 3.729123e+00
## [11] 3.683370e+00 1.833972e+00 1.514393e+00 1.251257e+00 1.206319e+00
## [16] 1.167420e+00 1.008933e+00 9.446959e-01 4.966267e-01 4.850435e-01
## [21] 4.352707e-01 2.436575e-01 2.041076e-01 2.002598e-01 1.623694e-01
## [26] 8.334766e-02 5.027880e-02 8.344090e-17 1.370361e-17
## 
## $vectors
##                 [,1]          [,2]          [,3]          [,4]          [,5]
##  [1,]  -9.401981e-04  8.396407e-01  0.0432707792  0.4313026623  4.073105e-02
##  [2,]  -9.211448e-06 -3.805453e-03  0.0047807533  0.0026532501 -1.242759e-03
##  [3,]  -5.187284e-06 -5.069297e-03  0.0002980445 -0.0030806856 -5.260841e-03
##  [4,]   1.179909e-05  7.748396e-02 -0.9946588258 -0.0546109738 -2.985425e-02
##  [5,]  -2.828814e-05  2.015820e-02 -0.0050438331  0.0100220391  1.257185e-03
##  [6,]   1.137077e-05 -4.963654e-03 -0.0024427353 -0.0083708695 -1.529578e-02
##  [7,]   4.035640e-06 -5.245451e-04  0.0052722411  0.0007602070  1.260042e-03
##  [8,]   5.896832e-06 -2.677144e-03  0.0003590958  0.0042171132 -5.428142e-04
##  [9,]  -6.846132e-08  8.390746e-05  0.0002049694  0.0035785075  3.658734e-03
## [10,]  -2.256332e-04  6.368421e-03 -0.0029061268 -0.0083282179 -3.320457e-03
## [11,]   4.022370e-05 -3.234075e-02  0.0015494726  0.0065496145 -6.481275e-03
## [12,]   1.288358e-05  1.223171e-03  0.0033199846 -0.0060034548  6.592689e-03
## [13,]   1.125515e-05 -6.511251e-03  0.0025223384 -0.0018241708 -4.518813e-04
## [14,]  -9.999984e-01 -1.584842e-03 -0.0001567469  0.0004883040  8.128895e-05
## [15,]  -8.546091e-05  5.740232e-02  0.0158768744  0.1260538381 -4.970948e-02
## [16,]  1.064490e-109  0.000000e+00  0.0000000000  0.0000000000  9.926167e-24
## [17,]   2.494422e-06  3.441739e-04 -0.0034102188  0.0016395945 -9.018317e-04
## [18,]   4.306096e-05 -9.707445e-03 -0.0322751690  0.0206575627  9.865436e-01
## [19,]   3.367402e-06 -9.181222e-04 -0.0019989600 -0.0002686572  7.502445e-02
## [20,]   9.315193e-07 -2.927152e-04 -0.0048408936 -0.0011660781 -1.574782e-02
## [21,]   0.000000e+00  0.000000e+00  0.0000000000  0.0000000000  0.000000e+00
## [22,]  -3.523976e-06  5.673230e-03 -0.0070959636 -0.0024419329 -7.753158e-04
## [23,]  -1.272263e-03  4.154281e-01  0.0326455691 -0.1785521747 -8.989945e-02
## [24,]   1.079942e-05 -3.809173e-03  0.0061966977 -0.0066279922  2.700462e-03
## [25,]  -3.221740e-06 -8.389682e-04  0.0013116356 -0.0044667895  2.145601e-03
## [26,]  -6.434996e-04  2.544075e-01  0.0605721990 -0.6532012033  3.725116e-02
## [27,]  -2.864163e-04  1.327036e-01  0.0283819890 -0.3693176763  6.125551e-02
## [28,]  -2.189045e-04  1.129106e-01  0.0257659248 -0.2556517517 -3.517556e-02
## [29,]  -2.553805e-04  1.276234e-01  0.0323024324 -0.3659634601  3.137862e-02
##                [,6]          [,7]          [,8]          [,9]         [,10]
##  [1,] -3.229959e-01  7.535130e-03  2.089766e-03  2.992892e-03  3.537140e-03
##  [2,]  3.197321e-03  9.805932e-03  1.207093e-02  9.846941e-03 -2.938065e-02
##  [3,] -1.643514e-02 -1.370344e-01 -3.724444e-03  4.756016e-02  4.084171e-03
##  [4,] -7.533984e-03 -6.955734e-03  1.033473e-03 -1.103503e-02  6.364807e-03
##  [5,] -9.845133e-03 -5.862410e-02  9.927089e-03 -4.961342e-02  4.592008e-02
##  [6,]  1.247443e-02  1.044807e-02 -2.336635e-02 -2.595844e-02 -2.329528e-02
##  [7,] -1.971441e-03  2.109469e-02 -1.153614e-02 -3.930292e-02 -5.016946e-02
##  [8,]  7.003552e-04  7.437720e-03 -9.304167e-03  3.607076e-03  2.204658e-03
##  [9,] -6.141960e-03  2.131993e-05  1.422389e-02 -1.651721e-02 -8.737925e-03
## [10,]  6.236150e-03 -2.259567e-03  8.045168e-03  5.329261e-03  2.023734e-03
## [11,] -9.020906e-02 -9.191497e-01  9.709417e-04  3.446274e-01  1.941342e-02
## [12,] -2.087285e-02  1.275610e-02  1.165938e-02  4.825469e-02  6.866185e-03
## [13,]  6.156236e-03 -1.113503e-02  1.011628e-03  9.844760e-03  3.290883e-02
## [14,] -5.655330e-04  1.580704e-05 -2.019408e-05 -1.593817e-04  4.456081e-05
## [15,]  3.271796e-01 -2.657688e-01  8.602034e-02 -6.784335e-01  1.644631e-01
## [16,]  0.000000e+00 -4.336809e-19  0.000000e+00  0.000000e+00  3.330669e-16
## [17,] -3.449006e-03 -1.234485e-02 -4.713366e-03  8.399768e-03 -2.959770e-03
## [18,]  1.224846e-01 -1.828386e-02 -5.155613e-02 -5.157743e-04  3.044700e-02
## [19,]  1.086882e-02  1.536308e-03 -3.641103e-03 -2.900677e-03  2.292358e-03
## [20,] -7.839785e-03  6.102349e-03 -2.517657e-02 -3.520508e-02  2.147959e-03
## [21,] -7.888609e-31  0.000000e+00  5.293956e-23  1.058791e-22  0.000000e+00
## [22,] -2.311988e-03  1.312988e-03  1.673390e-02 -3.226796e-02 -3.794010e-02
## [23,]  8.255728e-01 -2.043910e-02  2.437325e-02  2.239508e-01 -6.527887e-02
## [24,] -1.029510e-02  4.626832e-02  2.768067e-02  5.360729e-02  9.720465e-03
## [25,]  1.142715e-03 -1.461450e-02 -3.437078e-03 -1.995169e-02 -3.353552e-02
## [26,] -1.698367e-01  1.245281e-01  4.346492e-03  2.735696e-01  7.232788e-02
## [27,] -1.585675e-01 -1.608789e-01  2.043023e-01 -3.862282e-01 -7.414125e-01
## [28,] -6.544328e-02 -1.004406e-01 -8.779913e-01 -2.633252e-01  1.354852e-01
## [29,] -1.555538e-01 -8.021979e-02  4.166050e-01 -2.488217e-01  6.197683e-01
##               [,11]         [,12]         [,13]         [,14]         [,15]
##  [1,]  1.147001e-02  0.0134008651 -3.684404e-03 -5.956499e-03  1.474633e-03
##  [2,] -4.771168e-03 -0.0110379895  8.618643e-03 -3.500074e-03 -1.769568e-02
##  [3,] -6.877956e-04  0.0184422565  1.968129e-02  9.619506e-03  8.780098e-04
##  [4,]  1.386443e-02  0.0012657782  5.454868e-03  3.336534e-03 -6.029276e-03
##  [5,]  1.323837e-02 -0.0473753234 -5.882172e-02  1.555496e-01  3.054018e-03
##  [6,] -1.602896e-02  0.8350680224 -5.300345e-01  6.469814e-02  4.380086e-02
##  [7,] -2.292377e-03  0.0608024349 -1.143633e-01 -3.758465e-01 -7.030205e-01
##  [8,]  3.584595e-03 -0.0058623055  1.303934e-03 -7.486061e-04 -1.153997e-02
##  [9,] -3.207720e-02 -0.0186702404  1.022417e-02 -4.797689e-02 -5.534836e-03
## [10,] -4.494521e-03 -0.0041154984 -7.742171e-03 -9.430316e-03 -1.331546e-02
## [11,]  2.700393e-02  0.0319336277  6.285608e-03 -2.992885e-02 -2.122662e-02
## [12,]  3.225101e-02 -0.0958700611 -1.572771e-01  7.217060e-01  8.356013e-02
## [13,]  2.489376e-03  0.0050071408 -2.798302e-02  3.114836e-02  1.766074e-01
## [14,]  4.519855e-05  0.0000149390  1.091658e-06  1.749731e-05 -1.248003e-06
## [15,]  5.449866e-01  0.0089818893  2.563011e-02  4.139404e-02 -2.162964e-02
## [16,] -2.775558e-17  0.0000000000  0.000000e+00 -1.249001e-16 -2.215242e-15
## [17,]  1.093419e-02 -0.0071067415 -2.380428e-02 -1.363983e-02 -1.583084e-02
## [18,]  4.421172e-03  0.0131499732 -5.094472e-03 -9.164968e-03  8.090935e-03
## [19,] -4.637060e-03 -0.0009927603 -5.337305e-04 -5.489078e-03  5.045164e-03
## [20,]  5.447652e-02  0.0698452329  3.777076e-02 -5.201235e-01  6.335476e-01
## [21,] -6.776264e-21  0.0000000000  0.000000e+00  2.775558e-17  1.387779e-17
## [22,] -6.763669e-03 -0.0171018578  6.384426e-02 -1.669776e-03 -2.291756e-01
## [23,] -2.187700e-01 -0.0046836280  6.919781e-03 -4.191190e-03  1.065715e-02
## [24,]  3.568724e-02  0.5273148440  8.145853e-01  1.583162e-01 -6.188003e-02
## [25,] -1.088453e-03  0.0359946173  3.968497e-03 -5.399922e-02 -4.471489e-02
## [26,]  6.104308e-01 -0.0100418039 -4.360202e-02 -3.032186e-02 -2.961889e-02
## [27,] -2.000013e-01 -0.0073443239  3.719166e-02  4.192125e-02  5.327310e-02
## [28,] -2.047885e-01 -0.0030973766  5.138066e-02  3.901287e-02 -9.059198e-03
## [29,] -4.387412e-01  0.0205509305  2.248308e-03 -1.866757e-02 -8.049098e-03
##               [,16]         [,17]         [,18]         [,19]         [,20]
##  [1,] -2.842137e-03 -1.299476e-02  2.099564e-02  2.709308e-03  1.350731e-03
##  [2,] -5.222929e-02 -3.214177e-02 -4.434323e-02 -3.430609e-01  2.162402e-01
##  [3,]  4.261608e-03 -8.081380e-03  1.135225e-02  1.421607e-02 -4.392098e-02
##  [4,]  3.728393e-03 -1.116310e-02  4.795767e-03 -1.961446e-03  3.957106e-03
##  [5,] -9.551138e-02  2.111031e-01 -9.503023e-01  1.209392e-02 -5.545181e-02
##  [6,] -9.278843e-02  5.366710e-02  2.093078e-02 -2.768164e-02  5.894705e-03
##  [7,]  5.422525e-01 -1.378427e-01 -1.457831e-01 -7.496367e-02 -3.275650e-02
##  [8,]  1.252348e-02  3.720333e-02  1.238518e-03 -3.755326e-02  4.054162e-03
##  [9,] -3.480388e-02  1.007313e-01 -2.608770e-02 -4.077662e-01  8.290799e-01
## [10,]  1.150485e-02 -6.000996e-03 -8.239943e-03  4.313163e-02 -3.124804e-02
## [11,]  2.229240e-02  1.006192e-02  3.149860e-02 -1.671704e-02  9.081087e-03
## [12,]  6.400163e-01  1.057017e-01  8.559493e-02 -2.450147e-02  5.952404e-02
## [13,]  7.416078e-02 -5.564274e-01 -1.359635e-01  2.004255e-02  6.789958e-02
## [14,]  1.043458e-06 -5.960857e-06  2.008845e-05 -6.086893e-06  6.940940e-07
## [15,] -4.774265e-03 -2.953948e-02  6.480488e-02 -1.716292e-02  1.222358e-02
## [16,] -1.200429e-15  3.330669e-16  1.595946e-16 -8.723577e-14  2.402384e-13
## [17,]  2.996608e-02 -8.989855e-03  1.517571e-02 -1.193620e-02 -1.901578e-02
## [18,]  4.419810e-03  6.595698e-03  9.715661e-04 -2.655897e-03 -3.049354e-03
## [19,] -1.750721e-03 -5.504097e-03  6.829567e-03  6.554180e-03  2.554481e-03
## [20,]  4.932957e-01  2.498167e-01 -8.034365e-02 -3.578798e-02 -3.598881e-02
## [21,]  0.000000e+00  3.816392e-17  3.469447e-17 -5.551115e-17  8.326673e-17
## [22,] -6.376254e-02  7.290126e-01  1.595359e-01  1.314438e-02 -6.823963e-02
## [23,]  2.500938e-02  1.329703e-02 -1.835545e-02 -1.977976e-04 -1.150901e-03
## [24,]  1.154119e-01 -3.474830e-02 -7.454239e-02 -2.478065e-02  3.008082e-03
## [25,]  5.125390e-02  3.781199e-02 -2.985132e-02  8.377226e-01  4.923170e-01
## [26,] -4.364573e-02  6.121014e-03 -7.424845e-03 -3.432430e-03  2.279600e-02
## [27,]  2.387120e-03 -4.398383e-02 -8.561952e-03 -1.299691e-02 -3.257698e-02
## [28,]  7.841824e-03 -7.322111e-04  1.604747e-02 -1.237879e-02  6.554358e-03
## [29,]  3.807681e-02  5.377489e-03  3.957304e-02  8.405673e-03 -4.442225e-03
##               [,21]         [,22]         [,23]         [,24]         [,25]
##  [1,]  4.493366e-03  6.963262e-04 -5.331090e-03 -2.937504e-04 -3.600803e-04
##  [2,]  9.052061e-01 -1.767598e-02 -6.900817e-03 -6.977339e-02 -4.775512e-02
##  [3,] -1.749646e-02  7.206594e-02 -1.693754e-01 -1.107667e-01 -8.379556e-01
##  [4,]  4.265441e-03 -2.276895e-03  2.284050e-04 -4.285717e-03  2.271958e-03
##  [5,] -2.228602e-02 -6.255832e-04  1.925690e-02  1.113411e-02 -5.483468e-03
##  [6,]  1.862252e-03  2.367508e-03 -1.430918e-02  5.456317e-03  1.463106e-03
##  [7,] -1.690400e-02  1.293346e-02 -1.338159e-02 -3.013931e-02  9.949464e-03
##  [8,] -2.494631e-02 -9.097341e-01 -3.749091e-01  1.586611e-01 -4.155776e-02
##  [9,] -3.534525e-01  3.827558e-02  6.618295e-03  2.713989e-02 -5.160334e-02
## [10,] -3.776236e-02  4.726039e-02 -2.630809e-02 -6.256895e-02 -4.735938e-01
## [11,]  5.767522e-03 -1.517766e-02  1.773982e-02  1.391674e-03  1.277377e-01
## [12,]  2.376817e-02 -6.473147e-05  2.051658e-02 -2.667061e-02  1.649503e-04
## [13,]  5.866692e-03  2.960368e-01 -6.351567e-01  3.559064e-01  8.885003e-02
## [14,]  2.069589e-07 -8.687530e-06  3.321766e-06  1.794193e-05  1.091070e-04
## [15,]  6.932274e-03 -6.719070e-03  6.963782e-03 -7.276024e-03 -3.880054e-03
## [16,] -2.467471e-14  1.066813e-12  2.688683e-12 -4.905907e-11  2.380814e-12
## [17,]  6.264699e-02 -2.611194e-02  4.482771e-01  8.679730e-01 -1.923778e-01
## [18,]  3.376066e-03  8.509333e-06 -1.091605e-03  1.802943e-03 -2.549961e-03
## [19,] -2.876474e-04  1.315199e-02 -3.575840e-03 -2.380475e-02 -3.591601e-02
## [20,]  3.947544e-02  8.593591e-03 -2.536753e-03 -1.165414e-02 -2.729014e-03
## [21,]  0.000000e+00  9.992007e-16  3.066991e-15 -5.720684e-14  3.219647e-15
## [22,]  6.923052e-02  2.679489e-01 -4.720797e-01  2.648506e-01  6.266134e-02
## [23,] -4.619479e-03 -1.926605e-03  2.268588e-03  3.185177e-03  3.809941e-03
## [24,] -1.010152e-02 -7.740888e-03  2.574843e-02  1.446315e-02  1.030488e-02
## [25,]  2.003082e-01 -4.384782e-02  1.659927e-02 -6.889463e-03 -9.745256e-03
## [26,] -5.290172e-03  3.941570e-03 -8.346251e-03 -2.868834e-03  9.479320e-04
## [27,] -1.775642e-02 -1.273346e-02  4.569615e-03  3.511487e-03  3.919802e-03
## [28,]  1.466629e-02  5.255089e-03  1.090469e-03  5.729670e-04  3.879178e-03
## [29,]  2.686038e-02 -1.304595e-02  1.062990e-02  2.259752e-03  4.406633e-03
##               [,26]         [,27]         [,28]         [,29]
##  [1,]  5.350374e-04  1.142666e-03  0.000000e+00  0.000000e+00
##  [2,]  3.326556e-02 -2.460820e-03  3.337081e-12 -2.762960e-17
##  [3,] -4.760896e-01 -1.197290e-02  2.970922e-12  3.274283e-16
##  [4,] -2.931618e-03  5.366972e-04  2.143718e-13 -1.954680e-18
##  [5,] -9.770787e-03  8.894526e-03 -5.238158e-13  1.333323e-17
##  [6,]  4.958973e-03  1.523924e-03 -2.955523e-13  1.063169e-17
##  [7,] -2.028548e-02  3.448348e-03  1.476708e-12 -6.354632e-17
##  [8,]  2.860791e-02  1.218275e-02 -9.851894e-12 -1.969212e-16
##  [9,]  6.743430e-03 -1.453978e-03 -1.150901e-12  1.335552e-16
## [10,]  8.719351e-01 -5.886102e-02  2.057010e-12  7.985552e-17
## [11,]  6.877782e-02  5.688899e-03  2.820861e-13 -5.631565e-17
## [12,]  3.655502e-03  3.828369e-03  1.379233e-12 -1.668581e-17
## [13,]  3.688242e-02  7.500738e-04 -1.862088e-11  8.800236e-17
## [14,] -1.746019e-04  1.374186e-05 -6.493016e-16 -1.998304e-20
## [15,]  3.835652e-03  6.548062e-04  3.642040e-13 -3.910169e-19
## [16,]  1.652567e-13 -2.531308e-14 -9.999993e-01 -1.171312e-03
## [17,] -2.439032e-02  1.709908e-02 -4.187186e-11  2.257507e-16
## [18,] -2.785309e-03 -7.595199e-02 -9.658145e-14 -8.433643e-19
## [19,]  4.517298e-02  9.949513e-01  1.068958e-12  2.632821e-17
## [20,] -9.297816e-04 -2.001343e-03  5.589783e-13 -3.613565e-17
## [21,]  3.330669e-16 -2.775558e-17 -1.171312e-03  9.999993e-01
## [22,]  2.895642e-02  6.244777e-03 -1.384129e-11  6.365102e-17
## [23,] -1.158378e-02 -9.928065e-04 -1.452353e-13 -1.359336e-18
## [24,]  1.673529e-02  3.224574e-03 -6.182496e-13 -1.070951e-17
## [25,] -1.999751e-02 -5.372413e-03  3.497778e-13  4.649921e-17
## [26,] -4.843320e-03  2.549414e-03  1.299294e-13  4.537135e-18
## [27,]  6.534950e-04 -1.822765e-03 -1.705086e-13 -4.965373e-18
## [28,]  9.264601e-03 -1.716278e-03 -6.449692e-15 -9.379682e-19
## [29,] -5.629134e-03 -2.704857e-03 -8.902551e-14 -8.270604e-18
#Use eigenvalues to find the proportion of the total variance explained by the components
for (s in s.eigen$values) {
  print(s / sum(s.eigen$values))
}
## [1] 0.9999883
## [1] 3.407169e-06
## [1] 3.137305e-06
## [1] 2.260334e-06
## [1] 6.407729e-07
## [1] 5.31578e-07
## [1] 2.941302e-07
## [1] 2.671632e-07
## [1] 2.326666e-07
## [1] 1.764086e-07
## [1] 1.742442e-07
## [1] 8.675726e-08
## [1] 7.163936e-08
## [1] 5.919154e-08
## [1] 5.70657e-08
## [1] 5.522556e-08
## [1] 4.772826e-08
## [1] 4.468946e-08
## [1] 2.349325e-08
## [1] 2.29453e-08
## [1] 2.059077e-08
## [1] 1.152638e-08
## [1] 9.655442e-09
## [1] 9.473419e-09
## [1] 7.680993e-09
## [1] 3.942816e-09
## [1] 2.378472e-09
## [1] 3.947226e-24
## [1] 6.482584e-25
#In fact the first principal component alone accounts for essentially all (>99.99%) of the total variance -- a sign that the unscaled covariance matrix is dominated by one large-variance variable rather than by meaningful structure
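#The dominance of the first component comes from running PCA on the raw covariance matrix: the variable with the largest raw variance (monthly income here) swamps everything else. A minimal sketch on synthetic data (variable names are illustrative, not from the case-study file):

```r
# Two synthetic variables on very different scales:
# 'income' varies in the thousands, 'rating' between roughly 1 and 5.
set.seed(42)
income <- rnorm(100, mean = 5000, sd = 2000)
rating <- rnorm(100, mean = 3, sd = 1)
X <- cbind(income, rating)

# Unscaled PCA: the covariance matrix is dominated by income's variance,
# so the first eigenvalue absorbs nearly all of the total variance.
ev.raw <- eigen(cov(X))$values
prop1.raw <- ev.raw[1] / sum(ev.raw)   # very close to 1

# Using the correlation matrix (equivalently, scaling first)
# puts both variables on equal footing.
ev.scl <- eigen(cor(X))$values
prop1.scl <- ev.scl[1] / sum(ev.scl)   # much closer to 1/2
```

#This is why the correlation-matrix (scaled) PCA gives a far more informative decomposition for this dataset.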

#Create a plot to determine the proportion of variance explained by each subsequential eigenvalue
plot(s.eigen$values, xlab="Eigenvalue Number", ylab="Eigenvalue Size", main="Scree Graph")
lines(s.eigen$values)
####Use R packages to confirm results####
quitters.pca <- prcomp(pcaDat[,c(1,3:30)])
#quitters.pca

#Components reported by package are equal to earlier computations

#Look at proportion of variance explained by components
summary(quitters.pca)
## Importance of components:
##                         PC1   PC2   PC3   PC4  PC5   PC6   PC7   PC8   PC9
## Standard deviation     4598 8.487 8.144 6.912 3.68 3.352 2.494 2.376 2.218
## Proportion of Variance    1 0.000 0.000 0.000 0.00 0.000 0.000 0.000 0.000
## Cumulative Proportion     1 1.000 1.000 1.000 1.00 1.000 1.000 1.000 1.000
##                         PC10  PC11  PC12  PC13  PC14  PC15 PC16  PC17  PC18
## Standard deviation     1.931 1.919 1.354 1.231 1.119 1.098 1.08 1.004 0.972
## Proportion of Variance 0.000 0.000 0.000 0.000 0.000 0.000 0.00 0.000 0.000
## Cumulative Proportion  1.000 1.000 1.000 1.000 1.000 1.000 1.00 1.000 1.000
##                          PC19   PC20   PC21   PC22   PC23   PC24  PC25   PC26
## Standard deviation     0.7047 0.6965 0.6598 0.4936 0.4518 0.4475 0.403 0.2887
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.000 0.0000
## Cumulative Proportion  1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.000 1.0000
##                          PC27      PC28      PC29
## Standard deviation     0.2242 4.019e-13 4.019e-13
## Proportion of Variance 0.0000 0.000e+00 0.000e+00
## Cumulative Proportion  1.0000 1.000e+00 1.000e+00
#Plot the principal components
pca.plot <- autoplot(quitters.pca,data=PCAdat_factor,colour='Attrition',loadings=TRUE,loadings.colour="blue",loadings.label=TRUE) 
pca.plot

#The points of the two groups overlap considerably; however, there appears to be more attrition at the right of the graph.  The data do not appear to depart widely from multivariate normality.  

#Create new df for R
rDat <- pcaDat[,c(1,3:16,18:21,23:30)]
#rDat

#Use the correlation matrix R instead of the covariance matrix S
R <- cor(rDat)
#R

#Again find Eigenvalues and Eigenvectors for R
r.eigen <- eigen(R)
#r.eigen


#Compute proportion of total variance explained by the eigenvalues
for (r in r.eigen$values) {
  print(r/sum(r.eigen$values))
  
}
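#The loop above can be written as one vectorized expression; for a correlation matrix the eigenvalues sum to the number of variables, so the proportions always sum to exactly 1. A self-contained check on synthetic data (fakeDat stands in for rDat, which isn't rebuilt here):

```r
set.seed(1)
fakeDat <- matrix(rnorm(300), ncol = 3)  # synthetic stand-in for rDat
ev <- eigen(cor(fakeDat))$values

# Vectorized equivalent of the for-loop above
props <- ev / sum(ev)

# Sanity checks: proportions sum to 1 and arrive in decreasing order
sum(props)
all(diff(props) <= 0)
```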
## [1] 0.1735757
## [1] 0.07028831
## [1] 0.06750148
## [1] 0.06325877
## [1] 0.05967773
## [1] 0.04353646
## [1] 0.04226411
## [1] 0.04068733
## [1] 0.03963922
## [1] 0.03889302
## [1] 0.03744872
## [1] 0.03601573
## [1] 0.03450615
## [1] 0.03405034
## [1] 0.03339357
## [1] 0.03233742
## [1] 0.03131863
## [1] 0.02612012
## [1] 0.02089231
## [1] 0.0185336
## [1] 0.01274802
## [1] 0.01151479
## [1] 0.01029506
## [1] 0.008130533
## [1] 0.007021581
## [1] 0.004773596
## [1] 0.001577705
#PCA (scaled)
quitters.pca.scaled <- prcomp(rDat,scale = TRUE)
#quitters.pca.scaled

#Summary of scaled 
summary(quitters.pca.scaled)
## Importance of components:
##                           PC1     PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.1648 1.37760 1.3500 1.30690 1.26937 1.08420 1.06824
## Proportion of Variance 0.1736 0.07029 0.0675 0.06326 0.05968 0.04354 0.04226
## Cumulative Proportion  0.1736 0.24386 0.3114 0.37462 0.43430 0.47784 0.52010
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.04812 1.03453 1.02475 1.00554 0.98612 0.96523 0.95883
## Proportion of Variance 0.04069 0.03964 0.03889 0.03745 0.03602 0.03451 0.03405
## Cumulative Proportion  0.56079 0.60043 0.63932 0.67677 0.71279 0.74729 0.78134
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.94954 0.93440 0.91957 0.83979 0.75106 0.70739 0.58668
## Proportion of Variance 0.03339 0.03234 0.03132 0.02612 0.02089 0.01853 0.01275
## Cumulative Proportion  0.81474 0.84707 0.87839 0.90451 0.92541 0.94394 0.95669
##                           PC22   PC23    PC24    PC25    PC26    PC27
## Standard deviation     0.55758 0.5272 0.46853 0.43541 0.35901 0.20639
## Proportion of Variance 0.01151 0.0103 0.00813 0.00702 0.00477 0.00158
## Cumulative Proportion  0.96820 0.9785 0.98663 0.99365 0.99842 1.00000
#Plot elbow plot to determine components required
plot(r.eigen$values, xlab="Eigenvalue Number", ylab="Eigenvalue Size", main="Scree Graph")
lines(r.eigen$values)
#Look at Visualized data
pca.plot.scaled <- autoplot(quitters.pca.scaled,data = PCAdat_factor,colour='Attrition',loadings=TRUE,loadings.colour="blue",loadings.label=TRUE)
pca.plot.scaled

#Check out a 3D plot to attempt to determine best components to use
#pca3d(quitters.pca.scaled,group=PCAdat_factor$Attrition,legend = "topleft",biplot = TRUE,biplot.vars = 3)


####Plots of PCs####

#First plot using autoplot
autoplot(quitters.pca.scaled,data = PCAdat_factor,colour='Attrition',loadings=TRUE,loadings.colour="blue",loadings.label=TRUE,x=1,y=2)

#Other plots using ggbiplot
#biplot with ellipse, pc1,2
p <- ggbiplot::ggbiplot(quitters.pca.scaled,ellipse = TRUE,circle = FALSE,groups = PCAdat_factor$Attrition,choices = 1:2,varname.size = 3,varname.adjust = 2,alpha = 0.5)+ggtitle("PCA of Attrition Dataset PC1,PC2")+theme_minimal()
p

####PC1 Interpretation####
#High positive loadings (around 0.4, the largest) on total working years, years at company, years in current role, years with current manager, monthly income, and job level
#This component seems to measure employee stability, and perhaps overall success at a given company
#This component is the primary PC and explains 17.36% of the variation in the data

####PC2 Interpretation####
#Large negative loadings on age and on the number of companies the employee has worked at; this component measures an employee's attrition in relation to age
#This component is the secondary PC and explains 7.0% of the variation in the data

#biplot without ellipse pc1,2
p2 <- ggbiplot::ggbiplot(quitters.pca.scaled,ellipse = FALSE,circle = FALSE,groups = PCAdat_factor$Attrition,choices = 1:2,varname.size = 3,varname.adjust = 2,alpha = 0.5)+ggtitle("PCA of Attrition Dataset PC1,PC2")+theme_minimal()
p2


#biplot with ellipse, pc2,3
p3 <- ggbiplot::ggbiplot(quitters.pca.scaled,ellipse = TRUE,circle = FALSE,groups = PCAdat_factor$Attrition,choices = 2:3,varname.size = 3,varname.adjust = 2,alpha = 0.5)+ggtitle("PCA of Attrition Dataset PC2,PC3")+theme_minimal()
p3

#biplot without ellipse, pc2,3
p4 <- ggbiplot::ggbiplot(quitters.pca.scaled,ellipse = FALSE,circle = FALSE,groups = PCAdat_factor$Attrition,choices = 2:3,varname.size = 3,varname.adjust = 2,alpha = 0.5)+ggtitle("PCA of Attrition Dataset PC2,PC3")+theme_minimal()
p4

####PC3 Interpretation####
#Very large negative loadings on percent salary hike and performance rating.  This component seems to measure an employee's recorded performance.  
#This component is the tertiary PC and explains 6.75% of the variation in the data, making it roughly as important as PC2

#biplot with ellipse, pc3,4
p5 <- ggbiplot::ggbiplot(quitters.pca.scaled,ellipse = TRUE,circle = FALSE,groups = PCAdat_factor$Attrition,choices = 3:4,varname.size = 3,varname.adjust = 2,alpha = 0.5)+ggtitle("PCA of Attrition Dataset PC3,PC4")+theme_minimal()
p5

#biplot without ellipse, pc3,4
p6 <- ggbiplot::ggbiplot(quitters.pca.scaled,ellipse = FALSE,circle = FALSE,groups = PCAdat_factor$Attrition,choices = 3:4,varname.size = 3,varname.adjust = 2,alpha = 0.5)+ggtitle("PCA of Attrition Dataset PC3,PC4")+theme_minimal()
p6

####PC4 Interpretation####
#Negative loadings on department, marital status, and performance rating.  This component appears to measure employees with roles that could potentially be all-consuming
#This component is the fourth PC and explains 6.33% of the variation in the data

#################################
#Note!
#These four PCs only explain 37.46% of the variation in the data, so I will use an additional 11 PCs for further analysis.  I have not included loading plots for these as the interpretation is too nuanced. 

#15 PCs capture roughly 81.5% of the variance in the data (0.1735757+0.07028831+0.06750148+0.06325877+0.05967773+0.04353646+0.04226411+0.04068733+0.03963922+0.03889302+0.03744872+0.03601573+0.03450615+0.03405034+0.03339357)
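#Rather than summing proportions by hand, cumsum() locates the number of components needed for any target threshold. A sketch using rounded versions of the proportions printed above:

```r
# Rounded eigenvalue proportions from the correlation-matrix PCA above
props <- c(0.1736, 0.0703, 0.0675, 0.0633, 0.0597, 0.0435, 0.0423,
           0.0407, 0.0396, 0.0389, 0.0374, 0.0360, 0.0345, 0.0341, 0.0334)

# Cumulative proportion of variance explained
cumvar <- cumsum(props)

# Smallest number of PCs whose cumulative proportion reaches 80%
nPC <- which(cumvar >= 0.80)[1]
nPC  # 15, matching the hand-computed total of ~81.5%
```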

#Useful line for sorting PC's
#sort(abs(quitters.pca.scaled$rotation[,2]),decreasing = TRUE)

# barplot(quitters.pca.scaled$rotation[,1], main="PC 1 Loadings Plot", las=2); abline(h=0.30)
####PCA concerning Attrition####
#prep data
pcaDatPlots <- pcaDat[,-c(17,22)]
pcaDatPlots <- pcaDatPlots[order(pcaDatPlots$Attrition),]
pcaDatPlots <- pcaDatPlots[,-c(2)]

#Perform PCA
quitters.pca.scaledPlots <- prcomp(pcaDatPlots,scale = TRUE)

s <- summary(quitters.pca.scaledPlots)
#Plot PC1 PC2
fviz_eig(quitters.pca.scaledPlots, addlabels=TRUE, hjust = -0.3,
               linecolor ="red",ncp = 15) + theme_minimal()

# Create groups
pch.group <- c(rep(21, times=730), rep(22, times=140))
col.group <- c(rep("gold", times=730), rep("skyblue2", times=140))

# Plot individuals
plot(quitters.pca.scaledPlots$x[,1], quitters.pca.scaledPlots$x[,2], xlab=paste("PCA 1 (", round(s$importance[2]*100, 1), "%)", sep = ""), ylab=paste("PCA 2 (", round(s$importance[5]*100, 1), "%)", sep = ""), pch=pch.group, col="black", bg=col.group, cex=1, las=1,asp=1)
# Add grid lines
abline(v=0, lty=2, col="grey50")
abline(h=0, lty=2, col="grey50")

# Add legend
legend("bottomleft", legend=c("No Attrition", "Attrition"), col="black", pt.bg=c("gold", "skyblue2"), pch=c(21, 22), pt.cex=1.5)

# Get individuals (observations) as a matrix
tab <- matrix(c(quitters.pca.scaledPlots$x[,1], quitters.pca.scaledPlots$x[,2]), ncol=2)
# Calculate correlations
c1 <- cor(tab[1:730,])   # rows 1:730 are the No-attrition group (matching the group sizes above)
c2 <- cor(tab[731:870,]) # rows 731:870 are the attrition group


# Load package
library(ellipse)
# Plot ellipse
polygon(ellipse(c1*(max(abs(quitters.pca.scaledPlots$rotation))*1), centre=colMeans(tab[1:730,]), level=0.95), col=adjustcolor("gold", alpha.f=0.25), border="gold")
polygon(ellipse(c2*(max(abs(quitters.pca.scaledPlots$rotation))*1), centre=colMeans(tab[731:870,]), level=0.95), col=adjustcolor("skyblue2", alpha.f=0.25), border="skyblue2")

# Create matrix of x coordinates (PC1) and multiply by 10
l.x <- cbind(quitters.pca.scaledPlots$rotation[,1][c(11, 13, 17, 12)]) * 10
# y coordinates (PC2)
l.y <- cbind(quitters.pca.scaledPlots$rotation[,2][c(11,13,17,12)]) * 10

# Add arrows to biplot
arrows(x0=0, x1=l.x, y0=0, y1=l.y, col="red", length=0.15, lwd=2)
# Labels
text(l.x, l.y, labels=rownames(l.x) , col="black", pos=c(2, 3, 4, 1), offset=1, cex=1.2)

###Interpretation of PC1 and PC2####
#The first component is positively correlated with total working years, years at company, job level, and monthly income.  This suggests these variables vary together: when one decreases, the others decrease as well.  The component could be considered a primary measure of career length and longevity, as well as a measure of success.

#The second component is mostly correlated with number of companies worked, and age, but in a negative direction.  Years with current manager and years in current role are correlated with the second component in the positive direction, which indicates that as number of companies and age decrease, years with current manager and years in current role increase.  

#Looking at the biplot, we can see the loadings that point in the direction of the attrition ellipse.  In terms of the first component, Job Role and Marital Status point more towards the attrition ellipse; the same holds for Job Satisfaction and Percent Salary Hike for the second component. 
# Plot PC2 PC3
plot(quitters.pca.scaledPlots$x[,2], quitters.pca.scaledPlots$x[,3], xlab=paste("PCA 2 (", round(s$importance[5]*100, 1), "%)", sep = ""), ylab=paste("PCA 3 (", round(s$importance[8]*100, 1), "%)", sep = ""), pch=pch.group, col="black", bg=col.group, cex=1, las=1,asp=0.5)
# Add grid lines
abline(v=0, lty=2, col="grey50")
abline(h=0, lty=2, col="grey50")

# Add legend
legend("bottomleft", legend=c("No Attrition", "Attrition"), col="black", pt.bg=c("gold", "skyblue2"), pch=c(21, 22), pt.cex=1.5)

# Get individuals (observations) as a matrix
tab <- matrix(c(quitters.pca.scaledPlots$x[,2], quitters.pca.scaledPlots$x[,3]), ncol=2)
# Calculate correlations
c1 <- cor(tab[1:730,])   # rows 1:730 are the No-attrition group
c2 <- cor(tab[731:870,]) # rows 731:870 are the attrition group


# Load package
library(ellipse)
# Plot ellipse
polygon(ellipse(c1*(max(abs(quitters.pca.scaledPlots$rotation))*1), centre=colMeans(tab[1:730,]), level=0.95), col=adjustcolor("gold", alpha.f=0.25), border="gold")
polygon(ellipse(c2*(max(abs(quitters.pca.scaledPlots$rotation))*1), centre=colMeans(tab[731:870,]), level=0.95), col=adjustcolor("skyblue2", alpha.f=0.25), border="skyblue2")

# Create matrix of x coordinates (PC2) and multiply by 10
l.x <- cbind(quitters.pca.scaledPlots$rotation[,2][c(13, 26, 3, 11)]) * 10
# y coordinates (PC3)
l.y <- cbind(quitters.pca.scaledPlots$rotation[,3][c(13,26,3,11)]) * 10

# Add arrows to biplot
arrows(x0=0, x1=l.x, y0=0, y1=l.y, col="red", length=0.15, lwd=3)
# Labels
text(l.x, l.y, labels=rownames(l.x) , col="black", pos=c(2, 3, 1, 4), offset=1, cex=1.2)

###Interpretation of PC3####

#The third component is most correlated with performance rating and percent salary hike, both in a negative direction.  Department and job role are correlated with the third component in a positive direction, suggesting that as performance rating and percent salary hike decrease, department and job role increase.

#Looking at the biplot, we can see the loadings that point in the direction of the attrition ellipse.  In terms of the third component, department and job role point more towards the attrition ellipse; this is the second biplot in which job role has pointed towards attrition, suggesting it may be a variable of interest concerning attrition.
# Plot PC3 PC4
plot(quitters.pca.scaledPlots$x[,3], quitters.pca.scaledPlots$x[,4], xlab=paste("PCA 3 (", round(s$importance[8]*100, 1), "%)", sep = ""), ylab=paste("PCA 4 (", round(s$importance[11]*100, 1), "%)", sep = ""), pch=pch.group, col="black", bg=col.group, cex=1, las=1,asp=1)
# Add grid lines
abline(v=0, lty=2, col="grey50")
abline(h=0, lty=2, col="grey50")

# Add legend
legend("topleft", legend=c("No Attrition", "Attrition"), col="black", pt.bg=c("gold", "skyblue2"), pch=c(21, 22), pt.cex=1.5)

# Get individuals (observations) as a matrix
tab <- matrix(c(quitters.pca.scaledPlots$x[,3], quitters.pca.scaledPlots$x[,4]), ncol=2)
# Calculate correlations
c1 <- cor(tab[1:730,])   # rows 1:730 are the No-attrition group
c2 <- cor(tab[731:870,]) # rows 731:870 are the attrition group


# Load package
library(ellipse)
# Plot ellipse
polygon(ellipse(c1*(max(abs(quitters.pca.scaledPlots$rotation))*1), centre=colMeans(tab[1:730,]), level=0.95), col=adjustcolor("gold", alpha.f=0.25), border="gold")
polygon(ellipse(c2*(max(abs(quitters.pca.scaledPlots$rotation))*1), centre=colMeans(tab[731:870,]), level=0.95), col=adjustcolor("skyblue2", alpha.f=0.25), border="skyblue2")

# Create matrix of x coordinates (PC3) and multiply by 10
l.x <- cbind(quitters.pca.scaledPlots$rotation[,3][c(14, 3, 13, 1)]) * 10
# y coordinates (PC4)
l.y <- cbind(quitters.pca.scaledPlots$rotation[,4][c(14, 3, 13, 1)]) * 10

# Add arrows to biplot
arrows(x0=0, x1=l.x, y0=0, y1=l.y, col="red", length=0.15, lwd=3)
# Labels
text(l.x, l.y, labels=rownames(l.x) , col="black", pos=c(1, 3, 2, 3), offset=1, cex=1.2)

###Interpretation of PC4####

#The fourth component is most correlated with stock option level.  Marital status and percent salary hike are correlated with the fourth component, but in a negative direction.  This suggests that as stock option level increases, marital status and percent salary hike decrease.  

#Looking at the biplot, we can see the loadings that point in the direction of the attrition ellipse.  In terms of the fourth component, marital status and age point more towards the attrition ellipse; this is the second biplot in which marital status has pointed towards attrition, suggesting it may be another variable of interest concerning attrition.
####Loadings Plots####
#PC1
c.pc1 <- ifelse(quitters.pca.scaledPlots$rotation[,1] > 0, yes="green2", no="red2")
n.pc1 <- ifelse(quitters.pca.scaledPlots$rotation[,1] > 0, yes=-0.01, no=quitters.pca.scaledPlots$rotation[,1]-0.01)
par(mar=c(8,3,2,1)) # Set margins
b1 <- barplot(quitters.pca.scaledPlots$rotation[,1], main="PC 1 Loadings Plot", col=c.pc1, las=2, axisnames=FALSE,ylim = c(-0.3,0.5))
abline(h=0) # Add horizontal line
text(x=b1, y=n.pc1, labels=names(quitters.pca.scaledPlots$rotation[,1]), adj=1, srt=90, xpd=TRUE) # Add variable names

#PC2
c.pc2 <- ifelse(quitters.pca.scaledPlots$rotation[,2] > 0, yes="green2", no="red2")
n.pc2 <- ifelse(quitters.pca.scaledPlots$rotation[,2] > 0, yes=-0.01, no=quitters.pca.scaledPlots$rotation[,2]-0.01)
par(mar=c(8,3,2,1)) # Set margins
b2 <- barplot(quitters.pca.scaledPlots$rotation[,2], main="PC 2 Loadings Plot", col=c.pc2, las=2, axisnames=FALSE)
abline(h=0) # Add horizontal line
text(x=b2, y=n.pc2, labels=names(quitters.pca.scaledPlots$rotation[,2]), adj=1, srt=90, xpd=TRUE) # Add variable names

#PC3
c.pc3 <- ifelse(quitters.pca.scaledPlots$rotation[,3] > 0, yes="green2", no="red2")
n.pc3 <- ifelse(quitters.pca.scaledPlots$rotation[,3] > 0, yes=-0.01, no=quitters.pca.scaledPlots$rotation[,3]-0.01)
par(mar=c(8,3,2,1)) # Set margins
b3 <- barplot(quitters.pca.scaledPlots$rotation[,3], main="PC 3 Loadings Plot", col=c.pc3, las=2, axisnames=FALSE,ylim = c(-0.6,0.6))
abline(h=0) # Add horizontal line
text(x=b3, y=n.pc3, labels=names(quitters.pca.scaledPlots$rotation[,3]), adj=1, srt=90, xpd=TRUE) # Add variable names

#PC4
c.pc4 <- ifelse(quitters.pca.scaledPlots$rotation[,4] > 0, yes="green2", no="red2")
n.pc4 <- ifelse(quitters.pca.scaledPlots$rotation[,4] > 0, yes=-0.01, no=quitters.pca.scaledPlots$rotation[,4]-0.01)
par(mar=c(8,3,2,1)) # Set margins
b4 <- barplot(quitters.pca.scaledPlots$rotation[,4], main="PC 4 Loadings Plot", col=c.pc4, las=2, axisnames=FALSE,ylim = c(-0.4,0.6))
abline(h=0) # Add horizontal line
text(x=b4, y=n.pc4, labels=names(quitters.pca.scaledPlots$rotation[,4]), adj=1, srt=90, xpd=TRUE) # Add variable names

#PC5
c.pc5 <- ifelse(quitters.pca.scaledPlots$rotation[,5] > 0, yes="green2", no="red2")
n.pc5 <- ifelse(quitters.pca.scaledPlots$rotation[,5] > 0, yes=-0.01, no=quitters.pca.scaledPlots$rotation[,5]-0.01)
par(mar=c(8,3,2,1)) # Set margins
b5 <- barplot(quitters.pca.scaledPlots$rotation[,5], main="PC 5 Loadings Plot", col=c.pc5, las=2, axisnames=FALSE,ylim = c(-0.6,0.6))
abline(h=0) # Add horizontal line
text(x=b5, y=n.pc5, labels=names(quitters.pca.scaledPlots$rotation[,5]), adj=1, srt=90, xpd=TRUE) # Add variable names

#PC6
c.pc6 <- ifelse(quitters.pca.scaledPlots$rotation[,6] > 0, yes="green2", no="red2")
n.pc6 <- ifelse(quitters.pca.scaledPlots$rotation[,6] > 0, yes=-0.01, no=quitters.pca.scaledPlots$rotation[,6]-0.01)
par(mar=c(8,3,2,1)) # Set margins
b6 <- barplot(quitters.pca.scaledPlots$rotation[,6], main="PC 6 Loadings Plot", col=c.pc6, las=2, axisnames=FALSE,ylim = c(-0.4,0.6))
abline(h=0) # Add horizontal line
text(x=b6, y=n.pc6, labels=names(quitters.pca.scaledPlots$rotation[,6]), adj=1, srt=90, xpd=TRUE) # Add variable names

####PCA concerning Job Role Trends####
#prep data
pcaDatPlotsRoles <- pcaDat[,-c(17,22)]
pcaDatPlotsRoles <- pcaDatPlotsRoles[order(pcaDatPlotsRoles$JobRole),]
pcaDatPlotsRoles <- pcaDatPlotsRoles[,-c(12)]

#Perform PCA
rolePCA <- prcomp(pcaDatPlotsRoles,scale = TRUE,center = TRUE)

#Save summary object
s <- summary(rolePCA)
#s$rotation
#Scree plot
fviz_eig(rolePCA, addlabels=TRUE, hjust = -0.3,
               linecolor ="red",ncp = 15) + theme_minimal()

# Create groups
pch.group <- c(rep(25, times=76), rep(15, times=27), rep(21, times=153), rep(22, times=51), rep(23, times=87), rep(24, times=51), rep(25, times=172), rep(21, times=200), rep(22, times=53))
col.group <- c(rep("green", times=76), rep("brown", times=27), rep("red", times=153), rep("purple", times=51), rep("cyan", times=87), rep("gold", times=51), rep("coral", times=172), rep("deeppink", times=200), rep("midnightblue", times=53))




# Plot individuals
plot(rolePCA$x[,1], rolePCA$x[,2], xlab=paste("PCA 1 (", round(s$importance[2]*100, 1), "%)", sep = ""), ylab=paste("PCA 2 (", round(s$importance[5]*100, 1), "%)", sep = ""), pch=pch.group, col="black", bg=col.group, cex=1, las=1,asp=1)
# Add grid lines
abline(v=0, lty=2, col="grey50")
abline(h=0, lty=2, col="grey50")


# Add legend
legend("topleft", legend=c("Healthcare Representative", "Human Resources","Laboratory Technician","Manager","Manufacturing Director","Research Director","Research Scientist","Sales Executive","Sales Representative"), col="black", pt.bg=c("green", "brown","red","purple","cyan","gold","coral","deeppink","midnightblue"), pch=c(25,15,21,22,23,24,25, 21,22), pt.cex=1.5,cex = 0.5)

# Get individuals (observations) as a matrix
tab <- matrix(c(rolePCA$x[,1], rolePCA$x[,2]), ncol=2)
# Calculate correlations
# Row ranges follow the cumulative job-role group sizes defined above
c1 <- cor(tab[1:76,])
c2 <- cor(tab[77:103,])
c3 <- cor(tab[104:256,])
c4 <- cor(tab[257:307,])
c5 <- cor(tab[308:394,])
c6 <- cor(tab[395:445,])
c7 <- cor(tab[446:617,])
c8 <- cor(tab[618:817,])
c9 <- cor(tab[818:870,])

# Load package
library(ellipse)
# Plot ellipse
polygon(ellipse(c1*max(abs(rolePCA$rotation)), centre=colMeans(tab[1:76,]), level=0.95), col=adjustcolor("green", alpha.f=0.25), border="green")
polygon(ellipse(c2*max(abs(rolePCA$rotation)), centre=colMeans(tab[77:103,]), level=0.95), col=adjustcolor("brown", alpha.f=0.25), border="brown")
polygon(ellipse(c3*max(abs(rolePCA$rotation)), centre=colMeans(tab[104:256,]), level=0.95), col=adjustcolor("red", alpha.f=0.25), border="red")
polygon(ellipse(c4*max(abs(rolePCA$rotation)), centre=colMeans(tab[257:307,]), level=0.95), col=adjustcolor("purple", alpha.f=0.25), border="purple")
polygon(ellipse(c5*max(abs(rolePCA$rotation)), centre=colMeans(tab[308:394,]), level=0.95), col=adjustcolor("cyan", alpha.f=0.25), border="cyan")
polygon(ellipse(c6*max(abs(rolePCA$rotation)), centre=colMeans(tab[395:445,]), level=0.95), col=adjustcolor("gold", alpha.f=0.25), border="gold")
polygon(ellipse(c7*max(abs(rolePCA$rotation)), centre=colMeans(tab[446:617,]), level=0.95), col=adjustcolor("coral", alpha.f=0.25), border="coral")
polygon(ellipse(c8*max(abs(rolePCA$rotation)), centre=colMeans(tab[618:817,]), level=0.95), col=adjustcolor("deeppink", alpha.f=0.25), border="deeppink")
polygon(ellipse(c9*max(abs(rolePCA$rotation)), centre=colMeans(tab[818:870,]), level=0.95), col=adjustcolor("midnightblue", alpha.f=0.25), border="midnightblue")

which.max(rolePCA$rotation[,1])
## Attrition 
##         2
which.min(rolePCA$rotation[,1])
## TotalWorkingYears 
##                21
which.max(rolePCA$rotation[,2])
## NumCompaniesWorked 
##                 15
which.min(rolePCA$rotation[,2])
## YearsWithCurrManager 
##                   27
sort((rolePCA$rotation[,1]),decreasing = FALSE)
##        TotalWorkingYears           YearsAtCompany                 JobLevel 
##            -0.4039250326            -0.3863095225            -0.3750829813 
##            MonthlyIncome       YearsInCurrentRole     YearsWithCurrManager 
##            -0.3665270508            -0.3387270859            -0.3251883192 
##  YearsSinceLastPromotion                      Age                Education 
##            -0.2944198338            -0.2658002123            -0.0873146776 
##       NumCompaniesWorked         StockOptionLevel          WorkLifeBalance 
##            -0.0419318788            -0.0409099066            -0.0254835793 
##               Department           BusinessTravel           JobInvolvement 
##            -0.0218980340            -0.0080141378            -0.0015671711 
## RelationshipSatisfaction  EnvironmentSatisfaction         DistanceFromHome 
##             0.0005975655             0.0032674279             0.0052492349 
##          JobSatisfaction           EducationField    TrainingTimesLastYear 
##             0.0145389198             0.0154244852             0.0184668393 
##                 OverTime        PerformanceRating        PercentSalaryHike 
##             0.0272559299             0.0319650798             0.0464420470 
##                   Gender            MaritalStatus                Attrition 
##             0.0518093629             0.0571818467             0.1086653341
# Create matrix of x coordinates (PC1) and multiply by 10
l.x <- cbind(rolePCA$rotation[,1][c(2, 21,15,27)]) * 10
# y coordinates (PC2)
l.y <- cbind(rolePCA$rotation[,2][c(2, 21,15,27)]) * 10

# Add arrows to biplot
arrows(x0=0, x1=l.x, y0=0, y1=l.y, col="red", length=0.15, lwd=2)
# Labels
text(l.x, l.y, labels=rownames(l.x) , col="black", pos=c(2, 3, 4, 1), offset=1, cex=1.2)

#Analysis using Naive Bayes and Regularized Discriminant Analysis

###############################################################################
#Printing functions
#Note: printALL expects data frames named `train` and `test` (each containing an Attrition column) to exist in the calling environment.
printALL=function(model){
  trainPred=predict(model, newdata = train, type = "class")
  trainTable=table(train$Attrition, trainPred)
  testPred=predict(model, newdata=test, type="class")
  testTable=table(test$Attrition, testPred)
  trainAcc=(trainTable[1,1]+trainTable[2,2])/sum(trainTable)
  testAcc=(testTable[1,1]+testTable[2,2])/sum(testTable)
  message("Contingency Table for Training Data")
  print(trainTable)
  message("Contingency Table for Test Data")
  print(testTable)
  message("Accuracy")
  print(round(cbind(trainAccuracy=trainAcc, testAccuracy=testAcc),3))
}
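#printALL() derives accuracy from the diagonal of a 2x2 contingency table. The computation can be checked in isolation; the counts below are hypothetical (they happen to mirror the RDA confusion matrix reported later in this file):

```r
# Hypothetical 2x2 contingency table: rows = reference, columns = prediction
tab <- matrix(c(153, 25, 29, 153), nrow = 2,
              dimnames = list(Reference = c("No", "Yes"),
                              Prediction = c("No", "Yes")))

# Accuracy = correctly classified / total, as computed inside printALL
acc <- (tab[1, 1] + tab[2, 2]) / sum(tab)

# Equivalent form that also generalizes beyond two classes
acc2 <- sum(diag(tab)) / sum(tab)
acc  # 0.85
```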

ggplotConfusionMatrix <- function(m){
  mytitle <- paste("Accuracy", percent_format()(m$overall[1]),
                   "Sensitivity",percent_format()(m$byClass[1]),"Specificity",percent_format()(m$byClass[2]))
  p <-
    ggplot(data = as.data.frame(m$table) ,
           aes(x = Prediction, y = Reference)) +
    geom_tile(aes(fill = log(Freq)), colour = "white") +
    scale_fill_gradient(low = "white", high = "steelblue") +
    geom_text(aes(x = Prediction, y = Reference, label = Freq)) +
    theme(legend.position = "none") +
    ggtitle(mytitle)
  return(p)
}

###############################################################################

#Read in primary data ("CaseStudy2-data.txt")
primaryData <- read.table("/Users/benjamingoodwin/Desktop/casestudy2DDS/CaseStudy2DDS/CaseStudy2-data.txt",sep = ",",header = TRUE)

#Convert all CHR variables to factors to examine pairs data
primaryData_factor <- primaryData %>%
  mutate_if(sapply(primaryData, is.character), as.factor)


#Convert all factor variables to integers for the covariance computations
primaryData_int <- primaryData_factor %>%
  mutate_if(sapply(primaryData_factor, is.factor), as.integer)

nnDat <- primaryData_int[,-c(1,5,10,11,14,21,23,24,28)]

#Over sample to correct for imbalance
nnDat <- oversample(nnDat,classAttr = "Attrition",method = "ADASYN")


nnDat$Attrition <- (ifelse(nnDat$Attrition==1,"No","Yes"))
nnDat$Attrition <- as.factor(nnDat$Attrition)
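#Oversampling is applied above because the raw attrition classes are heavily imbalanced (roughly 730 No to 140 Yes). The imbalance is easy to quantify with prop.table(); a small self-contained sketch mimicking the raw split:

```r
# Synthetic outcome vector mimicking the raw attrition split (730 No / 140 Yes)
attrition <- factor(c(rep("No", 730), rep("Yes", 140)))

counts <- table(attrition)
props  <- prop.table(counts)

# The minority class is only ~16% of observations before oversampling;
# after a successful ADASYN oversample the split should approach 50/50.
round(props[["Yes"]], 3)  # 0.161
```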

set.seed(107)


inTrain <- createDataPartition(y=nnDat$Attrition,p=0.75,list = FALSE)
class(nnDat$Attrition)
## [1] "factor"
levels(nnDat$Attrition)
## [1] "No"  "Yes"
training <- nnDat[inTrain,]
testing <- nnDat[-inTrain,]

ctrl <- trainControl(method = "repeatedcv", repeats = 3,classProbs = TRUE, summaryFunction = twoClassSummary)

noAttDs <- read.table("/Users/benjamingoodwin/Desktop/casestudy2DDS/CaseStudy2DDS/noAttrition.txt",sep = ",",header = TRUE)

nnDat1 <- noAttDs
#Remove unneeded cols
#noAttDs <- noAttDs[,-c(4,9,10,13,20,22,23,27)]

#Convert all CHR variables to factors to examine pairs data
nnDat1 <- nnDat1 %>%
  mutate_if(sapply(nnDat1, is.character), as.factor)




#Make factors
noAttDs <- noAttDs %>%
  mutate_if(sapply(noAttDs, is.integer), as.factor)
#Remove unneeded cols
noAttDs <- noAttDs[,-c(4,9,10,13,20,22,23,27)]


#Convert all factor variables to integers for the covariance computations
nnDat1 <- nnDat1 %>%
  mutate_if(sapply(nnDat1, is.factor), as.integer)

rdaGrid = data.frame(gamma = (0:4)/4, lambda = 3/4)
set.seed(123)
rdaFit <- train(Attrition ~ .,data = training,method = "rda",tuneGrid = rdaGrid,trControl = ctrl,metric = "ROC")
rdaFit
## Regularized Discriminant Analysis 
## 
## 1085 samples
##   26 predictor
##    2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 977, 976, 976, 976, 976, 977, ... 
## Resampling results across tuning parameters:
## 
##   gamma  ROC        Sens       Spec     
##   0.00   0.9217914  0.8388552  0.8566504
##   0.25   0.6236483  0.0918743  0.9701957
##   0.50   0.6236371  0.1368799  0.9552644
##   0.75   0.6236144  0.2146801  0.8590496
##   1.00   0.6237277  0.3496745  0.7529234
## 
## Tuning parameter 'lambda' was held constant at a value of 0.75
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0 and lambda = 0.75.
rdaClasses <- predict(rdaFit, newdata = testing)
cfm <- confusionMatrix(rdaClasses, testing$Attrition)
cfm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  153  25
##        Yes  29 153
##                                           
##                Accuracy : 0.85            
##                  95% CI : (0.8089, 0.8853)
##     No Information Rate : 0.5056          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7             
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.8407          
##             Specificity : 0.8596          
##          Pos Pred Value : 0.8596          
##          Neg Pred Value : 0.8407          
##              Prevalence : 0.5056          
##          Detection Rate : 0.4250          
##    Detection Prevalence : 0.4944          
##       Balanced Accuracy : 0.8501          
##                                           
##        'Positive' Class : No              
## 
ggplotConfusionMatrix(cfm)
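# As a sanity check, the headline statistics can be recomputed by hand from the counts above (rows are predictions, columns are reference labels; caret treats "No" as the positive class):

```r
#Rebuild the confusion matrix from the printed counts
cm <- matrix(c(153, 29, 25, 153), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))
accuracy    <- sum(diag(cm)) / sum(cm)             #306/360  = 0.85
sensitivity <- cm["No", "No"]   / sum(cm[, "No"])  #153/182 ~= 0.8407
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"]) #153/178 ~= 0.8596
```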

#Predict attrition labels for the unlabeled employees
other1 <- predict(rdaFit, newdata = nnDat1, type = "raw")

newDf <- noAttDs
newDf$Attrition = other1

#Keep only ID and the predicted Attrition label
newDf <- newDf[,c(1,28)]

write.csv(newDf, "~/Documents/Case2PredictionsGoodwinAttrition.csv", row.names = FALSE)


################################################################################

####Model 2: Naive Bayes with oversampling####

##Data prep for NB
#Read in primary data ("CaseStudy2-data.txt")
primaryData <- read.table("/Users/benjamingoodwin/Desktop/casestudy2DDS/CaseStudy2DDS/CaseStudy2-data.txt",sep = ",",header = TRUE)

#Convert all CHR variables to factors to examine pairs data
primaryData_factor <- primaryData %>%
  mutate_if(sapply(primaryData, is.character), as.factor)


#Convert the factor variables to integers
primaryData_int <- primaryData_factor %>%
  mutate_if(sapply(primaryData_factor, is.factor), as.integer)

#Drop ID and the columns not used for modeling
nnDat <- primaryData_int[,-c(1,5,10,11,14,21,23,24,28)]

#Create the DF used for oversampling (same column subset)
dataAllFact = nnDat

#Oversample with ADASYN to correct the class imbalance
newDataAllFact <- oversample(dataAllFact, classAttr = "Attrition", method = "ADASYN")

#Oversample changes all to numeric, revert to factor
newDataAllFactor <- newDataAllFact %>%
  mutate_if(sapply(newDataAllFact, is.integer), as.factor)

#Recode the numeric Attrition coding back to No/Yes labels
newDataAllFactor$Attrition <- ifelse(newDataAllFactor$Attrition == 1, "No", "Yes")
newDataAllFactor$Attrition <- as.factor(newDataAllFactor$Attrition)

#Check out the imbalance ratio
imbalanceRatio(newDataAllFactor,"Attrition")
## [1] 0.9794521
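# The imbalance ratio is simply the minority-class count divided by the majority-class count. Back-computing from the printed value, the oversampled set holds roughly 730 No to 715 Yes rows (an inference; the exact counts are not shown above):

```r
#Imbalance ratio = minority count / majority count; ~0.98 means the
#ADASYN output is nearly balanced (the raw data sat near 0.19)
counts <- c(No = 730, Yes = 715)
ratio  <- min(counts) / max(counts)
round(ratio, 7)  #0.9794521
```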
#Fit and evaluate the Naive Bayes model
#Set seed to easily duplicate results
set.seed(7267166)

#Split training and testing data
trainIndex=createDataPartition(newDataAllFactor$Attrition, p=0.70)$Resample1
train=newDataAllFactor[trainIndex, ]
test=newDataAllFactor[-trainIndex, ]

#Create classifier
NBclassfier=naiveBayes(Attrition~., data=train)

#Print contingency tables and accuracy via the custom printALL() helper
printALL(NBclassfier)
## Contingency Table for Training Data
##      trainPred
##        No Yes
##   No  475  36
##   Yes  63 438
## Contingency Table for Test Data
##      testPred
##        No Yes
##   No  191  28
##   Yes  36 178
## Accuracy
##      trainAccuracy testAccuracy
## [1,]         0.902        0.852
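# Those accuracies follow directly from the contingency tables, where correct classifications sit on the diagonal:

```r
#Recompute printALL's accuracy figures from the tables above
train_acc <- (475 + 438) / (475 + 36 + 63 + 438)  #913/1012 ~= 0.902
test_acc  <- (191 + 178) / (191 + 28 + 36 + 178)  #369/433  ~= 0.852
```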
#Generate the confusion matrix on the test set
nb_train_predict <- predict(NBclassfier,test[,names(test) !="Attrition"])
cfm <- confusionMatrix(nb_train_predict,test$Attrition)
cfm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  191  36
##        Yes  28 178
##                                           
##                Accuracy : 0.8522          
##                  95% CI : (0.8152, 0.8843)
##     No Information Rate : 0.5058          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7042          
##                                           
##  Mcnemar's Test P-Value : 0.3816          
##                                           
##             Sensitivity : 0.8721          
##             Specificity : 0.8318          
##          Pos Pred Value : 0.8414          
##          Neg Pred Value : 0.8641          
##              Prevalence : 0.5058          
##          Detection Rate : 0.4411          
##    Detection Prevalence : 0.5242          
##       Balanced Accuracy : 0.8520          
##                                           
##        'Positive' Class : No              
## 
#Visualize the confusion matrix
ggplotConfusionMatrix(cfm)

###############################################################################
#Make predictions with the NB model
noAttDs <- read.table("/Users/benjamingoodwin/Desktop/casestudy2DDS/CaseStudy2DDS/noAttrition.txt",sep = ",",header = TRUE)
#Make factors
noAttDs <- noAttDs %>%
  mutate_if(sapply(noAttDs, is.integer), as.factor)
#Remove unneeded cols
noAttDs <- noAttDs[,-c(4,9,10,13,20,22,23,27)]



nb_train_predictNewDS <- predict(NBclassfier,noAttDs)
table(nb_train_predictNewDS)
## nb_train_predictNewDS
##  No Yes 
## 231  69
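# In other words, the Naive Bayes model flags 69 of the 300 unlabeled employees (23%) as attrition risks:

```r
pred_counts <- c(No = 231, Yes = 69)
prop.table(pred_counts)  #0.77 No / 0.23 Yes
```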
newDf <- noAttDs
newDf$Attrition = nb_train_predictNewDS

#Keep only ID and the predicted Attrition label
newDf <- newDf[,c(1,28)]
head(newDf)
##     ID Attrition
## 1 1171        No
## 2 1172        No
## 3 1173       Yes
## 4 1174        No
## 5 1175        No
## 6 1176        No
primaryData <- read.table("/Users/benjamingoodwin/Desktop/casestudy2DDS/CaseStudy2DDS/CaseStudy2-data.txt",sep = ",",header = TRUE)
salaryDatNoinfo <- compsetNoSalary

#Keep the five variables used in the salary model: JobLevel, MonthlyIncome,
#TotalWorkingYears, DistanceFromHome, and YearsWithCurrManager
salaryDat <- primaryData[,c(16,20,30,7,36)]



salaryDatlm <- lm(MonthlyIncome~JobLevel+TotalWorkingYears+YearsWithCurrManager+DistanceFromHome,data = salaryDat)
summary(salaryDatlm)
## 
## Call:
## lm(formula = MonthlyIncome ~ JobLevel + TotalWorkingYears + YearsWithCurrManager + 
##     DistanceFromHome, data = salaryDat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5775.8  -859.8    21.0   727.6  4022.3 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1559.495    114.362 -13.637  < 2e-16 ***
## JobLevel              3722.700     68.403  54.423  < 2e-16 ***
## TotalWorkingYears       67.981     10.404   6.534 1.09e-10 ***
## YearsWithCurrManager   -60.215     14.690  -4.099 4.54e-05 ***
## DistanceFromHome       -15.335      5.727  -2.677  0.00756 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1373 on 865 degrees of freedom
## Multiple R-squared:  0.9113, Adjusted R-squared:  0.9108 
## F-statistic:  2221 on 4 and 865 DF,  p-value: < 2.2e-16
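# To make the coefficient table concrete: each job level adds about $3,723 of predicted monthly income, while each mile of commute subtracts about $15. For a hypothetical employee (inputs invented for illustration) at JobLevel 2 with 10 total working years, 3 years under the current manager, and a 5-mile commute:

```r
#Prediction is the dot product of the fitted coefficients and the inputs
coefs <- c(-1559.495, 3722.700, 67.981, -60.215, -15.335)
x     <- c(1, 2, 10, 3, 5)  #intercept, JobLevel, TotalWorkingYears,
                            #YearsWithCurrManager, DistanceFromHome
sum(coefs * x)  #6308.395, i.e. ~$6,308/month
```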
#Hand-compute fit diagnostics from the residuals
RSS <- c(crossprod(salaryDatlm$residuals))   #residual sum of squares
MSE <- RSS / length(salaryDatlm$residuals)   #mean squared error
RMSE <- sqrt(MSE)                            #root mean squared error
sig2 <- RSS / salaryDatlm$df.residual        #unbiased estimate of sigma^2
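# Note the distinction between these quantities: summary() reports the residual standard error (RSS divided by the residual degrees of freedom), while RMSE divides RSS by n, so RMSE comes out slightly smaller. Using the figures printed above:

```r
#sigma^2 = RSS / df.residual, whereas MSE = RSS / n
sigma  <- 1373  #residual standard error reported by summary()
df_res <- 865   #residual degrees of freedom
n      <- 870   #observations (df_res + 5 estimated coefficients)
sqrt(sigma^2 * df_res / n)  #RMSE ~= 1369, just under sigma
```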



##########EDA for Salary##########
salaryDatEDA <- primaryData[,-c(1,5,10,11,14,21,23,24,28)]
samp <- createDataPartition(salaryDatEDA$MonthlyIncome,p=0.8,list = FALSE)
training <- salaryDatEDA[samp,]
testing <- salaryDatEDA[-samp,]

tc <- trainControl(method = "cv", number = 10)
#The argument name is trControl; passing trainControl= instead is silently
#ignored by train(), which then falls back to its default bootstrap resampling
lm1_cv <- train(MonthlyIncome ~ ., data = salaryDatEDA, method = "lm", trControl = tc)
lm1_cv
## Linear Regression 
## 
## 870 samples
##  26 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 870, 870, 870, 870, 870, 870, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1114.146  0.9406236  846.0561
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
ggplot(varImp(lm1_cv))

predictedSal <- predict(salaryDatlm,salaryDatNoinfo)
#Summarize the distribution of predicted monthly salaries (individual
#predictions ranged from roughly $1,788 to $19,261)
summary(predictedSal)
newDf <- salaryDatNoinfo
newDf$MonthlyIncome = predictedSal

#Keep only ID and the predicted MonthlyIncome
newDf <- newDf[,c(1,36)]

write.csv(newDf, "~/Documents/Case2PredictionsGoodwinSalary.csv", row.names = FALSE)