Caret has long been the go-to package for machine learning in R, but it was never quite as standardized as its Python counterpart, scikit-learn. With tidymodels, that is about to change, with caret developer Max Kuhn spearheading the project. In this post, I will run through the iris dataset to showcase the simplicity of tidymodels.
Loading libraries
Nothing special here, except that I will be using the promising DataExplorer and skimr packages for quick exploration.
library(tidyverse) # For Data wrangling
library(tidymodels) # For modeling
library(furrr) # For parallel processing
library(vip) # For variable importance
library(DataExplorer) # For quick exploration
library(skimr) # For quick exploration
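All of these packages are on CRAN, so if anything is missing a one-off install is enough:
# One-time install of the packages used in this post
install.packages(c("tidyverse", "tidymodels", "furrr", "vip",
                   "DataExplorer", "skimr"))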
Data
A quick summary of the data shows nothing unusual.
dataset<-iris
skim(dataset)
Name | dataset |
---|---|
Number of rows | 150 |
Number of columns | 5 |
Column type frequency: | |
factor | 1 |
numeric | 4 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sepal.Length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |
Sepal.Width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |
Petal.Length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |
Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |
EDA
There are no missing values, but the correlation plot reveals quite a high correlation between petal length and petal width. We will discard one of them during the pre-processing phase.
plot_missing(dataset)
plot_histogram(dataset)
plot_correlation(dataset%>%select(-Species))
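To put a number on that relationship, a quick sketch of the pairwise correlations using base cor():
# Pairwise correlations among the numeric predictors;
# Petal.Length and Petal.Width sit around 0.96
dataset %>%
  select(-Species) %>%
  cor() %>%
  round(2)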
Split Data
Next, we split the data into train and test sets. Validation will be done with 5-fold cross-validation of the training set, using stratified sampling on the Species column.
set.seed(111)
splits<-initial_split(dataset,prop=0.75)
splits
## <Analysis/Assess/Total>
## <113/37/150>
train<-training(splits)
test<-testing(splits)
cv_splits<-vfold_cv(train,v=5,strata='Species')
cv_splits
## # 5-fold cross-validation using stratification
## # A tibble: 5 x 2
## splits id
## <list> <chr>
## 1 <split [90/23]> Fold1
## 2 <split [90/23]> Fold2
## 3 <split [90/23]> Fold3
## 4 <split [90/23]> Fold4
## 5 <split [92/21]> Fold5
Preprocess with recipe
I will be doing minimal pre-processing, with a correlation filter to discard one of any pair of predictors with a correlation of 0.9 or above. Although normalizing is not necessary for tree-based algorithms like random forest, I am using it so that I can reuse the same flow for other algorithms too.
rec<-recipe(Species~.,data=train)%>%
step_corr(all_predictors(),threshold = 0.9)%>%
step_normalize(all_numeric())
rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 4
##
## Operations:
##
## Correlation filter on all_predictors()
## Centering and scaling for all_numeric()
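Exactly which predictor the correlation filter drops is only decided once the recipe is trained on the data. A quick sketch, using prep() and juice(), to peek at the processed training set (on this split, Petal.Length is the likely casualty, since it is highly correlated with both Petal.Width and Sepal.Length):
# Train the recipe on the training set and inspect the columns that survive
rec %>%
  prep(training = train) %>%
  juice() %>%   # the processed training set
  glimpse()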
Define model
Next, we define a random forest model using the ranger package as the engine. We keep three parameters for hyper-parameter optimization:
mtry: The number of predictors that will be randomly sampled at each split when creating the tree models.
trees: The number of trees contained in the ensemble.
min_n: The minimum number of data points in a node that are required for the node to be split further.
rf_spec<-rand_forest()%>%
set_engine("ranger")%>%
set_mode("classification")%>%
set_args(mtry=tune(),trees=tune(),min_n=tune())
rf_spec
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
##
## Computational engine: ranger
Putting it into a workflow
Next, we put all this into a workflow, which combines pre-processing, modelling and post-processing into a flexible chain.
rf_wf<-workflow()%>%
add_recipe(rec)%>%
add_model(rf_spec)
rf_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_corr()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
##
## Computational engine: ranger
Hyper-parameter grid
After that, we let tune_grid() explore the hyper-parameter space to get a sense of what good parameter values might be. Passing grid = 20 asks it to generate 20 candidate combinations.
set.seed(111)
tune_res<-tune_grid(
rf_wf,
resamples = cv_splits,
grid=20
)
## i Creating pre-processing data to finalize unknown parameter: mtry
tune_res
## # Tuning results
## # 5-fold cross-validation using stratification
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [90/23]> Fold1 <tibble [40 x 7]> <tibble [0 x 1]>
## 2 <split [90/23]> Fold2 <tibble [40 x 7]> <tibble [0 x 1]>
## 3 <split [90/23]> Fold3 <tibble [40 x 7]> <tibble [0 x 1]>
## 4 <split [90/23]> Fold4 <tibble [40 x 7]> <tibble [0 x 1]>
## 5 <split [92/21]> Fold5 <tibble [40 x 7]> <tibble [0 x 1]>
Making initial sense of the results
After running the grid across the folds of the training set, we plot the AUC score against each parameter value. It seems that as mtry increases, the AUC improves, while min_n values of 3-10 and tree counts of 1000-1500 perform better.
tune_res%>%
collect_metrics()%>%
filter(.metric=="roc_auc")%>%
select(mean,mtry,trees,min_n)%>%
pivot_longer(mtry:min_n,
values_to="value",
names_to="parameter"
)%>%
ggplot(aes(x=value,y=mean,color=parameter))+
geom_point(show.legend = F)+
facet_wrap(~parameter,scales="free_x")+
labs(x=NULL,y="AUC")
Regular Grid
With those insights, let's run through a more controlled regular grid of 50 combinations.
rf_grid<-grid_regular(
mtry(range=c(2,3)),
trees(range=c(1000,1500)),
min_n(range=c(3,10)),
levels = 5
)
rf_grid
## # A tibble: 50 x 3
## mtry trees min_n
## <int> <int> <int>
## 1 2 1000 3
## 2 3 1000 3
## 3 2 1125 3
## 4 3 1125 3
## 5 2 1250 3
## 6 3 1250 3
## 7 2 1375 3
## 8 3 1375 3
## 9 2 1500 3
## 10 3 1500 3
## # ... with 40 more rows
tune_res2<-tune_grid(
rf_wf,
resamples = cv_splits,
grid=rf_grid
)
tune_res2
## # Tuning results
## # 5-fold cross-validation using stratification
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [90/23]> Fold1 <tibble [100 x 7]> <tibble [0 x 1]>
## 2 <split [90/23]> Fold2 <tibble [100 x 7]> <tibble [0 x 1]>
## 3 <split [90/23]> Fold3 <tibble [100 x 7]> <tibble [0 x 1]>
## 4 <split [90/23]> Fold4 <tibble [100 x 7]> <tibble [0 x 1]>
## 5 <split [92/21]> Fold5 <tibble [100 x 7]> <tibble [0 x 1]>
Best model
Looking at the best models, we see that several performed excellently, with an AUC of 0.995. Let's pick the one with mtry = 3, trees = 1375 and min_n = 3 as our best model and finalize the workflow.
tune_res2%>%
show_best()
## # A tibble: 5 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 1375 3 roc_auc hand_till 0.995 5 0.00223 Model08
## 2 3 1375 4 roc_auc hand_till 0.995 5 0.00223 Model18
## 3 3 1250 6 roc_auc hand_till 0.995 5 0.00223 Model26
## 4 3 1375 6 roc_auc hand_till 0.995 5 0.00223 Model28
## 5 3 1125 3 roc_auc hand_till 0.995 5 0.00238 Model04
rf_best<-tune_res2%>%
select_best(metric="roc_auc")
rf_best
## # A tibble: 1 x 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 3 1375 3 Model08
rf_wf_final<-rf_wf%>%
finalize_workflow(rf_best)
rf_wf_final
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
##
## * step_corr()
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 3
## trees = 1375
## min_n = 3
##
## Computational engine: ranger
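As an aside, this is where the vip package loaded earlier would come in. ranger only records variable importance when asked, so a hedged sketch would refit the tuned model with importance = "impurity" set on the engine and then plot the scores:
# Sketch: refit with importance recorded so vip() has something to plot
rf_imp_spec <- rand_forest(mtry = 3, trees = 1375, min_n = 3) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_imp_spec) %>%
  fit(data = train) %>%
  pull_workflow_fit() %>%   # the underlying parsnip/ranger fit
  vip()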
Fit on test data
Our model worked well on the training set. Let's see how it performs on the unseen test set. Not bad, with an AUC of 0.948.
rf_fit<-rf_wf_final%>%
last_fit(split=splits)
rf_fit%>%
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.865
## 2 roc_auc hand_till 0.948
Looking at the ROC curves and the confusion matrix, we see that setosa is detected perfectly, while we were subpar at separating versicolor and virginica.
rf_preds<-rf_fit%>%
collect_predictions()
rf_preds
## # A tibble: 37 x 7
## id .pred_setosa .pred_versicolor .pred_virginica .row .pred_class Species
## <chr> <dbl> <dbl> <dbl> <int> <fct> <fct>
## 1 trai~ 1 0 0 1 setosa setosa
## 2 trai~ 1 0 0 25 setosa setosa
## 3 trai~ 1 0 0 26 setosa setosa
## 4 trai~ 1 0 0 28 setosa setosa
## 5 trai~ 1 0 0 30 setosa setosa
## 6 trai~ 1 0 0 37 setosa setosa
## 7 trai~ 1 0 0 38 setosa setosa
## 8 trai~ 1 0 0 40 setosa setosa
## 9 trai~ 1 0 0 41 setosa setosa
## 10 trai~ 1 0 0 45 setosa setosa
## # ... with 27 more rows
rf_preds %>%
conf_mat(truth = Species, estimate = .pred_class)
## Truth
## Prediction setosa versicolor virginica
## setosa 11 0 0
## versicolor 0 12 1
## virginica 0 4 9
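For a broader read of the same confusion matrix, summary() on a conf_mat object returns a set of macro-averaged metrics (sensitivity, specificity, precision, and friends):
# Macro-averaged summary metrics derived from the confusion matrix
rf_preds %>%
  conf_mat(truth = Species, estimate = .pred_class) %>%
  summary()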
rf_preds%>%
roc_curve(truth=Species,.pred_setosa:.pred_virginica)%>%
autoplot()
That's it, folks. We covered a lot of ground, and while the model was not exactly stellar, the point is that tidymodels is a breeze to work with and is likely to become an integral part of any data enthusiast's machine learning toolbox.