Caret has long been the go-to package for machine learning in R, but it was never quite as standardized as its Python counterpart, scikit-learn. With tidymodels, that is about to change, with caret developer Max Kuhn spearheading the project. In this post, I will run through the iris dataset to showcase the simplicity of tidymodels.

Loading libraries

Nothing special here, except that I will be using the promising DataExplorer and skimr packages for quick exploration.

library(tidyverse) # For data wrangling
library(tidymodels) # For modeling
library(furrr) # For parallel processing
library(vip) # For variable importance

library(DataExplorer) # For quick exploration
library(skimr) # For quick exploration
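
A note on furrr: it is loaded above for parallel processing, but this post never calls it directly. With the version of tune current at the time of writing, parallelism across resamples instead comes from a registered foreach backend; a minimal sketch, assuming doParallel is installed:

library(doParallel) # register a foreach backend so tune_grid() can run folds in parallel
registerDoParallel(cores = 4) # adjust the core count to your machine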

Data

A quick summary of the data shows nothing unusual.

dataset<-iris

skim(dataset)
Data summary

Name                     dataset
Number of rows           150
Number of columns        5
Column type frequency:
  factor                 1
  numeric                4
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Species                0              1  FALSE           3  set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable  n_missing  complete_rate  mean    sd   p0  p25   p50  p75  p100  hist
Sepal.Length           0              1  5.84  0.83  4.3  5.1  5.80  6.4   7.9  ▆▇▇▅▂
Sepal.Width            0              1  3.06  0.44  2.0  2.8  3.00  3.3   4.4  ▁▆▇▂▁
Petal.Length           0              1  3.76  1.77  1.0  1.6  4.35  5.1   6.9  ▇▁▆▇▂
Petal.Width            0              1  1.20  0.76  0.1  0.3  1.30  1.8   2.5  ▇▁▇▅▃

EDA

There are no missing values, but the correlation plot reveals quite a high correlation between petal length and petal width. We will discard one of them during the pre-processing phase.

plot_missing(dataset)

plot_histogram(dataset)

plot_correlation(dataset%>%select(-Species))
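
To put a number on that relationship before filtering, a quick base-R check (an addition here, not part of the original flow):

dataset %>%
    select(-Species) %>%
    cor() %>% # Petal.Length and Petal.Width correlate at roughly 0.96
    round(2)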

Split Data

Next, we split the data into train and test sets. Validation will then be done with 5-fold cross-validation on the training set, stratified by the Species column.

set.seed(111)
splits<-initial_split(dataset,prop=0.75)

splits
## <Analysis/Assess/Total>
## <113/37/150>
train<-training(splits)
test<-testing(splits)

cv_splits<-vfold_cv(train,v=5,strata='Species')

cv_splits
## #  5-fold cross-validation using stratification 
## # A tibble: 5 x 2
##   splits          id   
##   <list>          <chr>
## 1 <split [90/23]> Fold1
## 2 <split [90/23]> Fold2
## 3 <split [90/23]> Fold3
## 4 <split [90/23]> Fold4
## 5 <split [92/21]> Fold5
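
As an aside, initial_split() also accepts a strata argument, so the initial train/test split could be stratified by Species as well; for example:

splits_strat <- initial_split(dataset, prop = 0.75, strata = 'Species') # hypothetical alternative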

Preprocess with recipe

I will be doing minimal pre-processing: a correlation filter to discard one of any pair of predictors with a correlation of 0.9 or above, plus normalization. Although normalizing is not necessary for tree-based algorithms like random forest, I am including it so that the same flow can be reused with other algorithms too.

rec<-recipe(Species~.,data=train)%>%
    step_corr(all_predictors(),threshold = 0.9)%>%
    step_normalize(all_numeric())

rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          4
## 
## Operations:
## 
## Correlation filter on all_predictors()
## Centering and scaling for all_numeric()
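
To verify what the recipe will actually do before modeling, we can prep() it on the training data and peek at the processed output; a quick sketch:

rec %>%
    prep() %>% # estimate the steps (e.g., which predictor the correlation filter drops)
    juice() %>% # return the processed training set
    glimpse()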

Define model

Next, we define a random forest model using the ranger package as the engine. We leave three parameters open for hyper-parameter optimization:

  1. mtry: The number of predictors that will be randomly sampled at each split when creating the tree models.

  2. trees: The number of trees contained in the ensemble.

  3. min_n: The minimum number of data points in a node required for the node to be split further.

rf_spec<-rand_forest()%>%
    set_engine("ranger")%>%
    set_mode("classification")%>%
    set_args(mtry=tune(),trees=tune(),min_n=tune())

rf_spec
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Computational engine: ranger
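
For reference, the same specification can be written by passing the tuning placeholders as main arguments, which is equivalent to using set_args():

rf_spec2 <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
    set_engine("ranger") %>%
    set_mode("classification")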

Putting into workflow

Next, we put all of this into a workflow, which combines pre-processing, modeling and post-processing into a flexible chain.

rf_wf<-workflow()%>%
    add_recipe(rec)%>%
    add_model(rf_spec)

rf_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_corr()
## * step_normalize()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Computational engine: ranger

Hyper-parameter grid

After that, we tune over a grid to get a sense of what good parameter values might be. Passing grid = 20 below asks tune_grid() to generate a random grid of 20 combinations for us.

set.seed(111)

tune_res<-tune_grid(
    rf_wf,
    resamples = cv_splits,
    grid=20
)
## i Creating pre-processing data to finalize unknown parameter: mtry
tune_res
## # Tuning results
## # 5-fold cross-validation using stratification 
## # A tibble: 5 x 4
##   splits          id    .metrics          .notes          
##   <list>          <chr> <list>            <list>          
## 1 <split [90/23]> Fold1 <tibble [40 x 7]> <tibble [0 x 1]>
## 2 <split [90/23]> Fold2 <tibble [40 x 7]> <tibble [0 x 1]>
## 3 <split [90/23]> Fold3 <tibble [40 x 7]> <tibble [0 x 1]>
## 4 <split [90/23]> Fold4 <tibble [40 x 7]> <tibble [0 x 1]>
## 5 <split [92/21]> Fold5 <tibble [40 x 7]> <tibble [0 x 1]>
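
The note about finalizing mtry appears because mtry()'s upper bound depends on the number of predictors, which tune_grid() works out from the data. The same can be done manually with dials::finalize(); a sketch (here against the raw predictors, whereas tune uses the recipe-processed ones):

mtry() %>%
    finalize(train %>% select(-Species)) # fills in mtry's unknown upper bound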

Making initial sense

After running the grid on the different folds of the training set, we plot the AUC score against each parameter value. It seems that AUC improves as mtry increases, and that min_n values of 3-10 and tree counts of 1000-1500 perform better.

tune_res%>%
    collect_metrics()%>%
    filter(.metric=="roc_auc")%>%
    select(mean,mtry,trees,min_n)%>%
    pivot_longer(mtry:min_n,
        values_to="value",
        names_to="parameter"
    )%>%
    ggplot(aes(x=value,y=mean,color=parameter))+
    geom_point(show.legend = F)+
    facet_wrap(~parameter,scales="free_x")+
    labs(x=NULL,y="AUC")

Regular Grid

With those insights, let's run through a more controlled, regular grid of 50 combinations.

rf_grid<-grid_regular(
    mtry(range=c(2,3)),
    trees(range=c(1000,1500)),
    min_n(range=c(3,10)),
    levels = 5
)


rf_grid
## # A tibble: 50 x 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     2  1000     3
##  2     3  1000     3
##  3     2  1125     3
##  4     3  1125     3
##  5     2  1250     3
##  6     3  1250     3
##  7     2  1375     3
##  8     3  1375     3
##  9     2  1500     3
## 10     3  1500     3
## # ... with 40 more rows
tune_res2<-tune_grid(
    rf_wf,
    resamples = cv_splits,
    grid=rf_grid
)


tune_res2
## # Tuning results
## # 5-fold cross-validation using stratification 
## # A tibble: 5 x 4
##   splits          id    .metrics           .notes          
##   <list>          <chr> <list>             <list>          
## 1 <split [90/23]> Fold1 <tibble [100 x 7]> <tibble [0 x 1]>
## 2 <split [90/23]> Fold2 <tibble [100 x 7]> <tibble [0 x 1]>
## 3 <split [90/23]> Fold3 <tibble [100 x 7]> <tibble [0 x 1]>
## 4 <split [90/23]> Fold4 <tibble [100 x 7]> <tibble [0 x 1]>
## 5 <split [92/21]> Fold5 <tibble [100 x 7]> <tibble [0 x 1]>

Best model

Looking at the best models, we see that several performed excellently, with an AUC of 0.995. Let's pick the one with mtry = 3, trees = 1375 and min_n = 3 as our best model and finalize the workflow.

tune_res2%>%
    show_best()
## # A tibble: 5 x 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
## 1     3  1375     3 roc_auc hand_till  0.995     5 0.00223 Model08
## 2     3  1375     4 roc_auc hand_till  0.995     5 0.00223 Model18
## 3     3  1250     6 roc_auc hand_till  0.995     5 0.00223 Model26
## 4     3  1375     6 roc_auc hand_till  0.995     5 0.00223 Model28
## 5     3  1125     3 roc_auc hand_till  0.995     5 0.00238 Model04
rf_best<-tune_res2%>%
    select_best(metric="roc_auc")


rf_best
## # A tibble: 1 x 4
##    mtry trees min_n .config
##   <int> <int> <int> <chr>  
## 1     3  1375     3 Model08
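
With so many candidates tied, select_by_one_std_err() is a useful alternative to select_best(): it picks the simplest model within one standard error of the numerically best one. For instance, to prefer fewer trees:

tune_res2 %>%
    select_by_one_std_err(trees, metric = "roc_auc") # sort by trees, then apply the one-SE rule
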
rf_wf_final<-rf_wf%>%
    finalize_workflow(rf_best)

rf_wf_final
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 2 Recipe Steps
## 
## * step_corr()
## * step_normalize()
## 
## -- Model -----------------------------------------------------------------------
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 3
##   trees = 1375
##   min_n = 3
## 
## Computational engine: ranger
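
We loaded vip at the start but have not used it yet. To extract variable importance from ranger, the engine needs an importance mode set, so a refit is required; a sketch, assuming vip's support for parsnip fits and hard-coding the tuned values from rf_best:

rf_wf_final %>%
    update_model(
        rand_forest(mtry = 3, trees = 1375, min_n = 3) %>%
            set_engine("ranger", importance = "impurity") %>% # "impurity" is one of ranger's importance modes
            set_mode("classification")
    ) %>%
    fit(train) %>%
    pull_workflow_fit() %>% # pull out the underlying parsnip fit
    vip()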

Fit on test data

Our model worked well on the training set. Let's see how it performs on the unseen test set: not bad, with an AUC of 0.948.

rf_fit<-rf_wf_final%>%
    last_fit(split=splits)


rf_fit%>%
    collect_metrics()
## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.865
## 2 roc_auc  hand_till      0.948

Looking at the ROC curve and the confusion matrix, we see that setosa is detected perfectly, while versicolor and virginica are harder to tell apart.

rf_preds<-rf_fit%>%
    collect_predictions()

rf_preds
## # A tibble: 37 x 7
##    id    .pred_setosa .pred_versicolor .pred_virginica  .row .pred_class Species
##    <chr>        <dbl>            <dbl>           <dbl> <int> <fct>       <fct>  
##  1 trai~            1                0               0     1 setosa      setosa 
##  2 trai~            1                0               0    25 setosa      setosa 
##  3 trai~            1                0               0    26 setosa      setosa 
##  4 trai~            1                0               0    28 setosa      setosa 
##  5 trai~            1                0               0    30 setosa      setosa 
##  6 trai~            1                0               0    37 setosa      setosa 
##  7 trai~            1                0               0    38 setosa      setosa 
##  8 trai~            1                0               0    40 setosa      setosa 
##  9 trai~            1                0               0    41 setosa      setosa 
## 10 trai~            1                0               0    45 setosa      setosa 
## # ... with 27 more rows
rf_preds %>% 
    conf_mat(truth = Species, estimate = .pred_class)
##             Truth
## Prediction   setosa versicolor virginica
##   setosa         11          0         0
##   versicolor      0         12         1
##   virginica       0          4         9
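
The same confusion matrix can also be plotted directly, since yardstick's conf_mat objects have an autoplot() method:

rf_preds %>%
    conf_mat(truth = Species, estimate = .pred_class) %>%
    autoplot(type = "heatmap") # "mosaic" is the other option
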
rf_preds%>%
    roc_curve(truth=Species,.pred_setosa:.pred_virginica)%>%
    autoplot()

That’s it, folks. We covered a lot of ground, and while the model was not exactly excellent, the point stands: tidymodels is a breeze to work with and is likely to become an integral part of any data enthusiast’s machine learning toolbox.