Introduction

Lately I have been immersed in learning all things tidymodels in my after-office hours. But I was missing something: a way to gauge how effective my learning journey was. Committing to a large-scale competition felt unwieldy, but then came #Sliced, a two-hour data science problem-solving sprint with small datasets.

Here, I tackle the S01E02 problem, which uses the aircraft wildlife strikes dataset.

Load Libraries and other presets

library(tidyverse) # Data Wrangling
library(tidymodels) # Modelling
library(themis) # Class imbalance

library(parallel) # Parallel operations
library(doParallel) # Parallel operations
library(tictoc) # Timing

library(GGally) # For pair plots

Enable parallel processing

cores<-detectCores(logical=FALSE)-1 # Physical cores, leaving one free
# cores

core_cluster<-makePSOCKcluster(cores)
# core_cluster

registerDoParallel(core_cluster)
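
A PSOCK cluster keeps its worker processes alive until it is stopped, so it is worth shutting it down once the modelling is done. The sketch below shows the full register/stop cycle on a small two-worker cluster. `demo_cluster` is a hypothetical name used only for illustration; in this post the object to stop would be `core_cluster`.

```r
library(parallel)
library(doParallel)

# Hypothetical two-worker cluster, for illustration only
demo_cluster<-makePSOCKcluster(2)
registerDoParallel(demo_cluster)

# Number of workers foreach will use while the cluster is registered
n_workers<-foreach::getDoParWorkers()
n_workers

# Clean up: stop the workers and fall back to sequential execution
stopCluster(demo_cluster)
foreach::registerDoSEQ()
```

Running the cleanup at the very end of the script avoids leaving idle R processes behind.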

Get Train and test data

Here I did a few things besides reading the data:

* Converted all character columns to factors so they play nicely with different models.
* Releveled the outcome factor so that “damaged” is its first level, i.e. the positive (“true”) class. This makes the results easier to read.

Train data

train_orig<-read_csv("Data/S01E02/train.csv",
                     guess_max=1e5)%>%
  mutate(
    damaged=case_when(
      damaged>0 ~ "damaged",
      TRUE ~ "no damage"
    ),
    across(where(is.character),as_factor),
    damaged=fct_relevel(damaged,"damaged")
  )


# Check out training data
train_orig%>%
  glimpse()
## Rows: 21,000
## Columns: 34
## $ id               <dbl> 23637, 8075, 5623, 19605, 15142, 27235, 12726, 20781,~
## $ incident_year    <dbl> 1996, 1999, 2011, 2007, 2007, 2013, 2002, 2013, 2015,~
## $ incident_month   <dbl> 11, 6, 12, 9, 9, 5, 5, 5, 7, 8, 10, 9, 11, 7, 5, 3, 3~
## $ incident_day     <dbl> 7, 26, 1, 13, 13, 28, 4, 19, 22, 22, 21, 7, 2, 7, 20,~
## $ operator_id      <fct> MIL, UAL, SWA, SWA, MIL, UNK, UAL, BUS, UNK, BUS, EGF~
## $ operator         <fct> MILITARY, UNITED AIRLINES, SOUTHWEST AIRLINES, SOUTHW~
## $ aircraft         <fct> T-1A, B-757-200, B-737-300, B-737-700, KC-135R, UNKNO~
## $ aircraft_type    <fct> A, A, A, A, A, NA, A, A, NA, A, A, A, A, NA, A, A, A,~
## $ aircraft_make    <fct> 748, 148, 148, 148, NA, NA, 148, 226, NA, NA, 332, 58~
## $ aircraft_model   <dbl> NA, 26, 24, 42, NA, NA, 97, 7, NA, NA, 14, 22, 37, NA~
## $ aircraft_mass    <dbl> 3, 4, 4, 4, NA, NA, 4, 1, NA, 1, 3, 4, 4, NA, 4, 4, 4~
## $ engine_make      <dbl> 31, 34, 10, 10, NA, NA, 34, 7, NA, NA, 1, 34, 34, NA,~
## $ engine_model     <fct> 1, 40, 1, 1, NA, NA, 46, 10, NA, NA, 10, 10, 10, NA, ~
## $ engines          <dbl> 2, 2, 2, 2, NA, NA, 2, 1, NA, 2, 2, 2, 2, NA, 2, 2, 2~
## $ engine_type      <fct> D, D, D, D, NA, NA, D, A, NA, C, D, D, D, NA, D, D, D~
## $ engine1_position <dbl> 5, 1, 1, 1, NA, NA, 1, 7, NA, 3, 5, 5, 5, NA, 1, 1, 5~
## $ engine2_position <dbl> 5, 1, 1, 1, NA, NA, 1, NA, NA, 3, 5, 5, 5, NA, 1, 1, ~
## $ engine3_position <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ engine4_position <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ airport_id       <fct> KLBB, ZZZZ, KOAK, KSAT, KGFK, KMDT, KJFK, KIGQ, KEWR,~
## $ airport          <fct> "LUBBOCK PRESTON SMITH INTL ARPT", "UNKNOWN", "METRO ~
## $ state            <fct> TX, NA, CA, TX, ND, PA, NY, IL, NJ, UT, MO, MO, AZ, C~
## $ faa_region       <fct> ASW, NA, AWP, ASW, AGL, AEA, AEA, AGL, AEA, ANM, ACE,~
## $ flight_phase     <fct> LANDING ROLL, NA, LANDING ROLL, APPROACH, APPROACH, N~
## $ visibility       <fct> DAY, NA, DAY, NIGHT, NIGHT, NA, NA, NIGHT, NA, NIGHT,~
## $ precipitation    <fct> NA, NA, "NONE", "NONE", NA, NA, NA, "FOG", NA, NA, "N~
## $ height           <dbl> 0, NA, 0, 300, NA, NA, NA, 2700, NA, 0, 3500, 1400, 0~
## $ speed            <dbl> 80, NA, NA, 130, 140, NA, NA, 110, NA, NA, 180, 170, ~
## $ distance         <dbl> 0, NA, 0, NA, NA, 0, NA, NA, 0, 0, NA, NA, 0, 0, 0, 0~
## $ species_id       <fct> UNKBM, UNKBM, ZT002, UNKBS, ZT105, YI005, UNKBM, UNKB~
## $ species_name     <fct> "UNKNOWN MEDIUM BIRD", "UNKNOWN MEDIUM BIRD", "WESTER~
## $ species_quantity <fct> 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 2-10, 1, 1, 2-10, 1~
## $ flight_impact    <fct> NA, NA, NONE, NONE, NA, NA, NA, NONE, NA, NA, NONE, N~
## $ damaged          <fct> no damage, damaged, no damage, no damage, no damage, ~
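
To see what the releveling step in the train data code does, here is a minimal sketch on a made-up four-row tibble (`toy` and its values are assumptions for illustration, not part of the dataset). `as_factor()` orders levels by first appearance, so without `fct_relevel()` the first level depends on whichever label happens to come first in the file.

```r
library(dplyr)
library(forcats)

# Made-up outcome column: 0 = no damage, 1 = damage
toy<-tibble(damaged=c(0,1,0,0))%>%
  mutate(
    damaged=if_else(damaged>0,"damaged","no damage"),
    damaged=as_factor(damaged)
  )

# Levels follow order of appearance: "no damage" comes first here
levels_before<-levels(toy$damaged)

# Move "damaged" to the front so it becomes the first level
toy<-toy%>%
  mutate(damaged=fct_relevel(damaged,"damaged"))

levels(toy$damaged)
```

By default yardstick metrics treat the first factor level as the event of interest, which is why putting “damaged” first makes metrics such as sensitivity refer to the damaged class.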

Test data

test<-read_csv("Data/S01E02/test.csv",
               guess_max = 1e5
               )%>%
  mutate(
    across(where(is.character),as_factor)
  )

Exploratory Data Analysis

I am not focusing much on EDA here, but a few charts will not hurt. I took some cues from Julia Silge’s post and added some of my own on top.

Class balance check

The outcome is severely imbalanced. We will address that in the pre-processing step.

train_orig%>%
  count(damaged)%>%
  ggplot(aes(damaged,n,fill=damaged))+
  geom_col()+
  geom_text(aes(label=n))

balance_share<-train_orig%>%
  count(damaged)%>%
  mutate(
    share=n/sum(n)
  )%>%
  slice_max(share)%>%
  select(share)%>%
  pull()

balance_share # Will be using this on another viz
## [1] 0.9143333
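
The share above doubles as a baseline: a model that always predicts the largest class would already be about 91.4% accurate, so accuracy alone says little on this data. A quick back-of-the-envelope sketch, using only the row count from the glimpse() output and the printed share (the variable names are mine, chosen for illustration):

```r
n_total<-21000            # Rows reported by glimpse()
majority_share<-0.9143333 # Value of balance_share printed above

# Implied class counts
n_majority<-round(n_total*majority_share)
n_minority<-n_total-n_majority

n_majority
n_minority

# Roughly how many majority rows there are per minority row
n_majority/n_minority
```

That works out to 19,201 rows in the largest class against 1,799 in the smaller one.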

Checking Pair plots of numeric variables

With so many variables, the plot gets ugly. Some variables, such as speed and height, look like candidates for additional pre-processing, but for this model I will skip that step.

train_orig%>%
  select(damaged,incident_year,aircraft_mass,
         engines,contains("_position"),
         height,speed,distance
         )%>%
  ggpairs(columns = 2:11,
          aes(color=damaged,alpha=0.5),
          progress=FALSE
          )