Introduction
Lately I have been immersed in learning all things tidymodels after office hours. But I was missing something: a way to gauge how effective that learning has been. Committing to a large-scale competition felt unwieldy, but then came #Sliced, a two-hour data science problem-solving sprint with small datasets.
Here, I tackle the S01E02 problem, which uses the aircraft wildlife strikes dataset.
Load Libraries and other presets
library(tidyverse) # Data Wrangling
library(tidymodels) # Modelling
library(themis) # Class imbalance
library(parallel) # Parallel operations
library(doParallel) # Parallel operations
library(tictoc) # Timing
library(GGally) # For pair plots
Enable parallel processing
cores<-detectCores(logical=FALSE)-1 # Physical cores, leaving one free
# cores
core_cluster<-makePSOCKcluster(cores)
# core_cluster
registerDoParallel(core_cluster)
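A cluster registered this way keeps its worker processes alive until it is stopped. As a minimal sketch of the full lifecycle, assuming a small two-worker cluster (the name `demo_cluster` and the worker count are made up for illustration and are not part of the original workflow):

```r
library(parallel)   # makePSOCKcluster(), stopCluster()
library(doParallel) # registerDoParallel(); also attaches foreach

# Hypothetical two-worker cluster, used only for this illustration
demo_cluster <- makePSOCKcluster(2)
registerDoParallel(demo_cluster)

# Number of workers the foreach backend currently sees
n_workers <- getDoParWorkers()

# When the heavy work is done: shut the workers down and
# fall back to the sequential backend
stopCluster(demo_cluster)
registerDoSEQ()
n_workers_after <- getDoParWorkers()
```

Stopping the cluster at the end of a session frees the memory held by the worker processes.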
Get Train and test data
Besides reading the data, I did a few things here:
* Converted all character columns to factors so they play nicely with different models.
* Releveled the outcome factor so that “damaged” is its first level, i.e. the target “true” label. This makes the results easier to read.
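The effect of releveling can be seen on a toy vector. This is only a sketch with made-up values (the vector `outcome` is hypothetical, not the competition data):

```r
library(forcats)

# Toy outcome vector, made up for illustration
outcome <- as_factor(c("no damage", "damaged", "no damage"))

# as_factor() orders levels by first appearance,
# so "no damage" comes first here
levels_before <- levels(outcome)

# Move "damaged" to the front so it becomes the first level
outcome <- fct_relevel(outcome, "damaged")
levels_after <- levels(outcome)

levels_before
levels_after
```

This matters because yardstick metrics treat the first factor level as the event of interest by default (`event_level = "first"`).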
Train data
train_orig<-read_csv("Data/S01E02/train.csv",
guess_max=1e5)%>%
mutate(
damaged=case_when(
damaged>0 ~ "damaged",
TRUE ~ "no damage"
),
across(where(is.character),as_factor),
damaged=fct_relevel(damaged,"damaged")
)
# Check out training data
train_orig%>%
glimpse()
## Rows: 21,000
## Columns: 34
## $ id <dbl> 23637, 8075, 5623, 19605, 15142, 27235, 12726, 20781,~
## $ incident_year <dbl> 1996, 1999, 2011, 2007, 2007, 2013, 2002, 2013, 2015,~
## $ incident_month <dbl> 11, 6, 12, 9, 9, 5, 5, 5, 7, 8, 10, 9, 11, 7, 5, 3, 3~
## $ incident_day <dbl> 7, 26, 1, 13, 13, 28, 4, 19, 22, 22, 21, 7, 2, 7, 20,~
## $ operator_id <fct> MIL, UAL, SWA, SWA, MIL, UNK, UAL, BUS, UNK, BUS, EGF~
## $ operator <fct> MILITARY, UNITED AIRLINES, SOUTHWEST AIRLINES, SOUTHW~
## $ aircraft <fct> T-1A, B-757-200, B-737-300, B-737-700, KC-135R, UNKNO~
## $ aircraft_type <fct> A, A, A, A, A, NA, A, A, NA, A, A, A, A, NA, A, A, A,~
## $ aircraft_make <fct> 748, 148, 148, 148, NA, NA, 148, 226, NA, NA, 332, 58~
## $ aircraft_model <dbl> NA, 26, 24, 42, NA, NA, 97, 7, NA, NA, 14, 22, 37, NA~
## $ aircraft_mass <dbl> 3, 4, 4, 4, NA, NA, 4, 1, NA, 1, 3, 4, 4, NA, 4, 4, 4~
## $ engine_make <dbl> 31, 34, 10, 10, NA, NA, 34, 7, NA, NA, 1, 34, 34, NA,~
## $ engine_model <fct> 1, 40, 1, 1, NA, NA, 46, 10, NA, NA, 10, 10, 10, NA, ~
## $ engines <dbl> 2, 2, 2, 2, NA, NA, 2, 1, NA, 2, 2, 2, 2, NA, 2, 2, 2~
## $ engine_type <fct> D, D, D, D, NA, NA, D, A, NA, C, D, D, D, NA, D, D, D~
## $ engine1_position <dbl> 5, 1, 1, 1, NA, NA, 1, 7, NA, 3, 5, 5, 5, NA, 1, 1, 5~
## $ engine2_position <dbl> 5, 1, 1, 1, NA, NA, 1, NA, NA, 3, 5, 5, 5, NA, 1, 1, ~
## $ engine3_position <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ engine4_position <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ airport_id <fct> KLBB, ZZZZ, KOAK, KSAT, KGFK, KMDT, KJFK, KIGQ, KEWR,~
## $ airport <fct> "LUBBOCK PRESTON SMITH INTL ARPT", "UNKNOWN", "METRO ~
## $ state <fct> TX, NA, CA, TX, ND, PA, NY, IL, NJ, UT, MO, MO, AZ, C~
## $ faa_region <fct> ASW, NA, AWP, ASW, AGL, AEA, AEA, AGL, AEA, ANM, ACE,~
## $ flight_phase <fct> LANDING ROLL, NA, LANDING ROLL, APPROACH, APPROACH, N~
## $ visibility <fct> DAY, NA, DAY, NIGHT, NIGHT, NA, NA, NIGHT, NA, NIGHT,~
## $ precipitation <fct> NA, NA, "NONE", "NONE", NA, NA, NA, "FOG", NA, NA, "N~
## $ height <dbl> 0, NA, 0, 300, NA, NA, NA, 2700, NA, 0, 3500, 1400, 0~
## $ speed <dbl> 80, NA, NA, 130, 140, NA, NA, 110, NA, NA, 180, 170, ~
## $ distance <dbl> 0, NA, 0, NA, NA, 0, NA, NA, 0, 0, NA, NA, 0, 0, 0, 0~
## $ species_id <fct> UNKBM, UNKBM, ZT002, UNKBS, ZT105, YI005, UNKBM, UNKB~
## $ species_name <fct> "UNKNOWN MEDIUM BIRD", "UNKNOWN MEDIUM BIRD", "WESTER~
## $ species_quantity <fct> 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 2-10, 1, 1, 2-10, 1~
## $ flight_impact <fct> NA, NA, NONE, NONE, NA, NA, NA, NONE, NA, NA, NONE, N~
## $ damaged <fct> no damage, damaged, no damage, no damage, no damage, ~
Test data
test<-read_csv("Data/S01E02/test.csv",
guess_max = 1e5
)%>%
mutate(
across(where(is.character),as_factor)
)
Exploratory Data Analysis
I am not focusing much on EDA here, but a few charts will not hurt. I took some cues from Julia Silge’s post and added some of my own on top.
Class balance check
The outcome is severely imbalanced. We will address that in the pre-processing step.
train_orig%>%
count(damaged)%>%
ggplot(aes(damaged,n,fill=damaged))+
geom_col()+
geom_text(aes(label=n))
balance_share<-train_orig%>%
count(damaged)%>%
mutate(
share=n/sum(n)
)%>%
slice_max(share)%>%
select(share)%>%
pull()
balance_share # Will be using this on another viz
## [1] 0.9143333
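To see why this imbalance matters, here is a rough sketch of a "model" that always predicts the majority class. It rebuilds class counts from the row count and the share printed above, and assumes the majority class is "no damage" (the glimpse output suggests this, but it is an assumption here). Only base R is used.

```r
# Values taken from the output above; the majority class being
# "no damage" is an assumption for this illustration
n_total <- 21000
majority_share <- 0.9143333
n_major <- round(n_total * majority_share)

truth <- factor(
  c(rep("no damage", n_major), rep("damaged", n_total - n_major)),
  levels = c("damaged", "no damage")
)

# Naive classifier: always predict the majority class
pred <- factor(rep("no damage", n_total), levels = levels(truth))

# Accuracy looks high, yet no damaged case is ever caught
accuracy <- mean(pred == truth)
sensitivity <- mean(pred[truth == "damaged"] == "damaged")

accuracy
sensitivity
```

The accuracy equals the majority share while the sensitivity for "damaged" is zero, which is why accuracy alone is a poor yardstick for this problem.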
Checking Pair plots of numeric variables
With so many variables, the plot turned out ugly. Some variables, such as speed and height, look like candidates for extra pre-processing, but for this model I will skip it.
train_orig%>%
select(damaged,incident_year,aircraft_mass,
engines,contains("_position"),
height,speed,distance
)%>%
ggpairs(columns = 2:11,
aes(color=damaged,alpha=0.5),
progress=FALSE
)