Learning from DataCamp's Car Insurance Claim Outcomes
I have been working through “Introduction to Linear Regression” on DataCamp for a few days, and today was the day I applied what I had learned. More specifically, I learned Simple Linear Regression and Simple Logistic Regression in R, and the project I was given was predicting Car Insurance Claim Outcomes. During the lessons, I had little trouble implementing what was taught, but things took a turn when it came to applying it to a project.
The question for the project was simple:
Using Logistic Regression models, what single feature has the best predictive performance for a car insurance claim (“outcome”)?
The table below shows the structure of the dataset that I was working with. I adapted it as is from the project’s webpage.
| Column | Description |
|---|---|
| id | Unique client identifier |
| age | Client’s age: <ul><li> 0: 16-25</li><li>1: 26-39</li><li>2: 40-64</li><li>3: 65+</li></ul> |
| driving_experience | Years the client has been driving: <ul><li> 0: 0-9</li><li>1: 10-19</li><li>2: 20-29</li><li>3: 30+</li></ul> |
| education | Client’s level of education: <ul><li> 0: No education</li><li>1: High school</li><li>2: University</li></ul> |
| income | Client’s income level: <ul><li> 0: Poverty</li><li>1: Working class</li><li>2: Middle class</li><li>3: Upper class</li></ul> |
| credit_score | Client’s credit score (between zero and one) |
| vehicle_ownership | Client’s vehicle ownership status: <ul><li> 0: Does not own their vehicle (paying off finance)</li><li>1: Owns their vehicle</li></ul> |
| vehicle_year | Year of vehicle registration: <ul><li> 0: Before 2015</li><li>1: 2015 or later</li></ul> |
| married | Client’s marital status: <ul><li> 0: Not married</li><li>1: Married</li></ul> |
| children | Client’s number of children |
| postal_code | Client’s postal code |
| annual_mileage | Number of miles driven by the client each year |
| vehicle_type | Type of car: <ul><li> 0: Sedan</li><li>1: Sports car</li></ul> |
| speeding_violations | Total number of speeding violations received by the client |
| duis | Number of times the client has been caught driving under the influence of alcohol |
| past_accidents | Total number of previous accidents the client has been involved in |
| outcome | Whether the client made a claim on their car insurance (response variable): <ul><li> 0: No claim</li><li>1: Made a claim</li></ul> |
My initial plan was to perform logistic regression on one variable to check that my code worked, and then to turn it into a function that I could call inside a for-loop.
1. Use `glm()` to model `age` vs `outcome` from the dataset while specifying binomial as the distribution.
2. Predict `outcome` given `age` based on the model I generated in the first step.
3. Store the predictions in `pred`.
4. `mutate` a proportion (`pred / (1 - pred)`) and store it in a new column.

However, I got stuck at step 5.
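For reference, the first steps of that plan can be sketched roughly as below. This is only an illustration on simulated data: the `toy` data frame, its values, and the coefficients used to generate it are assumptions made up for the example, not the project's `car_insurance.csv`, and base R column assignment stands in for `mutate()`.

```r
# Simulated stand-in for the real data (hypothetical values)
set.seed(42)
toy <- data.frame(age = sample(0:3, 200, replace = TRUE))
# Younger age groups are made more likely to claim, purely for illustration
toy$outcome <- rbinom(200, size = 1, prob = plogis(1 - 0.8 * toy$age))

# Step 1: logistic regression of outcome on age
model <- glm(outcome ~ age, data = toy, family = binomial)

# Steps 2-3: predicted probability of a claim for each row, stored in pred
toy$pred <- predict(model, newdata = toy, type = "response")

# Step 4: pred / (1 - pred), i.e. the odds of a claim
toy$odds <- toy$pred / (1 - toy$pred)

head(toy)
```

The quantity `pred / (1 - pred)` is usually called the odds; it is not needed for measuring accuracy, which only compares predicted and actual outcomes.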
After trying for quite some time, I gave up and looked at the solution instead. I decided to retrace the solution to get more insight.
As a disclaimer, I didn’t follow the solution blindly or copy-paste it as is. Instead, I tried to analyze and reason through every line of code to understand the workflow for performing logistic regression. The following is the code I wrote (or rather, imitated).
```r
# Import libraries
library(readr)
library(dplyr)

# Import data
car_insurance <- read_csv("/car_insurance.csv")

# Inspect rows that contain NA values
car_insurance %>%
  filter(if_any(everything(), is.na))

# Get all column names except `outcome` and `id`
features_chr <- car_insurance %>%
  select(-id, -outcome) %>%
  colnames()

# Store the features in a dataframe
features_df <- data.frame(features = features_chr)
```
While retracing the solution, I wrote out a “For Loop logic” to check and clarify how far I understood the workflow.
### For Loop logic
1. Get `features` from dataset.
2. Store `features` into a dataframe variable.
3. Get `features` in a vector format to be used in the for-loop operation.
=== start loop from here ===
4. Generate GLM for each feature to `outcome`.
5. Run the prediction and store it in a vector.
6. Get the proportion of predictions that match the actual values and store it in the `accuracy` variable.
7. Update the respective feature (in `features_df`) with the value of `accuracy` variable.
=== end loop ===
8. See the final **features** dataframe.
9. Arrange in descending order of `accuracy` to see which feature best predicts the `outcome`.
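```r
# Load the glue package
library(glue)

# Fit one logistic regression per feature and record its accuracy
for (f in features_df$features) {
  # Build the formula text, e.g. "outcome ~ age", and fit the model
  model <- glm(glue("outcome ~ {f}"), data = car_insurance, family = "binomial")
  # Round the fitted probabilities to 0 or 1
  pred <- round(fitted(model))
  # Share of predictions that match the actual outcome
  accuracy <- length(which(pred == car_insurance$outcome)) / length(car_insurance$outcome)
  # Write the accuracy into the row for this feature
  features_df[which(features_df$features == f), "accuracy"] <- accuracy
}
```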
This is where I questioned my basic skills in data wrangling in R.

- `glue()` can be used inside a for-loop.
- What the `fitted()` function does.
- The `which()` function.
- How the `length()` function can be applied.

If I had to guess how each of the functions works, I would say:
The `glue()` function concatenates its arguments into a single string and replaces anything inside curly braces with the value of that expression. In this case, `glue("outcome ~ {f}")` takes the current value of `f` from the for statement and inserts it into the string. So, if the first element of `features_df$features` is `age`, the call is equivalent to `glm(outcome ~ age, data = car_insurance, family = "binomial")`, because `glm()` converts the string into a formula. Once the first iteration is complete, the loop moves on to the second element, and so on until the last element of the vector.
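A small sketch of what happens to the formula text in each iteration. The two feature names are a hypothetical subset chosen for the example, and the explicit `as.formula()` call is only there to make the conversion visible; the loop in this post passes the string straight to `glm()`.

```r
library(glue)

# Hypothetical subset of feature names, for illustration only
features <- c("age", "credit_score")

for (f in features) {
  formula_text <- glue("outcome ~ {f}")      # {f} is replaced by the current feature name
  model_formula <- as.formula(formula_text)  # the string becomes a formula object
  print(model_formula)
}
```

Each pass through the loop builds a different formula from the same template, which is what lets one `glm()` line cover every feature.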
The `fitted()` function, I assume, is a shortcut for the longer way of getting predictions out of a logistic regression. The `model` variable holds both the resulting model and the data used to fit it. `fitted()` returns the model's predicted value for each row of that data; for a binomial `glm()`, these are predicted probabilities between 0 and 1, the same values that `predict(model, type = "response")` gives. The raw data stays inside `model` and the predictions go into `pred`, so nothing is lost. The `round()` function, which uses banker’s rounding (halves go to the even digit), then turns each probability into a 0 or 1. So `pred` ends up as a vector of predicted outcomes, one per row used to fit the model, rather than a table holding both the explanatory and the response columns.
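To check that reading of `fitted()`, here is a minimal sketch on simulated data. The `toy` data frame and the coefficients used to generate it are assumptions for the example, not the insurance dataset.

```r
# Simulated data, for illustration only
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, size = 1, prob = plogis(0.5 + 1.5 * toy$x))

model <- glm(y ~ x, data = toy, family = "binomial")

p_fitted  <- fitted(model)                      # probabilities for the training rows
p_predict <- predict(model, type = "response")  # the longer, explicit route
all.equal(p_fitted, p_predict)                  # compares the two vectors

pred <- round(p_fitted)                         # 0/1 labels with a 0.5 cut-off
table(predicted = pred, actual = toy$y)
```

One caveat worth keeping in mind: `glm()` drops rows with missing values by default, so for a feature that contains `NA`s, `fitted(model)` can be shorter than the `outcome` column it is compared against.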
The `which()` function, I suspect, takes a logical condition and returns the positions (row numbers) of the elements that satisfy it.
The `length()` function counts the number of elements in a vector, which is different from functions that count the number of characters in a string value (in R, that is `nchar()`). Nesting `which()` inside `length()` returns the number of rows that satisfy the specified condition(s).
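A tiny worked example of the `which()` and `length()` combination, using made-up vectors rather than the real predictions:

```r
# Made-up predictions and actual outcomes, for illustration only
pred   <- c(1, 0, 0, 1, 1)
actual <- c(1, 0, 1, 1, 0)

matches <- pred == actual    # TRUE TRUE FALSE TRUE FALSE
which(matches)               # positions of the TRUE values: 1 2 4
length(which(matches))       # number of matches: 3

# Accuracy as written in the loop
accuracy <- length(which(pred == actual)) / length(actual)   # 0.6

# The same number, since TRUE counts as 1 and FALSE as 0
mean(pred == actual)         # 0.6

# Counting characters in a string is a different job
nchar("outcome")             # 7
```

So the accuracy line in the loop is the share of rows where the rounded prediction equals the actual outcome.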
```r
# Print the dataframe
features_df

# Return the feature with the highest accuracy
features_df %>%
  arrange(desc(accuracy)) %>%
  slice(1)
```