Side Project: Titanic Data Set

Introduction of the Data Set

The RMS Titanic is famously remembered as one of the most tragic maritime disasters in history. Considered an engineering marvel of its time, the Titanic was the largest ship afloat when it embarked on its maiden voyage from Southampton, England, to New York City on April 10, 1912. It was celebrated for its supposed “unsinkable” design, featuring advanced safety features and lavish accommodations.

On the night of April 14, 1912, the Titanic struck an iceberg in the North Atlantic. Regrettably in the early morning of April 15, the ship sank, resulting in the loss of 1,502 of the 2,224 passengers and crew on board recored to be on board. The high casualty rate was due in part to insufficient lifeboats and a lack of readiness for such a catastrophe.

Though through the chaos and luck of surviving the sinking, there was a sense of chance for who survived and who didn’t. There are also brief outlines of certain demographics that surface (as opposed to sinking) in the survival of the passengers aboard. These common themes can be investigated using the Titanic dataset with the following variables as described below. The goal of this project is to use these variables to predict which passengers survived and which did not.

library(knitr)
data_dictionary <- data.frame(
  Variable = c("survival", "pclass", "sex", "age", "sibsp", "parch", "ticket", "fare", "cabin", "embarked"),
  
  Definition = c("Survival", "Ticket class", "Sex", "Age in years", "# of siblings/spouses aboard the Titanic",
                 "# of parents/children aboard the Titanic","Ticket number", "Passenger fare", "Cabin number", "Port of Embarkation"),
  
  Key = c("0 = No, 1 = Yes",
          "1 = 1st, 2 = 2nd, 3 = 3rd", "", "", "", "", "", "", "", "C = Cherbourg, Q = Queenstown, S = Southampton"))

kable(data_dictionary, align = "r", caption = "Titanic Variable Definitions")

Titanic Variable Definitions
Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex
age	Age in years
sibsp	# of siblings/spouses aboard the Titanic
parch	# of parents/children aboard the Titanic
ticket	Ticket number
fare	Passenger fare
cabin	Cabin number
embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

library(ggplot2)
library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.2.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(knitr)
library(knitr)
library(dplyr)
library(stringr)
library(class)
library(tidyr)
library(rpart)
library(randomForest)

## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(pROC)

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(gbm)

## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

library(glmnet)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-10

library(mgcv)

## Loading required package: nlme
## 
## Attaching package: 'nlme'
## 
## The following object is masked from 'package:dplyr':
## 
##     collapse
## 
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.

library(dbarts)

## 
## Attaching package: 'dbarts'
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(xgboost)

Pre processing steps

test_file <- "/Users/miada/Downloads/STT 481/DataSets/titanic_test.csv"
train_file <- "/Users/miada/Downloads/STT 481/DataSets/titanic_train.csv"

test <- read.csv(test_file)
train <- read.csv(train_file) 

# glimpse(test)
# glimpse(train)

# dim(train); dim(test)

I began by loading the Titanic training and test datasets and checking their dimensions to confirm the files were read correctly and to understand how many observations and variables were available before preprocessing.

train <- train %>%
  mutate( across(where(is.character), ~ na_if(str_trim(.), "")),
          across(where(is.factor), ~
                   factor(na_if(str_trim(as.character(.)), ""))))

miss_tbl <- train %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  tidyr::pivot_longer(everything(),
                      names_to = "Feature",
                      values_to = "Missing_Count") %>%
  mutate(Missing_Percent = round(Missing_Count / nrow(train) * 100, 2)) %>%
  arrange(desc(Missing_Percent))


miss_tbl

## # A tibble: 12 × 3
##    Feature     Missing_Count Missing_Percent
##    <chr>               <int>           <dbl>
##  1 Cabin                 687           77.1 
##  2 Age                   177           19.9 
##  3 Embarked                2            0.22
##  4 PassengerId             0            0   
##  5 Survived                0            0   
##  6 Pclass                  0            0   
##  7 Name                    0            0   
##  8 Sex                     0            0   
##  9 SibSp                   0            0   
## 10 Parch                   0            0   
## 11 Ticket                  0            0   
## 12 Fare                    0            0

To assess the quality of the Titanic dataset prior to modeling, I conducted an exploratory analysis focused on identifying and quantifying missing data across all variables. I first cleaned the dataset by trimming whitespace and converting empty or blank character and factor values into proper missing values (NA). I then calculated both the total count and percentage of missing observations for each feature. The results reveal substantial missingness in the Cabin variable, with 687 missing values representing approximately 77.1% of the dataset, indicating that cabin information was largely unrecorded and may be of limited utility for predictive modeling. The Age variable also contains a notable proportion of missing data, with 177 missing values (19.87%), suggesting that age-based analyses or modeling will require imputation or careful handling. In contrast, the Embarked variable has only two missing values (0.22%), while all remaining variables—including passenger class, sex, family relationships, and survival outcome—are complete. Overall, this analysis highlights key data limitations while confirming that most core demographic and outcome variables are sufficiently complete to support survival prediction modeling.

train %>%
  mutate(Cabin_Recorded = !is.na(Cabin)) %>%
  group_by(Pclass) %>%
  summarise(
    Total_Passengers = n(),
    Cabin_Recorded_Count = sum(Cabin_Recorded),
    Cabin_Recorded_Percent = round(mean(Cabin_Recorded) * 100, 2)
  )

## # A tibble: 3 × 4
##   Pclass Total_Passengers Cabin_Recorded_Count Cabin_Recorded_Percent
##    <int>            <int>                <int>                  <dbl>
## 1      1              216                  176                  81.5 
## 2      2              184                   16                   8.7 
## 3      3              491                   12                   2.44

Cabin information was primarily recorded for higher-class passengers, reflecting the documentation practices aboard the Titanic. Historical accounts and prior analyses show that passengers traveling in higher classes experienced substantially higher survival rates compared to those in lower classes. In my exploratory analysis, I found that cabin numbers were overwhelmingly recorded for first-class passengers (at 81.48%) and were rarely documented for lower classes, indicating that the Cabin variable is strongly confounded with both passenger class and survival outcome. Including Cabin in the analysis would therefore introduce class-based bias by disproportionately representing higher-class passengers rather than providing independent predictive information. Also, keeping Cabin would require removing a large proportion of observations due to its high rate of missingness, which woudld end up reducing the effective sample size. Under the bias variance trade off, this reduction would increase model variance and decrease predictive accuracy. For these reasons, I excluded the Cabin variable from modeling to preserve dataset representativeness, maintain sufficient sample size, and avoid reinforcing survival patterns already captured by other variables.

train <- train %>%
  select(-Cabin)

test <- test %>%
  select(-Cabin)

train <- train %>%
  group_by(Sex, Pclass) %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup()

test <- test %>%
  group_by(Sex, Pclass) %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup()

Because age was missing for approximately 21% of passengers, removing observations with missing age values would have substantially reduced the effective sample size and increased model variance, particularly for a distance based method such as KNN. Instead, I imputed missing age values using the median age within sex and passenger class groups, preserving both sample while avoiding excessive bias.

train_y <- train$Survived
train_x <- train %>% select(-Survived)

combined <- bind_rows(train_x, test)

combined_dummies <- model.matrix(~ . - 1, data = combined)

n_train <- nrow(train_x)
x_train <- combined_dummies[1:n_train, , drop = FALSE]
x_test  <- combined_dummies[(n_train + 1):nrow(combined_dummies), , drop = FALSE]

To prepare the data for K-nearest neighbors modeling, I converted all predictor variables into a fully numeric format. I then combined the training and testing predictor datasets and used one hot encoding to create dummy variables for all categorical features, ensuring that both datasets shared the same numeric feature structure. This step guarantees consistent columns between the training and testing sets and avoids issues arising from mismatched factor levels. After encoding, I split the combined data back into their original training and testing components. This process resulted in numeric matrices suitable for distance-based modeling, allowing KNN to be applied without coercion errors or inconsistencies between datasets.

library(tidyr)
train_complete <- complete.cases(x_train)
test_complete  <- complete.cases(x_test)

x_train <- x_train[train_complete, , drop = FALSE]
train_y <- train_y[train_complete]

x_test <- x_test[test_complete, , drop = FALSE]

# any(is.na(x_train))
# any(is.na(x_test))
# colSums(is.na(x_train))[colSums(is.na(x_train)) > 0]

After imputing missing age values and creating those dummy variables, I removed any remaining observations with incomplete predictor information. I kept only complete cases in the processed training and testing datasets to ensure that all inputs to the KNN model were fully observed and suitable for distance-based calculations, while minimizing data loss after age imputation.

K-Nearest Neighbor

set.seed(10)


n <- nrow(x_train)
val_idx   <- sample(seq_len(n), size = floor(0.2 * n))
train_idx <- setdiff(seq_len(n), val_idx)

X_train_raw <- x_train[train_idx, , drop = FALSE]
X_val_raw   <- x_train[val_idx, , drop = FALSE]
y_train <- factor(train_y[train_idx], levels = c(0, 1))
y_val   <- factor(train_y[val_idx],   levels = c(0, 1))


keep_cols <- apply(X_train_raw, 2, sd) > 0
X_train_raw <- X_train_raw[, keep_cols, drop = FALSE]
X_val_raw   <- X_val_raw[,   keep_cols, drop = FALSE]
X_test_raw  <- x_test[,      keep_cols, drop = FALSE]


center <- colMeans(X_train_raw)
scalev <- apply(X_train_raw, 2, sd)
scalev[scalev == 0] <- 1

scale_mat <- function(M, center, scalev) {
scale(M, center = center, scale = scalev)
}

X_train <- scale_mat(X_train_raw, center, scalev)
X_val   <- scale_mat(X_val_raw,   center, scalev)
X_test  <- scale_mat(X_test_raw,  center, scalev)

After preprocessing, I created an 80/20 train/validation split so that I could tune and evaluate the KNN model fairly. I also scaled the predictors using the mean and standard deviation computed from the training split only, then applied those same scaling values to the validation and test sets to avoid data leakage.

k_grid  <- seq(1, 25, by = 2)
cv_err  <- numeric(length(k_grid))

# 10-fold CV on the 80% training split

set.seed(10)
K <- 10
folds <- sample(rep(1:K, length.out = nrow(X_train)))

for (i in seq_along(k_grid)) {
k_val <- k_grid[i]
fold_err <- numeric(K)

for (f in 1:K) {
tr <- folds != f
te <- folds == f

pred <- knn(
  train = X_train[tr, , drop = FALSE],
  test  = X_train[te, , drop = FALSE],
  cl    = y_train[tr],
  k     = k_val
)

fold_err[f] <- mean(pred != y_train[te])


}

cv_err[i] <- mean(fold_err)
}

cv_results <- data.frame(
k = k_grid,
CV_Error = cv_err,
CV_Accuracy = 1 - cv_err
)

cv_results

##     k  CV_Error CV_Accuracy
## 1   1 0.5160603   0.4839397
## 2   3 0.3955790   0.6044210
## 3   5 0.3843897   0.6156103
## 4   7 0.3829812   0.6170188
## 5   9 0.3829812   0.6170188
## 6  11 0.3843897   0.6156103
## 7  13 0.3829812   0.6170188
## 8  15 0.3843897   0.6156103
## 9  17 0.3843897   0.6156103
## 10 19 0.3843897   0.6156103
## 11 21 0.3829812   0.6170188
## 12 23 0.3843897   0.6156103
## 13 25 0.3843897   0.6156103

best_k <- cv_results$k[which.min(cv_results$CV_Error)]
cat("Chosen k (CV):", best_k, "\n")

## Chosen k (CV): 7

To determine the optimal number of neighbors for the KNN classifier, I evaluated odd values of k using cross-validation and compared model performance based on classification error and root mean squared error (RMSE). The lowest cross-validated error was achieved at k=7, with a mean cross-validation error of 0.3876, corresponding to a classification accuracy of approximately 61.2%. This indicates that, on average, the model correctly classified about 61% of passengers across validation folds. The associated RMSE of 0.6170188 reflects the average deviation between the predicted class labels (encoded as 0 or 1) and the true survival outcomes, with lower values indicating more stable and reliable predictions. Compared to smaller values of k, which exhibited higher variance and less stable predictions, and larger values of k, which tended to oversmooth class boundaries and increase bias, k=7 provided the best bias–variance trade-off. Therefore, k=7 was selected as the optimal tuning parameter for the final KNN model.

pred_val <- knn(
train = X_train,
test  = X_val,
cl    = y_train,
k     = best_k
)

val_err <- mean(pred_val != y_val)
val_acc <- 1 - val_err

cat(sprintf("Validation error = %.4f and Validation accuracy = %.4f", 
val_err, val_acc))

## Validation error = 0.3876 and Validation accuracy = 0.6124

Using the selected k value, I evaluated the KNN model on the held-out 20% validation set and reported validation error and accuracy to quantify expected performance on unseen data.

Logistic Regression

Because this is a classification problem we are going to preforma a logistic regression based on the data

test_ids <- test$PassengerId

drop_cols <- intersect(c("PassengerId", "Name", "Ticket"), names(train))

train_lr <- train %>% select(-all_of(drop_cols))
test_lr  <- test  %>% select(-all_of(drop_cols))

I saved the PassengerId variable from the test dataset so it could later be used to create a valid Kaggle submission. I then removed identifier and text-based variables such as PassengerId, Name, and Ticket from the modeling datasets. These variables do not carry direct predictive value in their raw form and tend to introduce unnecessary complexity or instability when used in logistic regression without additional feature engineering. Creating separate training and test objects for logistic regression also allowed me to keep this modeling pipeline independent from my earlier KNN analysis.

mode_fun <- function(x) {
  ux <- na.omit(x)
  ux[which.max(tabulate(match(ux, ux)))]
}

emb_mode <- mode_fun(train_lr$Embarked)

train_lr <- train_lr %>%
  mutate(Embarked = ifelse(is.na(Embarked) | Embarked == "", emb_mode, Embarked))

test_lr <- test_lr %>%
  mutate(Embarked = ifelse(is.na(Embarked) | Embarked == "", emb_mode, Embarked))

# --- Fare: Kaggle test often has 1 NA Fare; fill with TRAIN mean ---
fare_mean <- mean(train_lr$Fare, na.rm = TRUE)

train_lr <- train_lr %>%
  mutate(Fare = ifelse(is.na(Fare), fare_mean, Fare))

test_lr <- test_lr %>%
  mutate(Fare = ifelse(is.na(Fare), fare_mean, Fare))

Before fitting the logistic regression model, I addressed remaining missing values to ensure that all observations could be used during estimation. I imputed missing or blank values of Embarked using the most frequent category observed in the training data, which is appropriate given the low rate of missingness and the categorical nature of the variable. I also imputed missing Fare values using the mean fare from the training set, which is a common approach when only a small number of values are missing. Importantly, all imputation values were computed exclusively from the training data to avoid introducing data leakage from the test set.

train_lr <- train_lr %>%
  mutate(
    Sex      = factor(Sex),
    Embarked = factor(Embarked),
    Pclass   = factor(Pclass)  # treat class as categorical 
  )

test_lr <- test_lr %>%
  mutate(
    Sex      = factor(Sex, levels = levels(train_lr$Sex)),
    Embarked = factor(Embarked, levels = levels(train_lr$Embarked)),
    Pclass   = factor(Pclass, levels = levels(train_lr$Pclass))
  )

I explicitly converted categorical predictors such as Sex, Embarked, and Pclass into factor variables so that R would handle them correctly within the logistic regression framework. Treating Pclass as a factor allows the model to estimate separate effects for second and third class passengers relative to first class, rather than imposing a strictly linear numeric relationship. I also ensured that factor levels in the test dataset matched those in the training dataset, which prevents prediction errors due to unseen categories at evaluation time.

stratified_split <- function(df, ycol, p = 0.2, seed = 19) {
  set.seed(seed)
  y <- df[[ycol]]
  idx_by_class <- split(seq_len(nrow(df)), y)

  val_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_val <- max(1, floor(length(idx) * p))
    sample(idx, n_val)
  }))

  list(
    train = df[-val_idx, , drop = FALSE],
    valid = df[ val_idx, , drop = FALSE],
    val_idx = val_idx
  )
}

splits <- stratified_split(train_lr, "Survived", p = 0.20, seed = 42)
tr_df <- splits$train
vl_df <- splits$valid

To fairly evaluate model performance, I created a 80/20 train validation split of the training data. Stratification ensures that the proportion of survivors and non-survivors remains approximately the same in both subsets, which is particularly important given the class imbalance present in the Titanic dataset. This approach provides a more reliable estimate of out of sample performance than a simple random split.

form_logit <- Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked

logit_fit <- glm(form_logit, data = tr_df, family = binomial())
summary(logit_fit)

## 
## Call:
## glm(formula = form_logit, family = binomial(), data = tr_df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.2464394  0.5429476   7.821 5.24e-15 ***
## Pclass2     -1.0654184  0.3344394  -3.186  0.00144 ** 
## Pclass3     -2.3379105  0.3426427  -6.823 8.91e-12 ***
## Sexmale     -2.6046467  0.2207095 -11.801  < 2e-16 ***
## Age         -0.0431169  0.0089903  -4.796 1.62e-06 ***
## SibSp       -0.3228002  0.1243911  -2.595  0.00946 ** 
## Parch       -0.0794657  0.1433318  -0.554  0.57929    
## Fare         0.0009757  0.0027712   0.352  0.72478    
## EmbarkedQ   -0.0899537  0.4278010  -0.210  0.83346    
## EmbarkedS   -0.4148501  0.2632794  -1.576  0.11509    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 638.70  on 704  degrees of freedom
## AIC: 658.7
## 
## Number of Fisher Scoring iterations: 5

The fitted model reveals several strong and statistically significant relationships with survival. Passenger class shows a clear effect: relative to 1st class passengers, those in 2nd class have substantially lower odds of survival, and 3rd class passengers have dramatically lower odds, even after adjusting for all other variables. Sex is the strongest predictor in the model, with males exhibiting far lower survival odds than females. Age also has a significant negative association with survival, indicating that each additional year of age slightly decreases the odds of surviving the disaster. Among family structure variables, the number of siblings or spouses aboard is negatively associated with survival, while the number of parents or children shows no statistically significant effect once other factors are controlled for. Fare and embarkation port do not appear to have meaningful independent effects on survival after accounting for class, sex, and age. Overall, the model suggests that survival on the Titanic was driven primarily by social class, gender, and age, with smaller contributions from family composition, and these conclusions are supported by both statistical significance and model fit diagnostics.

vl_prob <- predict(logit_fit, newdata = vl_df, type = "response")
vl_pred <- ifelse(vl_prob >= 0.5, 1, 0)

val_acc <- mean(vl_pred == vl_df$Survived)
val_auc <- as.numeric(pROC::auc(pROC::roc(vl_df$Survived, vl_prob, quiet = TRUE)))

cat(sprintf("Validation Accuracy: %.4f\n", val_acc))

## Validation Accuracy: 0.8418

cat(sprintf("Validation AUC:      %.4f\n", val_auc))

## Validation AUC:      0.8796

I evaluated the fitted logistic regression model on the held-out validation set by first generating predicted survival probabilities and then converting those probabilities into binary survival predictions using a standard 0.5 cutoff. Using this approach, the model achieved a validation accuracy of 0.8418, indicating that it correctly classified approximately 84% of passengers in the validation set. To complement accuracy, I also computed the area under the ROC curve (AUC), which measures the model’s ability to distinguish between survivors and non-survivors across all possible probability thresholds. The resulting AUC of 0.8796 reflects strong discriminative performance and indicates that the model ranks survivors substantially higher than non-survivors in terms of predicted survival probability. Compared to the K-nearest neighbors model, which achieved a validation accuracy of approximately 61%, the logistic regression model demonstrates markedly superior predictive performance, suggesting that the linear log-odds structure of logistic regression is well suited to capturing the dominant survival patterns present in the Titanic data.

coef_table <- data.frame(
  Term = names(coef(logit_fit)),
  Estimate = coef(logit_fit),
  OddsRatio = exp(coef(logit_fit))
)

coef_table

##                    Term      Estimate   OddsRatio
## (Intercept) (Intercept)  4.2464394426 69.85624186
## Pclass2         Pclass2 -1.0654183826  0.34458366
## Pclass3         Pclass3 -2.3379104619  0.09652913
## Sexmale         Sexmale -2.6046466610  0.07392925
## Age                 Age -0.0431168870  0.95779943
## SibSp             SibSp -0.3228002379  0.72411849
## Parch             Parch -0.0794656593  0.92360974
## Fare               Fare  0.0009756742  1.00097615
## EmbarkedQ     EmbarkedQ -0.0899537498  0.91397346
## EmbarkedS     EmbarkedS -0.4148501037  0.66043927

Sex is the single most influential predictor in the model. The coefficient for males corresponds to an odds ratio of 0.074, meaning that males had approximately 93% lower odds of survival than females after adjusting for all other variables. This effect is highly statistically significant and is consistent with historical accounts of the “women and children first” evacuation policy during the disaster.

Passenger class emerges as one of the strongest predictors of survival. Relative to first-class passengers, those traveling in second class have an odds ratio of 0.345, indicating that their odds of survival are reduced by approximately 65%, holding all other variables constant. The effect is even more pronounced for third-class passengers, whose odds ratio of 0.097 implies a reduction in survival odds of over 90% compared to first class. Both effects are highly statistically significant, highlighting the strong role of socioeconomic status in survival outcomes

Age has a statistically significant negative association with survival. The estimated odds ratio of 0.958 indicates that each additional year of age reduces the odds of survival by roughly 4.2%, on average. While the effect of age is smaller than that of sex or passenger class, it remains an important predictor, reflecting the greater vulnerability of older passengers during the evacuation.

make_strat_folds <- function(y, K = 10, seed = 19) {
  set.seed(seed)
  y <- factor(y)
  by_class <- lapply(split(seq_along(y), y), sample)

  # avoid empty folds if a class is small
  K <- min(K, min(lengths(by_class)))

  pieces <- lapply(by_class, function(ix) split(ix, cut(seq_along(ix), K, labels = FALSE)))

  lapply(seq_len(K), function(k) {
    unname(unlist(lapply(pieces, `[[`, k), use.names = FALSE))
  })
}

K <- 10
folds <- make_strat_folds(tr_df$Survived, K = K, seed = 19)

oof_prob <- rep(NA_real_, nrow(tr_df))
fold_err <- numeric(length(folds))

for (f in seq_along(folds)) {
  te_idx <- folds[[f]]
  tr_idx <- setdiff(seq_len(nrow(tr_df)), te_idx)

  fit_f <- glm(form_logit, data = tr_df[tr_idx, , drop = FALSE], family = binomial())
  prob_f <- predict(fit_f, newdata = tr_df[te_idx, , drop = FALSE], type = "response")

  oof_prob[te_idx] <- prob_f
  pred_f <- ifelse(prob_f >= 0.5, 1, 0)
  fold_err[f] <- mean(pred_f != tr_df$Survived[te_idx])
}

cv_err <- mean(fold_err)
cv_acc <- 1 - cv_err
cv_se  <- sd(fold_err) / sqrt(length(folds))

cv_auc <- as.numeric(pROC::auc(pROC::roc(tr_df$Survived, oof_prob, quiet = TRUE)))
cv_brier <- mean((tr_df$Survived - oof_prob)^2)

cat(sprintf("Mean CV misclass error: %.4f (SE = %.4f)\n", cv_err, cv_se))

## Mean CV misclass error: 0.1988 (SE = 0.0182)

cat(sprintf("Mean CV accuracy:       %.4f\n", cv_acc))

## Mean CV accuracy:       0.8012

cat(sprintf("CV AUC:                 %.4f\n", cv_auc))

## CV AUC:                 0.8438

cat(sprintf("CV Brier score:         %.4f\n", cv_brier))

## CV Brier score:         0.1449

Age has a statistically significant negative association with survival. The estimated odds ratio of 0.958 indicates that each additional year of age reduces the odds of survival by roughly 4.2%, on average. While the effect of age is smaller than that of sex or passenger class, it remains an important predictor, reflecting the greater vulnerability of older passengers during the evacuation.

logit_full <- glm(form_logit, data = train_lr, family = binomial())

test_prob <- predict(logit_full, newdata = test_lr, type = "response")
test_pred <- ifelse(test_prob >= 0.5, 1, 0)

submission_logit <- data.frame(
  PassengerId = test_ids,
  Survived = test_pred
)

write.csv(submission_logit, "submission_logistic.csv", row.names = FALSE)
head(submission_logit)

##   PassengerId Survived
## 1         892        0
## 2         893        0
## 3         894        0
## 4         895        0
## 5         896        1
## 6         897        0

Subset Selection Methods: Foward Selection

vars <- c("Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked")

tr2 <- tr_df[, vars, drop = FALSE]
vl2 <- vl_df[, vars, drop = FALSE]

I started by creating smaller “modeling frames” for forward selection using only the variables I intended to consider in the subset search. This keeps the stepwise selection procedure focused on the predictors I want and prevents accidental inclusion of identifier or text fields that are not meaningful predictors. I also created a matching validation frame using the exact same variables so that I could evaluate the chosen subset on held-out data using the same predictor set.

tr2 <- tr2 %>%
  mutate(
    Sex = factor(Sex),
    Embarked = factor(Embarked),
    Pclass = factor(Pclass)
  ) %>%
  droplevels() %>%
  filter(complete.cases(.))

Before running forward selection, I made sure that the categorical predictors (Sex, Embarked, and Pclass) were stored as factors so that logistic regression would treat them as categorical variables and create the appropriate indicator variables internally. I then dropped unused factor levels to keep the design matrix clean and avoid issues caused by levels that aren’t present in the training split. Finally, I restricted the modeling frame to complete cases so that glm() and the stepwise procedure would not silently drop different rows across candidate models, which could otherwise make comparisons inconsistent.

null_mod  <- glm(Survived ~ 1, data = tr2, family = binomial())
full_form <- Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked

k_bic <- log(nrow(tr2))

forward_bic <- MASS::stepAIC(
  object = null_mod,
  scope = list(lower = ~1, upper = full_form),
  direction = "forward",
  k = k_bic,
  trace = FALSE
)

summary(forward_bic)

## 
## Call:
## glm(formula = Survived ~ Sex + Pclass + Age + SibSp, family = binomial(), 
##     data = tr2)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.121938   0.452664   9.106  < 2e-16 ***
## Sexmale     -2.625893   0.214152 -12.262  < 2e-16 ***
## Pclass2     -1.246980   0.291867  -4.272 1.93e-05 ***
## Pclass3     -2.463331   0.279876  -8.802  < 2e-16 ***
## Age         -0.044688   0.008876  -5.035 4.78e-07 ***
## SibSp       -0.364762   0.116439  -3.133  0.00173 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 950.86  on 713  degrees of freedom
## Residual deviance: 642.29  on 708  degrees of freedom
## AIC: 654.29
## 
## Number of Fisher Scoring iterations: 5

formula(forward_bic)

## Survived ~ Sex + Pclass + Age + SibSp

I selected a parsimonious logistic regression model containing the predictors Sex, Passenger Class, Age, and Number of Siblings/Spouses (SibSp). The BIC criterion imposes a strong penalty on model complexity, so the resulting subset represents the smallest set of variables that provides substantial explanatory power for survival. All selected predictors are statistically significant at conventional levels, indicating strong evidence of association with survival after adjusting for the other variables in the model. The strongest predictor is sex following that is passenger class and after that age reflective of the orgiginal logistic regression model.

vl2 <- vl2 %>%
  mutate(
    Sex = factor(Sex, levels = levels(tr2$Sex)),
    Embarked = factor(Embarked, levels = levels(tr2$Embarked)),
    Pclass = factor(Pclass, levels = levels(tr2$Pclass))
  ) %>%
  filter(complete.cases(.))

vl_prob_bic <- predict(forward_bic, newdata = vl2, type = "response")
vl_pred_bic <- ifelse(vl_prob_bic >= 0.5, 1, 0)

acc_bic <- mean(vl_pred_bic == vl2$Survived)
auc_bic <- as.numeric(pROC::auc(pROC::roc(vl2$Survived, vl_prob_bic, quiet = TRUE)))

cat(sprintf("Forward-BIC validation accuracy: %.3f\n", acc_bic))

## Forward-BIC validation accuracy: 0.825

cat(sprintf("Forward-BIC validation AUC:      %.3f\n", auc_bic))

## Forward-BIC validation AUC:      0.880

cm_bic <- table(Predicted = vl_pred_bic, True = vl2$Survived)
cm_bic

##          True
## Predicted   0   1
##         0 100  22
##         1   9  46

acc_bic

## [1] 0.8248588

On the validation set, the forward-selected BIC model achieved a classification accuracy of approximately 82.5% and an AUC of 0.88, indicating strong discriminatory ability. The confusion matrix shows that the model correctly classified a large majority of both survivors and non-survivors, with relatively few false positives and false negatives. Importantly, the high AUC demonstrates that the model effectively ranks survivors above non-survivors across a wide range of probability thresholds, not just at the 0.5 cutoff.

set.seed(19)

K <- 10
y_all <- factor(tr2$Survived)

by_class <- lapply(split(seq_along(y_all), y_all), sample)
K <- min(K, min(lengths(by_class)))
pieces <- lapply(by_class, function(ix) split(ix, cut(seq_along(ix), K, labels = FALSE)))
folds <- lapply(seq_len(K), function(k) unname(unlist(lapply(pieces, `[[`, k), use.names = FALSE)))

oof_prob <- rep(NA_real_, nrow(tr2))
fold_err <- numeric(K)

g_sex <- levels(tr2$Sex)
g_emb <- levels(tr2$Embarked)
g_cls <- levels(tr2$Pclass)

for (f in seq_len(K)) {
  te_idx <- folds[[f]]
  tr_idx <- setdiff(seq_len(nrow(tr2)), te_idx)

  tr_f <- tr2[tr_idx, , drop = FALSE]
  te_f <- tr2[te_idx, , drop = FALSE]

  # keep factor levels consistent fold-to-fold
  tr_f$Sex <- factor(tr_f$Sex, levels = g_sex)
  te_f$Sex <- factor(te_f$Sex, levels = g_sex)

  tr_f$Embarked <- factor(tr_f$Embarked, levels = g_emb)
  te_f$Embarked <- factor(te_f$Embarked, levels = g_emb)

  tr_f$Pclass <- factor(tr_f$Pclass, levels = g_cls)
  te_f$Pclass <- factor(te_f$Pclass, levels = g_cls)

  tr_f <- tr_f[complete.cases(tr_f), , drop = FALSE]
  te_f <- te_f[complete.cases(te_f), , drop = FALSE]

  k_bic_f <- log(nrow(tr_f))

  fit_f <- MASS::stepAIC(
    object = glm(Survived ~ 1, data = tr_f, family = binomial()),
    scope  = list(lower = ~1, upper = full_form),
    direction = "forward",
    k = k_bic_f,
    trace = FALSE
  )

  pr <- as.numeric(predict(fit_f, newdata = te_f, type = "response"))
  oof_prob[te_idx] <- pr
  fold_err[f] <- mean((pr >= 0.5) != te_f$Survived)
}

cv_err <- mean(fold_err)
cv_acc <- 1 - cv_err
cv_se  <- sd(fold_err) / sqrt(K)

cv_auc <- as.numeric(pROC::auc(pROC::roc(tr2$Survived, oof_prob, quiet = TRUE)))

cat(sprintf("Forward-BIC 10-fold CV misclass error = %.4f (SE = %.4f)\n", cv_err, cv_se))

## Forward-BIC 10-fold CV misclass error = 0.2002 (SE = 0.0165)

cat(sprintf("Forward-BIC 10-fold CV accuracy       = %.4f\n", cv_acc))

## Forward-BIC 10-fold CV accuracy       = 0.7998

cat(sprintf("Forward-BIC 10-fold CV AUC            = %.4f\n", cv_auc))

## Forward-BIC 10-fold CV AUC            = 0.8455

To assess generalization performance more robustly, I conducted 10-fold stratified cross-validation in which forward BIC selection was repeated within each fold. The resulting mean cross-validated misclassification error was 0.200, corresponding to an accuracy of approximately 80.0%, with a standard error of 0.0165. The cross-validated AUC of 0.846 confirms that the model maintains strong discrimination when evaluated on unseen data. Although performance is slightly lower than on the single validation split—as expected due to the stricter evaluation—the results indicate that the selected subset generalizes well and is not overly sensitive to the specific training sample.

# Build a clean modeling frame from the full training data (same variables)
train2 <- train[, vars, drop = FALSE] %>%
  mutate(
    Sex = factor(Sex),
    Embarked = factor(Embarked),
    Pclass = factor(Pclass)
  ) %>%
  filter(complete.cases(.))

final_fit <- glm(formula(forward_bic), data = train2, family = binomial())

test2 <- test[, vars[-1], drop = FALSE]

test2$Sex <- factor(test2$Sex, levels = levels(train2$Sex))
test2$Embarked <- factor(test2$Embarked, levels = levels(train2$Embarked))
test2$Pclass <- factor(test2$Pclass, levels = levels(train2$Pclass))
test_prob <- predict(final_fit, newdata = test2, type = "response")
test_pred <- ifelse(test_prob >= 0.5, 1, 0)

submission_forward_bic <- data.frame(
  PassengerId = test_ids,
  Survived = test_pred
)

write.csv(submission_forward_bic, "submission_forward_bic.csv", row.names = FALSE)
head(submission_forward_bic)

##   PassengerId Survived
## 1         892        0
## 2         893        0
## 3         894        0
## 4         895        0
## 5         896        1
## 6         897        0

Lasso Regression

x_all <- model.matrix(Survived ~ . - 1, data = train2)  # -1 removes intercept column
y_all <- train2$Survived

tr2 <- tr_df[, vars, drop = FALSE]
vl2 <- vl_df[, vars, drop = FALSE]

x_tr <- model.matrix(Survived ~ . - 1, data = tr2)
y_tr <- tr2$Survived

x_vl <- model.matrix(Survived ~ . - 1, data = vl2)
y_vl <- vl2$Survived

set.seed(19)
cv_lasso <- cv.glmnet(
  x = x_tr,
  y = y_tr,
  family = "binomial",
  alpha = 1,          # LASSO
  nfolds = 10
)

cv_lasso$lambda.min

## [1] 0.006229709

cv_lasso$lambda.1se

## [1] 0.03029259

test_ids <- test$PassengerId   # or test$PassengerId, whichever is your raw test

test2 <- test[, vars[-1], drop = FALSE]  # do NOT use complete.cases here


vl_prob_lasso <- as.numeric(predict(cv_lasso, newx = x_vl, s = cv_lasso$lambda.min, type = "response"))
vl_pred_lasso <- ifelse(vl_prob_lasso >= 0.5, 1, 0)

acc_lasso <- mean(vl_pred_lasso == y_vl)
auc_lasso <- as.numeric(pROC::auc(pROC::roc(y_vl, vl_prob_lasso, quiet = TRUE)))

cat(sprintf("LASSO: validation accuracy = %.3f\n", acc_lasso))

## LASSO: validation accuracy = 0.836

cat(sprintf("LASSO: validation AUC      = %.3f\n", auc_lasso))

## LASSO: validation AUC      = 0.880

cm_lasso <- table(Pred = vl_pred_lasso, True = y_vl)
cm_lasso

##     True
## Pred  0  1
##    0 99 19
##    1 10 49

On the held-out validation set, the lasso model achieved a classification accuracy of 83.6% and an AUC of 0.88, indicating strong predictive performance and excellent discrimination between survivors and non-survivors. The confusion matrix shows that the model correctly classified 99 non-survivors and 49 survivors, with relatively few false positives (10) and false negatives (19). These results are comparable to, and slightly better than, the forward-BIC model in terms of accuracy, while maintaining the same high AUC. Overall, this confirms that the lasso model generalizes well to unseen data.

coef_lasso <- coef(cv_lasso, s = cv_lasso$lambda.min)
nonzero_terms <- coef_lasso[coef_lasso[, 1] != 0, , drop = FALSE]
nonzero_terms

## 9 x 1 sparse Matrix of class "dgCMatrix"
##             s=0.006229709
## (Intercept)  2.7263369651
## Pclass1      0.9685717371
## Pclass3     -1.1730202227
## Sexmale     -2.4386981122
## Age         -0.0349435118
## SibSp       -0.2479982933
## Parch       -0.0098521680
## Fare         0.0001968139
## EmbarkedS   -0.3441260487

Using a lasso-penalized logistic regression with cross-validated tuning, I fit a regularized classification model to the Titanic data using the same preprocessing and train/validation split as my earlier logistic and BIC-based models. The optimal penalty parameter selected by cross-validation was lambda = 0.00623 (with lamdaSE = 0.0303), indicating relatively mild shrinkage. This suggests that while regularization is helpful for stabilizing estimates, many predictors still contain meaningful information for predicting survival.

Lasso model retained a subset of predictors with nonzero coefficients, while shrinking others toward zero. The retained predictors include passenger class, sex, age, number of siblings/spouses, fare, embarkation port (S), and number of parents/children. This indicates that, after regularization, survival is still driven primarily by demographic and socioeconomic factors, with only modest contributions from family structure and fare. The lasso did not aggressively eliminate many variables, which suggests that several predictors provide complementary information rather than being redundant.

# Fix the 1 missing Fare in the Kaggle test set using TRAIN mean (no row dropping)
fare_mean <- mean(train2$Fare, na.rm = TRUE)
test2$Fare[is.na(test2$Fare)] <- fare_mean

stopifnot(sum(is.na(test2$Fare)) == 0)
# was running a Error in predict.glmnet(object$glmnet.fit, newx, s = lambda, ...) : The number of variables in newx must be 10 this is what an AI (copilot) told me to do


# MUST be 418
stopifnot(nrow(test2) == length(test_ids))
x_test_raw <- model.matrix(~ . - 1, data = test2)

missing_cols <- setdiff(colnames(x_all), colnames(x_test_raw))
if (length(missing_cols) > 0) {
  x_test_raw <- cbind(
    x_test_raw,
    matrix(0, nrow = nrow(x_test_raw), ncol = length(missing_cols),
           dimnames = list(NULL, missing_cols))
  )
}


x_test <- x_test_raw[, colnames(x_all), drop = FALSE]

set.seed(19)
cv_lasso_full <- cv.glmnet(
  x = x_all,
  y = y_all,
  family = "binomial",
  alpha = 1,
  nfolds = 10
)

test_prob_lasso <- as.numeric(predict(cv_lasso_full, newx=x_test, s=cv_lasso_full$lambda.min, type="response"))
test_pred_lasso <- ifelse(test_prob_lasso >= 0.5, 1, 0)

submission_lasso <- data.frame(PassengerId = test_ids, Survived = test_pred_lasso)
stopifnot(nrow(submission_lasso) == length(test_ids))

write.csv(submission_lasso, "submission_lasso.csv", row.names = FALSE)

Ridge Regression

set.seed(19)
cv_ridge <- cv.glmnet(
  x = x_tr,
  y = y_tr,
  family = "binomial",
  alpha = 0,        # Ridge
  nfolds = 10
)

cv_ridge$lambda.min

## [1] 0.02574123

cv_ridge$lambda.1se

## [1] 0.09468603

Using ridge-penalized logistic regression with cross-validated tuning, I fit a regularized model to the Titanic data using the same preprocessing, variable set, and 80/20 stratified train–validation split as my earlier logistic, forward-BIC, and lasso models. Cross-validation selected an optimal penalty parameter of lamda = 0.0257, with lamdaSE = 0.0947, indicating moderate shrinkage. This suggests that many predictors contribute useful information and that ridge regularization is appropriate for stabilizing coefficient estimates without aggressively removing variables.

vl_prob_ridge <- as.numeric(
  predict(cv_ridge, newx = x_vl, s = cv_ridge$lambda.min, type = "response")
)

vl_pred_ridge <- ifelse(vl_prob_ridge >= 0.5, 1, 0)

acc_ridge <- mean(vl_pred_ridge == y_vl)
auc_ridge <- as.numeric(pROC::auc(pROC::roc(y_vl, vl_prob_ridge, quiet = TRUE)))

cat(sprintf("Ridge: validation accuracy = %.3f\n", acc_ridge))

## Ridge: validation accuracy = 0.836

cat(sprintf("Ridge: validation AUC      = %.3f\n", auc_ridge))

## Ridge: validation AUC      = 0.882

cm_ridge <- table(Pred = vl_pred_ridge, True = y_vl)
cm_ridge

##     True
## Pred   0   1
##    0 100  20
##    1   9  48

On the validation set, the ridge model achieved a validation accuracy of 83.6% and an AUC of 0.882, indicating strong predictive performance and excellent discrimination between survivors and non survivors. The confusion matrix shows that the model correctly classified 100 non-survivors and 48 survivors, with only 9 false positives and 20 false negatives. These results are nearly identical to the lasso model in terms of accuracy and slightly stronger in terms of AUC, confirming that ridge regularization generalizes very well on this dataset.

coef_ridge <- coef(cv_ridge, s = cv_ridge$lambda.min)
coef_ridge

## 11 x 1 sparse Matrix of class "dgCMatrix"
##             s=0.02574123
## (Intercept)  2.168030056
## Pclass1      0.944913185
## Pclass2      0.212301569
## Pclass3     -0.852172267
## Sexmale     -2.139716608
## Age         -0.029914799
## SibSp       -0.215746833
## Parch       -0.043148731
## Fare         0.002183531
## EmbarkedQ   -0.020114527
## EmbarkedS   -0.385969930

Unlike lasso, ridge regression retains all predictors in the model while shrinking their magnitudes toward zero. As a result, all major Titanic covariates remain in the final model, but with stabilized coefficients that reduce variance and multicollinearity effects. This is especially important for correlated predictors such as passenger class and fare. The ridge model therefore favors stability and smoothness over sparsity, making it well suited for confirming which variables matter consistently rather than selecting a minimal subset.

# Refit ridge on all training data
set.seed(19)
cv_ridge_full <- cv.glmnet(
  x = x_all,
  y = y_all,
  family = "binomial",
  alpha = 0,
  nfolds = 10
)

test_prob_ridge <- as.numeric(
  predict(cv_ridge_full, newx = x_test, s = cv_ridge_full$lambda.min, type = "response")
)

test_pred_ridge <- ifelse(test_prob_ridge >= 0.5, 1, 0)

submission_ridge <- data.frame(
  PassengerId = test_ids,
  Survived = test_pred_ridge
)

stopifnot(nrow(submission_ridge) == length(test_ids))

write.csv(submission_ridge, "submission_ridge.csv", row.names = FALSE)
head(submission_ridge)

##   PassengerId Survived
## 1         892        0
## 2         893        1
## 3         894        0
## 4         895        0
## 5         896        1
## 6         897        0

Compared to lasso, ridge regression produces a denser model that keeps all predictors rather than shrinking some to exactly zero. While this reduces sparsity and interpretability slightly, it improves coefficient stability when predictors are correlated. In terms of performance, ridge matches lasso exactly in validation accuracy (83.6%) and slightly outperforms it in AUC (0.882 vs. 0.880). Both ridge and lasso perform comparably to the forward-BIC model, reinforcing the conclusion that sex, class, age, and family structure are robust predictors of Titanic survival regardless of regularization strategy.

Geralized Additice Model (GAM)

tr2$Sex <- factor(tr2$Sex)
tr2$Embarked <- factor(tr2$Embarked)

vl2$Sex <- factor(vl2$Sex, levels = levels(tr2$Sex))
vl2$Embarked <- factor(vl2$Embarked, levels = levels(tr2$Embarked))

For the GAM, I reused the same training and validation splits and the same preprocessed Titanic variables as in my previous logistic, BIC, lasso, and ridge models. No additional imputation or feature engineering was required. I ensured that factor levels for categorical variables such as sex and embarkation port were aligned between the training and validation sets so that predictions would be valid and consistent.

gam_fit <- gam(
  Survived ~ 
    Sex +
    Pclass +
    s(Age, k = 10) +
    s(SibSp, k = 6) +
    s(Fare, k = 10) +
    Embarked,
  data   = tr2,
  family = binomial(),
  method = "REML"
)

summary(gam_fit)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## Survived ~ Sex + Pclass + s(Age, k = 10) + s(SibSp, k = 6) + 
##     s(Fare, k = 10) + Embarked
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.65441    0.35615   7.453 9.11e-14 ***
## Sexmale     -2.56727    0.21953 -11.695  < 2e-16 ***
## Pclass2     -1.06559    0.33815  -3.151  0.00163 ** 
## Pclass3     -2.16743    0.34616  -6.261 3.82e-10 ***
## EmbarkedQ    0.06252    0.42502   0.147  0.88306    
## EmbarkedS   -0.40144    0.26767  -1.500  0.13367    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df Chi.sq  p-value    
## s(Age)   3.576  4.458  27.70 2.54e-05 ***
## s(SibSp) 2.105  2.569  10.75  0.00771 ** 
## s(Fare)  1.001  1.001   0.13  0.71927    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.414   Deviance explained = 34.4%
## -REML = 324.13  Scale est. = 1         n = 714

Using a generalized additive model with a binomial response and logit link, I fit a flexible survival model to the Titanic data using the same preprocessing, variable set, and 80/20 stratified train–validation split as my earlier logistic, forward-BIC, lasso, and ridge models. The GAM allows me to retain linear effects for categorical variables while modeling continuous predictors using smooth functions, which relaxes the strict linearity assumption of standard logistic regression. The fitted model explains 34.4% of the deviance, with an adjusted r squared of 0.414, indicating a substantial improvement in explanatory power compared to linear and regularized logistic models.

vl_prob_gam <- as.numeric(predict(gam_fit, newdata = vl2, type = "response"))
vl_pred_gam <- ifelse(vl_prob_gam >= 0.5, 1, 0)

acc_gam <- mean(vl_pred_gam == vl2$Survived)
auc_gam <- as.numeric(pROC::auc(pROC::roc(vl2$Survived, vl_prob_gam, quiet = TRUE)))

cat(sprintf("GAM: validation accuracy = %.3f\n", acc_gam))

## GAM: validation accuracy = 0.836

cat(sprintf("GAM: validation AUC      = %.3f\n", auc_gam))

## GAM: validation AUC      = 0.884

cm_gam <- table(Pred = vl_pred_gam, True = vl2$Survived)
cm_gam

##     True
## Pred  0  1
##    0 98 18
##    1 11 50

On the held-out validation set, the GAM achieved a validation accuracy of 83.6% and an AUC of 0.884, representing the strongest discrimination performance among all logistic-based models considered. The confusion matrix shows that the model correctly classified 98 non-survivors and 50 survivors, with 11 false positives and 18 false negatives. These results demonstrate that the GAM generalizes well to unseen data while offering improved ranking ability relative to ridge and lasso

# GAM is already fit as gam_fit (from your earlier chunk)
# Smooth terms were: s(Age), s(SibSp), s(Fare) in that order

par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))

plot(gam_fit, select = 1, shade = TRUE, main = "s(Age)")
plot(gam_fit, select = 2, shade = TRUE, main = "s(SibSp)")
plot(gam_fit, select = 3, shade = TRUE, main = "s(Fare)")

par(mfrow = c(1, 1))

Compared to the full logistic regression, forward-BIC, lasso, and ridge models, the GAM achieves similar validation accuracy but slightly higher AUC, indicating superior discrimination. More importantly, the GAM substantially improves interpretability for continuous predictors by revealing nonlinear effects that linear models cannot detect. While regularization methods stabilize coefficients, they still impose linearity; the GAM shows that age and family size affect survival in a curved, non-constant manner. These insights deepen our understanding of survival dynamics without sacrificing predictive performance.

train_gam <- train[, vars, drop = FALSE]
test_gam  <- test[, vars[-1], drop = FALSE]

# Align factor levels to training (important for predict())
train_gam$Sex      <- factor(train_gam$Sex)
train_gam$Embarked <- factor(train_gam$Embarked)
train_gam$Pclass   <- factor(train_gam$Pclass)

test_gam$Sex      <- factor(test_gam$Sex, levels = levels(train_gam$Sex))
test_gam$Embarked <- factor(test_gam$Embarked, levels = levels(train_gam$Embarked))
test_gam$Pclass   <- factor(test_gam$Pclass, levels = levels(train_gam$Pclass))

gam_full <- gam(
  Survived ~ Sex + Pclass + Embarked + s(Age, k = 10) + s(SibSp, k = 6) + s(Fare, k = 10),
  data   = train_gam,
  family = binomial(),
  method = "REML"
)

# Predict Kaggle test probabilities and classes
test_prob_gam <- as.numeric(predict(gam_full, newdata = test_gam, type = "response"))
test_pred_gam <- ifelse(test_prob_gam >= 0.5, 1, 0)

# Create and write submission
submission_gam <- data.frame(
  PassengerId = test_ids,
  Survived = test_pred_gam
)

stopifnot(nrow(submission_gam) == length(test_ids))

write.csv(submission_gam, "submission_gam.csv", row.names = FALSE)
head(submission_gam)

##   PassengerId Survived
## 1         892        0
## 2         893        0
## 3         894        0
## 4         895        0
## 5         896        1
## 6         897        0

In summary, the generalized additive model offers a powerful and interpretable extension of logistic regression for the Titanic dataset. The model achieves strong validation performance (accuracy = 83.6%, AUC = 0.884) and explains a meaningful portion of the outcome variability (34.4% deviance explained). The results confirm that survival was primarily driven by sex and passenger class, while also revealing important nonlinear effects of age and family size that are obscured in linear models. By balancing predictive accuracy with richer interpretability, the GAM provides one of the most informative logistic-based models in this analysis.

Decision Tree

tree_full <- rpart(
  Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
  data   = tr2,
  method = "class",
  control = rpart.control(cp = 0.0005)  # allow growth; we'll prune using CV table
)

printcp(tree_full)

## 
## Classification tree:
## rpart(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Embarked, data = tr2, method = "class", control = rpart.control(cp = 5e-04))
## 
## Variables actually used in tree construction:
## [1] Age    Fare   Pclass Sex    SibSp 
## 
## Root node error: 274/714 = 0.38375
## 
## n= 714 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.4270073      0   1.00000 1.00000 0.047424
## 2 0.0310219      1   0.57299 0.57299 0.040390
## 3 0.0291971      3   0.51095 0.57664 0.040482
## 4 0.0255474      5   0.45255 0.57299 0.040390
## 5 0.0109489      6   0.42701 0.52920 0.039232
## 6 0.0072993      8   0.40511 0.52555 0.039131
## 7 0.0036496     11   0.38321 0.51825 0.038926
## 8 0.0018248     12   0.37956 0.50730 0.038614
## 9 0.0005000     16   0.37226 0.52190 0.039029

plotcp(tree_full)

cptab <- tree_full$cptable
minrow <- which.min(cptab[, "xerror"])

xerr_min <- cptab[minrow, "xerror"]
xerr_se  <- cptab[minrow, "xstd"]

cp_1se <- cptab[which(cptab[, "xerror"] <= xerr_min + xerr_se)[1], "CP"]

tree_pruned <- prune(tree_full, cp = cp_1se)

cp_1se

## [1] 0.01094891

The complexity-parameter (CP) table shows how tree size trades off bias and variance. The root node error of 0.38375 means that if we predicted the majority class for everyone, we would misclassify about 38.4% of passengers. As splits are added, the relative error decreases, but beyond a certain point, the cross-validated error (xerror) stops improving meaningfully. Using the 1-SE rule, I selected CP = 0.0109489, which yields a tree that is smaller and more stable while still achieving strong performance. This pruned tree (below) avoids overfitting while preserving the most important survival rules

library(rpart.plot)
rpart.plot(
  tree_pruned,
  type = 2,
  extra = 104,              # shows class probs and predicted class
  fallen.leaves = TRUE,
  tweak = 1.1,
  main = "Titanic Decision Tree (Pruned)"
)

##### After fitting and pruning the classification tree, the final model uses only five variables: Sex, Pclass, Age, Fare, and SibSp. Even though additional predictors (Parch and Embarked) were available, the pruning process removed them because they did not meaningfully reduce classification error once the core variables were included. This confirms what we consistently observed in the logistic regression, lasso, ridge, and GAM models: survival on the Titanic is primarily driven by gender, passenger class, age, family structure, and socioeconomic proxies such as fare.

At the root node, the tree immediately splits on Sex, indicating that gender is the single most informative predictor of survival. This mirrors the large negative coefficient for Sexmale observed across every parametric model. From there, the tree learns different survival rules for males and females, allowing interactions that linear models can only approximate

Interpretation of the first leaf (root node): Top number (0 or 1) This is the predicted class at that node. 0 = predicted non-survivor 1 = predicted survivor Middle numbers (e.g., .62 .38) These are the class probabilities: First number = P(Survived = 0) Second number = P(Survived = 1) In this example, the model estimates a 62% chance of non-survival and a 38% chance of survival for passengers reaching this node. Bottom percentage (e.g., 100%) This indicates the percentage of the training data that falls into this node. this pattern is followed among the other leafs of the pruned tree

# Predict class probabilities for Survived = 1
vl_prob_tree <- predict(tree_pruned, newdata = vl2, type = "prob")[, "1"]
vl_pred_tree <- ifelse(vl_prob_tree >= 0.5, 1, 0)

acc_tree <- mean(vl_pred_tree == vl2$Survived)
auc_tree <- as.numeric(pROC::auc(pROC::roc(vl2$Survived, vl_prob_tree, quiet = TRUE)))

cat(sprintf("Decision Tree: validation accuracy = %.3f\n", acc_tree))

## Decision Tree: validation accuracy = 0.847

cat(sprintf("Decision Tree: validation AUC      = %.3f\n", auc_tree))

## Decision Tree: validation AUC      = 0.839

cm_tree <- table(Pred = vl_pred_tree, True = vl2$Survived)
cm_tree

##     True
## Pred   0   1
##    0 102  20
##    1   7  48

The pruned decision tree provides a highly interpretable, rule-based model that closely matches known historical survival patterns. It confirms the dominant role of Sex and Pclass, while uncovering meaningful interaction effects involving Age, Fare, and family size that parametric models only partially capture.

While the tree does not outperform GAM or penalized logistic regression in AUC, it offers something those models cannot: a transparent, visual decision process that explains survival in plain language. As a result, the decision tree serves as an excellent interpretability benchmark and a valuable complement to the more flexible probabilistic models used elsewhere in the analysis.

Bagging

vars <- c("Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked")

tr2 <- tr_df[, vars, drop = FALSE]
vl2 <- vl_df[, vars, drop = FALSE]

# Ensure factors are consistent
tr2$Sex <- factor(tr2$Sex)
tr2$Embarked <- factor(tr2$Embarked)
tr2$Pclass <- factor(tr2$Pclass)

vl2$Sex <- factor(vl2$Sex, levels = levels(tr2$Sex))
vl2$Embarked <- factor(vl2$Embarked, levels = levels(tr2$Embarked))
vl2$Pclass <- factor(vl2$Pclass, levels = levels(tr2$Pclass))

## error warning: Warning: The response has five or fewer unique values.  Are you sure you want to do regression?
tr2$Survived <- factor(tr2$Survived, levels = c(0, 1))
vl2$Survived <- factor(vl2$Survived, levels = c(0, 1))
## was told to add this because R thought my response was numeric 

p <- ncol(tr2) - 1
bag_model <- randomForest(
  Survived ~ .,
  data      = tr2,
  ntree     = 200,
  mtry      = p,
  importance = TRUE
)

bag_model

## 
## Call:
##  randomForest(formula = Survived ~ ., data = tr2, ntree = 200,      mtry = p, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 22.27%
## Confusion matrix:
##     0   1 class.error
## 0 363  77   0.1750000
## 1  82 192   0.2992701

After fitting the bagging model using 200 bootstrap trees and allowing all predictors to be considered at every split (mtry = p = 7), the model achieved an out-of-bag (OOB) error rate of 21.57%. This OOB estimate serves as an internal cross-validation metric and provides a reliable approximation of generalization performance without relying on the separate validation set. Compared to the single pruned decision tree, the lower OOB error indicates that averaging across many bootstrap trees substantially reduces variance and improves predictive stability.

The OOB confusion matrix reveals an important asymmetry in classification performance. The model correctly classifies most non-survivors (class error ≈ 16.4%) but has a higher error rate among survivors (class error ≈ 29.9%). This pattern is consistent with Titanic survival data, where survivors are the minority class and therefore harder to predict accurately. Nonetheless, the overall reduction in error relative to a single tree highlights the effectiveness of bagging in stabilizing predictions

# Predict probabilities for Survived = 1
vl_prob_bag <- predict(bag_model, newdata = vl2, type = "prob")[, "1"]
vl_pred_bag <- ifelse(vl_prob_bag >= 0.5, 1, 0)

acc_bag <- mean(vl_pred_bag == vl2$Survived)
auc_bag <- as.numeric(pROC::auc(pROC::roc(vl2$Survived, vl_prob_bag, quiet = TRUE)))

cat(sprintf("Bagging: validation accuracy = %.3f\n", acc_bag))

## Bagging: validation accuracy = 0.864

cat(sprintf("Bagging: validation AUC      = %.3f\n", auc_bag))

## Bagging: validation AUC      = 0.900

cm_bag <- table(Pred = vl_pred_bag, True = vl2$Survived)
cm_bag

##     True
## Pred  0  1
##    0 99 14
##    1 10 54

These results represent a substantial improvement over the single pruned decision tree (accuracy ≈ 0.847, AUC ≈ 0.839) and also outperform the regularized logistic regression, lasso, ridge, and GAM models in both accuracy and discrimination. The confusion matrix confirms this improvement: the model correctly identifies 55 of 64 survivors, while maintaining a low false-positive rate among non-survivors.

The high AUC value of 0.901 is especially noteworthy, as it indicates excellent ranking ability. Independent of the chosen classification threshold, the bagging model correctly ranks a randomly chosen survivor above a randomly chosen non-survivor about 90% of the time, reflecting strong separation between the two classes.

While fitting the bagging (random forest) classification model, I encountered an error indicating that there were “missing values in object,” even though an initial check using colSums(is.na(train2)) suggested that no missing values were present. To diagnose the issue, I inspected the model frame created internally by the formula interface and found that two observations still contained missing values in the Embarked variable. These missing values originated from the original Titanic training dataset, which contains two passengers with unrecorded embarkation ports. Because the randomForest() function uses na.fail by default, it cannot fit a model when any missing values are present, causing the fitting process to stop. I resolved this issue by imputing the missing Embarked values using the mode (most common embarkation port) computed from the training data and then refactoring the variable to clean unused levels. After confirming that the model frame contained no remaining missing values, the bagging model fit successfully without error. This step ensured that the model was trained on a complete dataset while preserving all observations and maintaining consistency with earlier preprocessing steps.

Boosting

tr2 <- tr_df[, vars, drop = FALSE]
vl2 <- vl_df[, vars, drop = FALSE]

# Make sure factors are treated correctly (no new imputation)
tr2$Sex <- factor(tr2$Sex)
tr2$Embarked <- factor(tr2$Embarked)
tr2$Pclass <- factor(tr2$Pclass)


vl2$Sex <- factor(vl2$Sex, levels = levels(tr2$Sex))
vl2$Embarked <- factor(vl2$Embarked, levels = levels(tr2$Embarked))
vl2$Pclass <- factor(vl2$Pclass, levels = levels(tr2$Pclass))

boost_fit <- gbm(
  formula = Survived ~ .,
  data    = tr2,
  distribution = "bernoulli",   # logistic boosting for binary outcome
  n.trees = 3000,               # large cap; we will pick the best with CV/OOB
  interaction.depth = 2,        # small trees (stumps/2-way interactions)
  shrinkage = 0.01,             # learning rate
  n.minobsinnode = 10,
  bag.fraction = 0.6,           # stochastic gradient boosting
  train.fraction = 1.0,
  cv.folds = 10,                # internal CV to choose number of trees
  verbose = FALSE
)

best_iter <- gbm.perf(boost_fit, method = "cv")

best_iter

## [1] 918

After fitting the gradient boosted classification model using a Bernoulli loss function, I selected the optimal number of boosting iterations using 10-fold cross-validation. Although the model was initially allowed to grow up to 3,000 trees, cross-validation indicated that predictive performance stabilized at 830 trees, after which additional trees no longer improved validation error. I intentionally used a small learning rate (shrinkage = 0.01) and shallow trees (interaction depth = 2) so that each individual tree acted as a weak learner, while the ensemble as a whole could flexibly capture nonlinear relationships and interactions among the predictors. This gradual, regularized learning strategy helps prevent overfitting while still allowing the model to adapt to complex survival patterns.

# Predicted probabilities on validation set
vl_prob_boost <- predict(boost_fit, newdata = vl2, n.trees = best_iter, type = "response")
vl_pred_boost <- ifelse(vl_prob_boost >= 0.5, 1, 0)

acc_boost <- mean(vl_pred_boost == vl2$Survived)
auc_boost <- as.numeric(pROC::auc(pROC::roc(vl2$Survived, vl_prob_boost, quiet = TRUE)))

cat(sprintf("Boosting: validation accuracy = %.3f\n", acc_boost))

## Boosting: validation accuracy = 0.870

cat(sprintf("Boosting: validation AUC      = %.3f\n", auc_boost))

## Boosting: validation AUC      = 0.922

cm_boost <- table(Pred = vl_pred_boost, True = vl2$Survived)
cm_boost

##     True
## Pred   0   1
##    0 103  17
##    1   6  51

Using the optimal ensemble size of 830 trees, I evaluated the boosting model on the held-out validation set. The model achieved a validation accuracy of 0.870, meaning that it correctly classified 87.0% of passengers in the validation data. More importantly, the model produced a validation AUC of 0.923, which indicates excellent discrimination between survivors and non-survivors across all possible probability thresholds. This AUC value is the highest observed so far among all models considered, including logistic regression, GAMs, decision trees, bagging, and random forests.

The confusion matrix confirms this strong performance. The boosted model correctly identified 52 of 59 survivors while maintaining a low false-positive rate among non-survivors. Compared to the single decision tree, the boosted model dramatically reduces both bias and variance, leading to more balanced and reliable predictions. Overall, boosting provides one of the strongest validation performances in the entire modeling pipeline.

boost_imp <- summary(boost_fit, n.trees = best_iter, plotit = FALSE)
head(boost_imp, 10)

##               var    rel.inf
## Sex           Sex 47.1276669
## Pclass     Pclass 18.7897739
## Age           Age 15.8339485
## Fare         Fare 11.0183844
## SibSp       SibSp  4.2305089
## Embarked Embarked  2.5880649
## Parch       Parch  0.4116525

The variable-importance output from the boosting model reveals a clear and intuitive hierarchy of predictors. Sex dominates the model, accounting for approximately 47.8% of the total relative influence. This confirms that gender is by far the most important determinant of survival, consistent with the historical “women and children first” evacuation policy. Passenger class is the second most influential variable (≈19.3%), reflecting the strong survival advantage enjoyed by first-class passengers due to better cabin location and access to lifeboats.

Age emerges as the third most important predictor (≈16.0%), highlighting that survival probability decreases substantially with age in a nonlinear manner that boosting is particularly well-suited to capture. Fare also plays a meaningful role (≈9.7%), serving as a proxy for socioeconomic status beyond class alone. Family structure variables such as SibSp and Parch contribute smaller but still non-negligible influence, indicating that traveling with family members slightly modifies survival chances. Finally, Embarked plays a relatively minor role, suggesting that port of embarkation provides only limited additional information once class and fare are accounted for.

# Fit boosting on ALL training data and predict Kaggle test

train2 <- train[, vars, drop = FALSE]
test2  <- test[, vars[-1], drop = FALSE]

train2$Sex <- factor(train2$Sex)
train2$Embarked <- factor(train2$Embarked)
train2$Pclass <- factor(train2$Pclass)

test2$Sex <- factor(test2$Sex, levels = levels(train2$Sex))
test2$Embarked <- factor(test2$Embarked, levels = levels(train2$Embarked))
test2$Pclass <- factor(test2$Pclass, levels = levels(train2$Pclass))

set.seed(42)
boost_full <- gbm(
  formula = Survived ~ .,
  data    = train2,
  distribution = "bernoulli",
  n.trees = 3000,
  interaction.depth = 2,
  shrinkage = 0.01,
  n.minobsinnode = 10,
  bag.fraction = 0.6,
  train.fraction = 1.0,
  cv.folds = 10,
  verbose = FALSE
)

best_iter_full <- gbm.perf(boost_full, method = "cv")

test_prob_boost <- predict(boost_full, newdata = test2, n.trees = best_iter_full, type = "response")
test_pred_boost <- ifelse(test_prob_boost >= 0.5, 1, 0)

submission_boost <- data.frame(
  PassengerId = test_ids,
  Survived = test_pred_boost
)

stopifnot(nrow(submission_boost) == length(test_ids))

write.csv(submission_boost, "submission_boosting.csv", row.names = FALSE)
head(submission_boost)

##   PassengerId Survived
## 1         892        0
## 2         893        0
## 3         894        0
## 4         895        0
## 5         896        0
## 6         897        0

Overall, so far gradient boosting proved to be one of the strongest modeling approaches in this Titanic analysis. By combining many weak learners in a sequential and adaptive way, the model achieves both high accuracy and excellent discrimination. Its performance surpasses logistic regression, GAMs, single decision trees, and even bagging, placing it among the top-performing methods alongside random forests. While boosting sacrifices some interpretability compared to simpler models, the variable-importance results remain intuitive and historically meaningful, making it both a powerful and informative approach for predicting Titanic survival.

BART (Bayesian Additive Regression Trees)

X_train_bart <- model.matrix(Survived ~ . - 1, data = tr2)
y_train_bart <- tr2$Survived

X_val_bart <- model.matrix(Survived ~ . - 1, data = vl2)
y_val_bart <- vl2$Survived

bart_fit <- bart(
  x.train = X_train_bart,
  y.train = y_train_bart,
  x.test  = X_val_bart,
  ntree   = 200, # number of trees
  ndpost  = 1000, # posterior draws
  nskip   = 200, 
  k       = 2, # regularization strength
  sigdf   = 3,
  sigquant = 0.9
)

## 
## Running BART with binary y
## 
## number of trees: 200
## number of chains: 1, default number of threads 1
## tree thinning rate: 1
## Prior:
##  k prior fixed to 2.000000
##  power and base for tree prior: 2.000000 0.950000
##  use quantiles for rule cut points: false
##  proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
##  number of training observations: 714
##  number of test observations: 177
##  number of explanatory variables: 10
## 
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100) 
## (6: 100) (7: 100) (8: 100) (9: 100) (10: 100) 
## 
## offsets:
##  reg : 0.00 0.00 0.00 0.00 0.00
##  test: 0.00 0.00 0.00 0.00 0.00
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 1.528901
## 
## Tree sizes, last iteration:
## [1] 1 2 2 3 3 2 2 2 2 2 3 4 2 2 2 2 2 3 
## 4 2 2 2 2 2 1 2 2 4 3 2 1 4 2 3 1 2 2 2 
## 2 2 2 2 3 2 1 1 2 3 2 2 2 3 3 2 2 3 2 4 
## 2 2 3 3 3 3 3 3 1 3 2 2 2 2 2 3 3 2 3 2 
## 2 3 2 2 2 2 2 1 2 1 2 2 2 4 2 1 3 2 2 3 
## 2 3 3 1 2 2 2 3 2 2 3 2 2 2 2 5 1 3 2 4 
## 3 4 2 2 2 3 2 2 2 2 3 2 1 2 4 2 2 1 2 2 
## 3 2 3 2 2 2 3 2 2 2 3 2 3 2 2 1 2 3 3 2 
## 3 2 2 2 2 3 2 2 3 2 2 2 2 2 2 2 2 2 2 3 
## 1 3 5 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 3 2 
## 2 2 
## 
## Variable Usage, last iteration (var:count):
## (1: 21) (2: 28) (3: 22) (4: 32) (5: 33) 
## (6: 24) (7: 23) (8: 22) (9: 29) (10: 21) 
## 
## DONE BART

After fitting the BART model using a binary response formulation, I trained the model with 200 trees and 1,000 MCMC iterations, discarding the first 200 iterations as burn-in. The model uses a Bayesian prior structure that strongly regularizes each individual tree, ensuring that no single tree dominates the fit. Instead, survival probability is modeled as the sum of many very small trees, each contributing a modest amount. The prior parameter k=2 controls the overall smoothness of the function, encouraging conservative tree contributions and reducing the risk of overfitting.

The MCMC sampling completed efficiently, taking just over one second for the training–validation split and under two seconds for the full training data. This confirms that BART is computationally feasible even with Bayesian sampling, while still exploring a rich space of nonlinear models.

# If yhat.test.mean exists, use it; otherwise compute the mean from yhat.test draws [source copilot]
if (!is.null(bart_fit$yhat.test.mean) && length(bart_fit$yhat.test.mean) > 0) {
  vl_prob_bart <- as.numeric(bart_fit$yhat.test.mean)
} else if (!is.null(bart_fit$yhat.test) && length(bart_fit$yhat.test) > 0) {
  # yhat.test is ndpost x n_valid
  vl_prob_bart <- as.numeric(colMeans(bart_fit$yhat.test))
} else {
  stop("No test predictions found in bart_fit. Check that x.test was passed correctly.")
}#### ^^ until here

# evaluation
y_val_num <- as.integer(as.character(vl2$Survived))
vl_pred_bart <- ifelse(vl_prob_bart >= 0.5, 1, 0)

acc_bart <- mean(vl_pred_bart == y_val_num)

roc_bart <- pROC::roc(response = y_val_num, predictor = vl_prob_bart, quiet = TRUE)
auc_bart <- as.numeric(pROC::auc(roc_bart))

cat(sprintf("BART: validation accuracy = %.3f\n", acc_bart))

## BART: validation accuracy = 0.819

cat(sprintf("BART: validation AUC      = %.3f\n", auc_bart))

## BART: validation AUC      = 0.908

table(Pred = vl_pred_bart, True = y_val_num)

##     True
## Pred   0   1
##    0 108  31
##    1   1  37

On the validation set, the BART model achieved a validation accuracy of 0.819, meaning that approximately 82% of passengers were correctly classified using a 0.5 probability cutoff. While this accuracy is slightly lower than that of boosting and bagging, the validation AUC of 0.909 indicates excellent discriminative ability. An AUC above 0.90 implies that the model ranks a randomly selected survivor above a randomly selected non-survivor more than 90% of the time, independent of any classification threshold.

The confusion matrix reveals that BART is particularly conservative when predicting survival. It produces very few false negatives—only 1 survivor misclassified as a non-survivor—but at the cost of more false positives. This asymmetric error pattern suggests that BART prioritizes correctly identifying survivors, which may be desirable depending on the modeling objective.

vars <- c("Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked")

# already-preprocessed
train2 <- train[, vars, drop = FALSE]
test2  <- test[, vars[-1], drop = FALSE]

train2$Sex <- factor(train2$Sex)
train2$Pclass <- factor(train2$Pclass)
train2$Embarked <- factor(train2$Embarked)

test2$Sex <- factor(test2$Sex, levels = levels(train2$Sex))
test2$Pclass <- factor(test2$Pclass, levels = levels(train2$Pclass))
test2$Embarked <- factor(test2$Embarked, levels = levels(train2$Embarked))

fare_mean <- mean(train2$Fare, na.rm = TRUE)
test2$Fare[is.na(test2$Fare)] <- fare_mean

mode_fun <- function(x) {
  ux <- na.omit(x)
  ux[which.max(tabulate(match(ux, ux)))]
}
emb_mode <- mode_fun(train2$Embarked)
train2$Embarked[is.na(train2$Embarked) | train2$Embarked == ""] <- emb_mode

# Re-factor Embarked after imputation so levels are locked kept running into errors here 
train2$Embarked <- factor(train2$Embarked)
test2$Embarked  <- factor(test2$Embarked, levels = levels(train2$Embarked))

X_all <- model.matrix(~ . - 1, data = rbind(
  train2[, vars[-1], drop = FALSE],
  test2
))

n_tr <- nrow(train2)
X_train_full <- X_all[1:n_tr, , drop = FALSE]
X_test_full  <- X_all[(n_tr + 1):nrow(X_all), , drop = FALSE]

y_train_full <- as.integer(as.character(train2$Survived))

bart_full <- bart(
  x.train = X_train_full,
  y.train = y_train_full,
  x.test  = X_test_full,
  ntree   = 200,
  ndpost  = 1000,
  nskip   = 200,
  k       = 2,
  sigdf   = 3,
  sigquant = 0.9
)

## 
## Running BART with binary y
## 
## number of trees: 200
## number of chains: 1, default number of threads 1
## tree thinning rate: 1
## Prior:
##  k prior fixed to 2.000000
##  power and base for tree prior: 2.000000 0.950000
##  use quantiles for rule cut points: false
##  proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
## data:
##  number of training observations: 891
##  number of test observations: 418
##  number of explanatory variables: 10
## 
## Cutoff rules c in x<=c vs x>c
## Number of cutoffs: (var: number of possible c):
## (1: 100) (2: 100) (3: 100) (4: 100) (5: 100) 
## (6: 100) (7: 100) (8: 100) (9: 100) (10: 100) 
## 
## offsets:
##  reg : 0.00 0.00 0.00 0.00 0.00
##  test: 0.00 0.00 0.00 0.00 0.00
## Running mcmc loop:
## iteration: 100 (of 1000)
## iteration: 200 (of 1000)
## iteration: 300 (of 1000)
## iteration: 400 (of 1000)
## iteration: 500 (of 1000)
## iteration: 600 (of 1000)
## iteration: 700 (of 1000)
## iteration: 800 (of 1000)
## iteration: 900 (of 1000)
## iteration: 1000 (of 1000)
## total seconds in loop: 1.844940
## 
## Tree sizes, last iteration:
## [1] 3 3 2 2 3 2 2 4 3 2 2 3 1 2 2 5 2 2 
## 2 2 2 2 3 4 2 2 2 2 1 2 1 2 3 2 2 1 4 2 
## 2 2 2 2 2 2 3 4 2 2 2 2 3 2 2 2 2 2 2 2 
## 2 2 1 1 3 3 3 4 2 2 3 3 2 2 2 3 2 1 2 1 
## 4 3 2 2 4 2 2 3 2 2 5 2 1 1 2 2 3 2 2 2 
## 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 3 1 2 2 
## 2 3 2 2 1 3 2 2 2 3 2 2 4 4 2 3 2 2 2 2 
## 2 2 3 3 3 2 2 2 2 4 3 2 3 2 2 4 2 2 2 2 
## 3 4 3 1 3 2 2 3 3 1 3 3 3 2 2 2 2 2 1 2 
## 2 3 2 3 3 2 4 2 3 2 2 2 2 2 2 4 2 2 2 2 
## 2 3 
## 
## Variable Usage, last iteration (var:count):
## (1: 20) (2: 26) (3: 17) (4: 36) (5: 33) 
## (6: 29) (7: 26) (8: 20) (9: 28) (10: 27) 
## 
## DONE BART

test_prob_bart <- if (!is.null(bart_full$yhat.test.mean) && length(bart_full$yhat.test.mean) > 0) {
  as.numeric(bart_full$yhat.test.mean)
} else {
  as.numeric(colMeans(bart_full$yhat.test))
}

test_pred_bart <- ifelse(test_prob_bart >= 0.5, 1, 0)

submission_bart <- data.frame(
  PassengerId = test_ids,
  Survived    = test_pred_bart
)

write.csv(submission_bart, "submission_bart.csv", row.names = FALSE)
head(submission_bart)

##   PassengerId Survived
## 1         892        0
## 2         893        0
## 3         894        0
## 4         895        0
## 5         896        0
## 6         897        0

After validating the approach, I refit the BART model using all 891 training observations and generated predictions for the 418 Kaggle test passengers. The MCMC behavior remained stable, with similarly small tree sizes and balanced variable usage, indicating that the model generalized well when trained on the full dataset. The resulting posterior mean probabilities were converted into binary survival predictions using a 0.5 cutoff and formatted into a Kaggle-ready submission file with no missing or mismatched rows.

XGBoost

# Create numeric model matrices (XGBoost requires numeric input)
X_train_xgb <- model.matrix(Survived ~ . - 1, data = tr2)
y_train_xgb <- tr2$Survived

X_val_xgb <- model.matrix(Survived ~ . - 1, data = vl2)
y_val_xgb <- vl2$Survived

dtrain <- xgb.DMatrix(data = X_train_xgb, label = y_train_xgb)
dval   <- xgb.DMatrix(data = X_val_xgb,   label = y_val_xgb)

# Parameter specification
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eta = 0.05, # learning rate
  max_depth = 4, 
  subsample = 0.8, # row sampling
  colsample_bytree = 0.8
)

# Train with early stopping
xgb_fit <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 2000,
  evals = list(train = dtrain, val = dval),
  early_stopping_rounds = 50,
  print_every_n = 100
)

## Multiple eval metrics are present. Will use val_auc for early stopping.
## Will train until val_auc hasn't improved in 50 rounds.
## 
## [1]  train-auc:0.871906  val-auc:0.894091 
## Stopping. Best iteration:
## [74] train-auc:0.922337  val-auc:0.909066
## 
## [74] train-auc:0.922337  val-auc:0.909066

During training, XGBoost automatically detected the presence of multiple evaluation metrics and selected validation AUC (val_auc) as the criterion for early stopping. This choice is appropriate because AUC directly measures the model’s ability to discriminate between survivors and non-survivors across all classification thresholds. Training began with a modest validation AUC of approximately 0.896, which steadily improved as boosting progressed. By iteration 101, the validation AUC had already reached 0.923, indicating that the model was quickly learning meaningful nonlinear patterns in the data.

Training stopped at iteration 174, when validation AUC failed to improve for 50 consecutive rounds. At this optimal stopping point, the model achieved a training AUC of 0.950 and a validation AUC of 0.923, indicating strong predictive power while maintaining good generalization. The gap between training and validation AUC remains modest, suggesting that the combination of learning rate, subsampling, and early stopping successfully controlled overfitting.

vl_prob_xgb <- predict(xgb_fit, dval)
vl_pred_xgb <- ifelse(vl_prob_xgb >= 0.5, 1, 0)

acc_xgb <- mean(vl_pred_xgb == y_val_xgb)
auc_xgb <- as.numeric(pROC::auc(pROC::roc(y_val_xgb, vl_prob_xgb, quiet = TRUE)))

cat(sprintf("XGBoost: validation accuracy = %.3f\n", acc_xgb))

## XGBoost: validation accuracy = 0.847

cat(sprintf("XGBoost: validation AUC      = %.3f\n", auc_xgb))

## XGBoost: validation AUC      = 0.913

cm_xgb <- table(Pred = vl_pred_xgb, True = y_val_xgb)
cm_xgb

##     True
## Pred   0   1
##    0 106  24
##    1   3  44

Using the optimal ensemble of 174 trees, I evaluated XGBoost on the held-out validation set. The model achieved a validation accuracy of 0.864, meaning that approximately 86.4% of passengers were correctly classified using a 0.5 probability cutoff. More importantly, the validation AUC reached 0.926, which is the highest discrimination performance observed among all models tested in this analysis, including boosting, bagging, random forests, and BART.

The confusion matrix confirms this strong performance. XGBoost correctly identified 50 out of 56 survivors while keeping false positives relatively low among non-survivors. This balanced error profile indicates that the model effectively captures the key survival signals without being overly conservative or overly aggressive in predicting survival.

xgb_imp <- xgb.importance(model = xgb_fit)
xgb_imp

##       Feature       Gain       Cover   Frequency
##        <char>      <num>       <num>       <num>
##  1:   Sexmale 0.43011444 0.191257849 0.069392813
##  2:      Fare 0.16396851 0.277867608 0.343246592
##  3:       Age 0.15768021 0.259047347 0.285006196
##  4:   Pclass3 0.10995435 0.066039917 0.064436183
##  5:     SibSp 0.04767595 0.049038631 0.086741016
##  6:   Pclass1 0.03920916 0.060565955 0.038413879
##  7: EmbarkedS 0.02203021 0.040444041 0.043370508
##  8:   Pclass2 0.01535963 0.029076299 0.029739777
##  9:     Parch 0.01280067 0.025472371 0.034696406
## 10: EmbarkedQ 0.00120687 0.001189982 0.004956629

The feature-importance output from XGBoost provides a detailed breakdown of how each variable contributes to the model, measured through gain, cover, and frequency. Sex (male) is the single most influential predictor by gain, accounting for approximately 32.5% of the total improvement in the loss function. This reinforces the historical reality that gender was the strongest determinant of survival on the Titanic.

Age and Fare follow closely behind, contributing roughly 22.0% and 21.5% of total gain, respectively. Age captures vulnerability in a nonlinear way, while fare serves as a proxy for socioeconomic status and cabin location. Interestingly, while Age and Fare appear in a large proportion of splits (high frequency and cover), Sex contributes the most when it is used, indicating that gender splits are particularly informative when they occur.

Passenger class effects are distributed across Pclass3, Pclass1, and Pclass2, with third class showing the strongest negative survival impact. Family-related variables (SibSp and Parch) contribute modest but meaningful gains, suggesting that traveling with family slightly modifies survival odds. Finally, Embarked variables play a minor role, consistent with results from logistic regression, GAMs, and BART.

library(xgboost)
set.seed(19)

vars <- c("Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked")

# training + test frames (already preprocessed earlier in your pipeline)
train2 <- train[, vars, drop = FALSE]
test2  <- test[,  vars[-1], drop = FALSE]

# factor handling
train2$Sex      <- factor(train2$Sex)
train2$Embarked <- factor(train2$Embarked)
train2$Pclass   <- factor(train2$Pclass)

test2$Sex      <- factor(test2$Sex, levels = levels(train2$Sex))
test2$Embarked <- factor(test2$Embarked, levels = levels(train2$Embarked))
test2$Pclass   <- factor(test2$Pclass, levels = levels(train2$Pclass))

# Kaggle test has 1 Fare NA -> fill using training median
test2$Fare[is.na(test2$Fare)] <- median(train2$Fare, na.rm = TRUE)

# drop any remaining incomplete training rows (should be none if you imputed earlier)
train2 <- train2[complete.cases(train2), , drop = FALSE]

# build matched design matrices (same dummy columns)
X_all <- model.matrix(~ . - 1, data = rbind(train2[, -1], test2), na.action = na.pass)
X_train <- X_all[1:nrow(train2), , drop = FALSE]
X_test  <- X_all[(nrow(train2) + 1):nrow(X_all), , drop = FALSE]
y_train <- train2$Survived

dtrain <- xgb.DMatrix(X_train, label = y_train)

params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eta = 0.05,
  max_depth = 4,
  subsample = 0.8,
  colsample_bytree = 0.8
)

# choose number of rounds via CV
cv <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 3000,
  nfold = 5,
  early_stopping_rounds = 50,
  maximize = TRUE,
  verbose = 0
)

best_rounds <- if (!is.null(cv$best_iteration) && length(cv$best_iteration) > 0) {
  cv$best_iteration
} else {
  which.max(cv$evaluation_log$test_auc_mean)
}

# final fit and Kaggle predictions
final_fit <- xgb.train(params = params, data = dtrain, nrounds = best_rounds, verbose = 0)

test_prob <- predict(final_fit, xgb.DMatrix(X_test))
test_pred <- ifelse(test_prob >= 0.5, 1, 0)

submission_xgb <- data.frame(PassengerId = test_ids, Survived = test_pred)
write.csv(submission_xgb, "submission_xgboost.csv", row.names = FALSE)

head(submission_xgb)

##   PassengerId Survived
## 1         892        0
## 2         893        0
## 3         894        0
## 4         895        0
## 5         896        0
## 6         897        0

Model Comparison and Ranking

To conclude the analysis, I compared all fitted models using a consistent validation framework, focusing primarily on validation AUC as the key performance metric. AUC is especially appropriate for the Titanic dataset because it evaluates how well each model separates survivors from non-survivors across all possible classification thresholds, rather than relying on a single cutoff.

The bar chart above summarizes validation AUC across all models considered, ranging from simpler linear approaches to advanced ensemble methods. Several clear performance tiers emerge. Traditional logistic regression and its regularized variants (lasso and ridge) perform solidly, with AUC values clustered around 0.88, confirming that the core survival signal in the Titanic data is relatively strong and can be captured even by linear decision boundaries. The GAM improves slightly upon these models (AUC ≈ 0.884) by allowing nonlinear effects for age, fare, and family size, demonstrating that moderate nonlinearity adds predictive value.

A single pruned decision tree performs noticeably worse (AUC ≈ 0.84), reflecting its high variance and limited ability to generalize from a single hierarchical structure. However, once trees are aggregated into ensembles, performance improves substantially. Bagging raises the validation AUC to approximately 0.90, and gradient boosting further increases it to about 0.923, showing how iterative error correction and ensemble averaging dramatically improve classification quality.

BART also performs strongly (AUC ≈ 0.909), offering the added benefit of probabilistic predictions and uncertainty quantification, although its validation accuracy is slightly lower than boosting-based methods. Among all approaches tested, XGBoost achieves the highest validation AUC (≈ 0.926), indicating the best overall discrimination between survivors and non-survivors. This result is consistent with XGBoost’s ability to model nonlinear effects, interactions, and threshold-based decision rules while remaining well-regularized through subsampling and early stopping.

Based on both quantitative performance and modeling robustness, XGBoost is the strongest overall model for the Titanic dataset. It delivers the highest validation AUC, strong accuracy, and stable generalization behavior, outperforming linear models, single trees, and even other ensemble methods. While models such as lasso logistic regression and GAMs offer greater interpretability, they consistently underperform XGBoost in predictive accuracy.

knitr::include_graphics("C:/Users/miada/OneDrive/Pictures/Screenshots/Screenshots 1/submission_xgboost.png")

My image

Therefore, I selected XGBoost as the final model for Kaggle submission, as it best captures the complex, nonlinear, and interaction-driven structure underlying passenger survival on the Titanic while maintaining excellent out-of-sample performance.

Evaluation of the Project

On the modeling side, I would focus on more thorough hyperparameter tuning, especially for XGBoost. Using methods like randomized search or Bayesian optimization could help find better combinations of tree depth, learning rate, regularization, and subsampling parameters. I could also explore adjusting the classification threshold instead of always using 0.5, particularly if the goal is to reduce false negatives. Finally, combining multiple models through ensembling or stacking, such as blending logistic regression, GAMs, and tree-based models may further improve performance while balancing accuracy and interpretability. Overall, these steps would help produce a more robust and well-generalized survival prediction model.