Would you survive the Titanic?

Jun 1, 2022

Introduction

The following analysis is for my Google Data Analytics Capstone Project from Coursera. For this project I will be using the dataset from the “Titanic - Machine Learning from Disaster” competition hosted on Kaggle.

This Capstone Project will incorporate all the steps I’ve learned throughout the Google Data Analytics Course in order to clean, sort, analyze, and visualize data to answer key questions using some elements from BigQuery, Excel, SQL, R, and Tableau.

Important things to know:

Dataset is a subset of data.
Dataset may be biased.
Analysis can be done with one or more programs. The main programs I am incorporating are Spreadsheets and RStudio Cloud.
Analysis and visualizations are for entertainment and personal educational purposes only. Beginner-friendly.
This project analysis is meant to document the steps I have taken to clean, sort, and visualize the given dataset to find relationships and identify patterns.

I — Data Cleaning using Google Spreadsheets.

II — Data Visualization with RStudio Cloud.

III — Final Analysis and Thoughts.

The Six Steps of Data Analysis

The following is a guide map for the Data Analysis Process.

Source: Coursera — Google Data Analytics Certification

Ask:

Based on the Titanic dataset we are looking to answer the following questions:

Who is likely to survive the Titanic?
Do class, age, and/or gender have any effect on survivability?

Prepare, Process, and Analyze:

The dataset can be found on Kaggle, in the Data section of the competition. There are three downloadable files. We will be using the dataset “train.csv” to complete our process and analysis.

The dataset has a total of 891 rows (one per passenger) and 12 columns.
The variables that we are most interested in are: “Survived”, “Pclass”, “Sex”, “Age”, and “Fare”.
One of the first steps in data analysis is to clean the data. Here we make sure that the data does not have any null values or duplicates. We will be using Google Spreadsheets to complete this.

Determine data types and unique values.

Google Spreadsheet Formula(s) Used:

=UNIQUE(range) to list the unique values in a column.
Conditional formatting to locate blank values.

Survived — Structured, Numeric, Unique Values: “0 = Did not Survive, 1 = Survived”, No Null/Blank Values.
Pclass — Structured, Numeric, Unique Values: “1 = 1st Class, 2 = 2nd Class, 3 = 3rd Class”, No Null/Blank Values.
Sex — Structured, String, Unique Values: “Female, Male”, no Null/Blank values.
Age — Structured, Numeric, Continuous Values, Contains Blank Values (Needs cleaning).
Fare — Structured, Numeric, Continuous Values (Needs cleaning).
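
For readers who would rather run this check in R than in a spreadsheet, here is a minimal sketch of the same idea. It is not part of the workflow I followed. The data frame sample_rows and its values are made up for illustration and are not rows from train.csv; only the column names match the dataset. It uses base R so it runs without extra packages.

```r
# Made-up rows that mimic the five columns of interest (not real passenger data).
sample_rows <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, NA, 35),
  Fare     = c(7.25, 71.2833, 7.925, 13),
  stringsAsFactors = FALSE
)

# Rough equivalent of =UNIQUE(range): distinct values in each categorical column.
unique_values <- lapply(sample_rows[c("Survived", "Pclass", "Sex")], unique)

# Rough equivalent of the conditional-format check: blanks (NA) per column.
blank_counts <- colSums(is.na(sample_rows))

# Duplicate check: number of rows that exactly repeat an earlier row.
duplicate_rows <- sum(duplicated(sample_rows))

print(unique_values)
print(blank_counts)
print(duplicate_rows)
```

On the real file, the same three calls would be applied to the data frame read from train.csv instead of sample_rows.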

Copy and paste selected columns into a new sheet.

*Note: Bigger datasets may require a more robust program like SQL to clean the data.

Round up values in AGE and FARE:

Round the values in AGE and FARE to whole numbers.
AGE: We will add a new column called new_age.
Using the function =ROUNDUP(cell), we round each age up to the next whole number. This also places a “0” wherever AGE is blank.
FARE: We will add a new column called new_fare.
Using the function =ROUND(cell, [places]), we round each fare to the nearest whole number, with the first decimal deciding whether it goes up or down.
Select all converted values and use Paste special (values only) to paste them into a new column.
Download sheet as .csv file.
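
The same rounding could also be done in R. The sketch below is an alternative, not what I did in the spreadsheet, and the ages and fares in it are made up for illustration. One difference is that ceiling() leaves a blank as NA instead of turning it into 0.

```r
# Made-up ages and fares for illustration; NA stands in for a blank cell.
age  <- c(0.42, 22, 28.5, NA)
fare <- c(7.25, 71.2833, 8.05, 13)

# ceiling() mirrors =ROUNDUP(cell) for positive numbers,
# but keeps a blank as NA rather than converting it to 0.
new_age <- ceiling(age)

# round() mirrors =ROUND(cell, 0) for these values.
new_fare <- round(fare)

print(new_age)
print(new_fare)
```

One caveat: for values that end in exactly .5, R's round() can round to the even neighbour (round(2.5) gives 2), while ROUND in a spreadsheet gives 3, so the two tools may disagree on such fares.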

Analyzing and Visualizing Data in RStudio Cloud

I used RStudio Cloud for additional cleaning and visualization of the dataset. I installed the following packages before starting:

## Install the packages below before starting your analysis and 
## visualization.
## install.packages("ggplot2")
## install.packages("tidyverse")

## Load all packages
library(ggplot2)
library(dplyr)
library(tidyverse)

As I viewed the dataset, I realized that the “Survived” column would work better if the values were categorical instead of numeric.

## Change "Survived" column from numeric to categorical for better visualization
titanic2$Survived[titanic2$Survived == "0"] <- "Did Not Survive"
titanic2$Survived[titanic2$Survived == "1"] <- "Survived"
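
A possible alternative, sketched here rather than used in this project, is to recode the column in one step with factor(). The vector survived_code below is made up for illustration. A factor also fixes the order of the categories, which keeps the legend order stable across charts.

```r
# Made-up 0/1 codes for illustration (not taken from the dataset).
survived_code <- c(0, 1, 1, 0, 1)

# Map each numeric code to a readable label in a single call.
survived_label <- factor(
  survived_code,
  levels = c(0, 1),
  labels = c("Did Not Survive", "Survived")
)

print(table(survived_label))
```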

Visualize Survivors Based on Gender:

# Survivors by Gender
## Visualization of Survivors by Gender using a Stacked Chart
titanic2 %>%
  filter(Survived == "Did Not Survive" | Survived == "Survived") %>%
  drop_na(Sex) %>%  
  ggplot(aes(x = Sex, fill = Survived)) +
  geom_bar(alpha = 0.5) + 
  theme_bw() +
  theme(panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank()) +
  labs(title = "Survivors by Gender", 
       x = "Gender",
       y = "Count")

Based on this graph we can conclude that females have a higher survival rate than males.
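
A stacked count chart shows group sizes, so it helps to back the conclusion with actual rates. The sketch below shows one way to compute a survival rate per group in base R. The data frame toy is made up for illustration, so the numbers it prints say nothing about the real passengers.

```r
# Made-up passengers for illustration (not rows from train.csv).
toy <- data.frame(
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "male"),
  Survived = c("Survived", "Survived", "Did Not Survive",
               "Did Not Survive", "Did Not Survive", "Survived",
               "Did Not Survive", "Did Not Survive")
)

# Counts per group: the same numbers a stacked bar chart draws.
counts <- table(toy$Sex, toy$Survived)

# Row-wise proportions: each row sums to 1, so groups of
# different sizes can be compared directly.
rates <- prop.table(counts, margin = 1)

print(round(rates, 2))
```

Swapping toy$Sex for a class column would give the same kind of table per class.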

Visualize Survivors Based on Class:

The “Pclass” column is made up of the numeric values “1”, “2”, and “3”. Let’s change them to categorical values to reflect their class.

## Change numeric values in "Pclass" to categorical values
titanic2$Pclass[titanic2$Pclass == "1"] <- "First Class"
titanic2$Pclass[titanic2$Pclass == "2"] <- "Second Class"
titanic2$Pclass[titanic2$Pclass == "3"] <- "Third Class"

Let’s add our visualization code:

## Visualization of Survivors by Class using a Stacked Chart
titanic2 %>%
  ## This filter is not necessary, but I like to see the variables I am
  ## working with for clarity.
  filter(Pclass == "First Class" | Pclass == "Second Class" |
         Pclass == "Third Class") %>%
  drop_na(Pclass) %>%  
  ggplot(aes(x = Pclass, fill = Survived)) +
  geom_bar(alpha = 0.5) + 
  theme_bw() +
  theme(panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank()) +
  labs(title = "Survivors by Class", 
       x = "Class",
       y = "Count")

Based on this chart, we can see that there is a higher percentage of people in First Class who survived compared to Third Class. Although the number of survivors from each Class is similar, there are far more people in Third Class who did not survive. Sucks to be Third, huh?

Visualize Survivors Based on Age:

In the new_age column, I noticed “0” values left over from our initial rounding in Spreadsheets. Because ROUNDUP turns any real age, even one below 1, into at least 1, every “0” comes from a blank cell. Since there is no way to determine the real age of these individuals, we can replace all “0” values with NA, which we can drop later in our visualization.

## Replace all "0" values in "new_age" with NA
titanic2$new_age[titanic2$new_age == 0] <- NA

## Visualization of Survivors by Age Group
titanic2 %>%
  drop_na(new_age) %>%  
  ggplot(aes(x = new_age, fill = Survived)) +
  geom_bar(alpha = 0.5) + 
  scale_x_continuous(breaks = seq(10, 90, by=10)) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank()) +
  labs(title = "Survivors by Age", 
       x = "Age",
       y = "Count")
## This graph doesn't show the percentage of people who survived 
## in each age group

Based on this graph, it is clear that the majority of passengers were between the ages of 20–40. However, this does not clearly show a relationship between Age and Survival Rate. Let’s try another visualization using percentages for each age group.

## The visualization below shows the percentage of each age group that
## survived or didn't survive.
## binwidth is a histogram parameter, not an aesthetic, so it is set
## in geom_histogram() only.
titanic2 %>%
  drop_na(new_age) %>% 
  ggplot(aes(x = new_age)) +
  geom_histogram(aes(fill = Survived), position = 'fill', binwidth = 10) +
  scale_x_continuous(breaks = seq(0, 90, by=10)) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),  
        panel.grid.minor = element_blank()) +
  labs(title = "Percentage of Survivors by Age", 
       x = "Age",
       y = "Percentage")

Here we see the percentage of people in each age group who survived. Based on this graph, the percentage of survivors is roughly similar across most age groups. Note: The group in their 80s contains only one passenger, who did survive.
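
To read the percentages as numbers instead of estimating them from the bars, the ages can be grouped into 10-year bins and summarised. This is a sketch under stated assumptions: the data frame toy is made up for illustration, and the bins [0,10), [10,20), and so on are my own choice, which need not match the bin edges that geom_histogram picks.

```r
# Made-up ages and outcomes for illustration (not real passengers).
toy <- data.frame(
  new_age  = c(4, 9, 23, 27, 35, 38, 80),
  survived = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Assign each age to a 10-year bin: [0,10), [10,20), ..., [80,90).
toy$age_group <- cut(toy$new_age, breaks = seq(0, 90, by = 10), right = FALSE)

# Passengers per bin, and the share of survivors per bin.
# Empty bins get a rate of NA.
group_size <- table(toy$age_group)
group_rate <- tapply(toy$survived, toy$age_group, mean)

print(data.frame(n = as.vector(group_size),
                 rate = as.vector(group_rate),
                 row.names = names(group_size)))
```

Printing the group size next to the rate makes it obvious when a high percentage rests on a single passenger.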

So does age really matter? And does the phrase “Women and children first” hold true? Well, according to this graph, children from ages 1–10 have the highest percentage of survival. Also, the person in their 80s could’ve slipped into a lifeboat. Let’s get more details on our mysterious 80-year-old.

Mysterious 80 Year Old:

We want to find more details about our 80-year-old survivor. Let’s incorporate the new_fare column into our analysis and visualization.

# Analyzing the mysterious 80 year old. Visualization of Fare, Age, and Sex
ggplot(titanic2) +
  geom_point(mapping = aes(x = new_fare, y = new_age, color = Pclass, 
                           shape = Sex)) +
  theme_bw() +
  theme(panel.grid.major = element_blank(),  
          panel.grid.minor = element_blank()) +
  labs(title = "Comparison of Fare, Age, and Class", 
        x = "Fare",
        y = "Age")

From this graph, we notice that First Class fares are mostly high, while Second Class and Third Class fares sit below $100. The point we are most interested in is the individual who is 80 years old. We now know that he is male and also in First Class. He must be a very important person.

Let’s see who this person is.

Since we know he is the only person who is 80, we can simply sort the new_age column in descending order. Voilà! The mystery guy is Barkworth, Mr. Algernon Henry Wilson, occupation: Justice of the Peace. Whoa!
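
In R, that lookup could be sketched as below. The data frame toy and the placeholder names in it are made up for illustration; on the real data the same sort would be applied to titanic2.

```r
# Made-up rows for illustration; the names are placeholders, not real passengers.
toy <- data.frame(
  Name    = c("Passenger A", "Passenger B", "Passenger C"),
  new_age = c(35, 80, NA),
  Pclass  = c("Third Class", "First Class", "Second Class")
)

# Sort by age, oldest first; rows with a missing age go to the end.
by_age <- toy[order(toy$new_age, decreasing = TRUE, na.last = TRUE), ]

# The first row is the oldest passenger.
oldest <- by_age[1, ]
print(oldest)
```

With dplyr loaded, titanic2 %>% arrange(desc(new_age)) performs the same kind of sort.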

Final Analysis and Thoughts

Let’s answer our initial question: who is most likely to survive the Titanic?

Based on our findings, we can conclude that if you are from First Class, a Female, and between the ages of 1–40, you will have the highest chance of survival.

Our second question: Do class, age, and/or gender have any effect on survivability?

Class — Yes
Gender — Yes
Age — Not Likely

To conclude, I really enjoyed working with this dataset because “Titanic” was the first movie I watched as a child when I first immigrated from China.

This project really helped me put what I have learned from the course into practice. It may not be perfect, but I thoroughly enjoyed the process of defining the questions and answering them with data manipulation, organization, and visualization.

Final words? To all the Data Analysts, Engineers, Scientists, and Coders, I am open to learning and improving my work. I’d love to hear from you if you have any resources that have been helpful to you or just want to chat!

(Article originally published on LinkedIn on February 25, 2022)