Doing a PCA in R is easy: Just run the function prcomp()
on your matrix of scaled numeric predictor variables. There’s just one problem, however. The result is an object of class prcomp
that doesn’t fit nicely into the tidyverse framework, e.g. for visualization. While it’s reasonably easy to extract the relevant info with some base-R manipulations, I’ve never been happy with this approach. But now, I’ve realized that all the necessary functions to do this tidyverse-style are available in the broom package.
For our PCA example, we’ll need the packages tidyverse and broom. Note that as of this writing, we need the current development version of broom because of a bug in tidy.prcomp()
. We’ll also use the cowplot package for plot themes.
library(tidyverse)
# ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
# ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
# ✓ tibble 3.0.3 ✓ dplyr 1.0.2
# ✓ tidyr 1.1.2 ✓ stringr 1.4.0
# ✓ readr 1.3.1 ✓ forcats 0.5.0
# ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
# x dplyr::filter() masks stats::filter()
# x dplyr::lag() masks stats::lag()
library(broom) # devtools::install_github("tidymodels/broom")
library(cowplot)
We’ll be analyzing the biopsy
dataset, which comes originally from the MASS package. It is a breast cancer dataset from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumors for 699 patients; each of nine attributes was scored on a scale of 1 to 10. The true outcome (benign/malignant) is also known.
biopsy <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/biopsy.csv")
# Parsed with column specification:
# cols(
# clump_thickness = col_double(),
# uniform_cell_size = col_double(),
# uniform_cell_shape = col_double(),
# marg_adhesion = col_double(),
# epithelial_cell_size = col_double(),
# bare_nuclei = col_double(),
# bland_chromatin = col_double(),
# normal_nucleoli = col_double(),
# mitoses = col_double(),
# outcome = col_character()
# )
In general, when performing PCA, we’ll want to do (at least) three things:
- Look at the data in PC coordinates.
- Look at the rotation matrix.
- Look at the variance explained by each PC.
Let’s do these three things in turn.
Look at the data in PC coordinates
We start by running the PCA and storing the result in a variable pca_fit
. There are two issues to consider here. First, the prcomp()
function can only deal with numeric columns, so we need to remove all non-numeric columns from the data. This is straightforward using the where(is.numeric)
tidyselect construct. Second, we normally want to scale the data values to unit variance before PCA. We do so by using the argument scale = TRUE
in prcomp()
.
pca_fit <- biopsy %>%
select(where(is.numeric)) %>% # retain only numeric columns
prcomp(scale = TRUE) # do PCA on scaled data
As an alternative to scale = TRUE
, we could also have scaled the data by explicitly invoking the scale()
function.
pca_fit <- biopsy %>%
select(where(is.numeric)) %>% # retain only numeric columns
scale() %>% # scale data
prcomp() # do PCA
Now, we want to plot the data in PC coordinates. In general, this means combining the PC coordinates with the original dataset, so we can color points by categorical variables present in the original data but removed for the PCA. We do this with the augment()
function from broom, which takes as arguments the fitted model and the original data. The columns containing the fitted coordinates are called .fittedPC1
, .fittedPC2
, etc.
pca_fit %>%
augment(biopsy) %>% # add original dataset back in
ggplot(aes(.fittedPC1, .fittedPC2, color = outcome)) +
geom_point(size = 1.5) +
scale_color_manual(
values = c(malignant = "#D55E00", benign = "#0072B2")
) +
theme_half_open(12) + background_grid()
Look at the data in PC coordinates
Next, we plot the rotation matrix. The rotation matrix is stored as pca_fit$rotation
, but here we’ll extract it using the tidy()
function from broom. When applied to prcomp
objects, the tidy()
function takes an additional argument matrix
, which we set to matrix = "rotation"
to extract the rotation matrix.
# extract rotation matrix
pca_fit %>%
tidy(matrix = "rotation")
# # A tibble: 81 x 3
# column PC value
# <chr> <dbl> <dbl>
# 1 clump_thickness 1 -0.302
# 2 clump_thickness 2 -0.141
# 3 clump_thickness 3 0.866
# 4 clump_thickness 4 -0.108
# 5 clump_thickness 5 0.0803
# 6 clump_thickness 6 -0.243
# 7 clump_thickness 7 -0.00852
# 8 clump_thickness 8 0.248
# 9 clump_thickness 9 -0.00275
# 10 uniform_cell_size 1 -0.381
# # … with 71 more rows
Now in the context of a plot:
# define arrow style for plotting
arrow_style <- arrow(
angle = 20, ends = "first", type = "closed", length = grid::unit(8, "pt")
)
# plot rotation matrix
pca_fit %>%
tidy(matrix = "rotation") %>%
pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
ggplot(aes(PC1, PC2)) +
geom_segment(xend = 0, yend = 0, arrow = arrow_style) +
geom_text(
aes(label = column),
hjust = 1, nudge_x = -0.02,
color = "#904C2F"
) +
xlim(-1.25, .5) + ylim(-.5, 1) +
coord_fixed() + # fix aspect ratio to 1:1
theme_minimal_grid(12)
Look at the variance explained by each PC
Finally, we’ll plot the variance explained by each PC. We can again extract this information using the tidy()
function from broom, now by setting the matrix
argument to matrix = "eigenvalues"
.
pca_fit %>%
tidy(matrix = "eigenvalues")
# # A tibble: 9 x 4
# PC std.dev percent cumulative
# <dbl> <dbl> <dbl> <dbl>
# 1 1 2.43 0.656 0.656
# 2 2 0.881 0.0862 0.742
# 3 3 0.734 0.0599 0.802
# 4 4 0.678 0.0511 0.853
# 5 5 0.617 0.0422 0.895
# 6 6 0.549 0.0335 0.928
# 7 7 0.543 0.0327 0.961
# 8 8 0.511 0.0290 0.990
# 9 9 0.297 0.00982 1
Now in the context of a plot.
pca_fit %>%
tidy(matrix = "eigenvalues") %>%
ggplot(aes(PC, percent)) +
geom_col(fill = "#56B4E9", alpha = 0.8) +
scale_x_continuous(breaks = 1:9) +
scale_y_continuous(
labels = scales::percent_format(),
expand = expansion(mult = c(0, 0.01))
) +
theme_minimal_hgrid(12)
The first component captures 65% of the variation in the data and, as we can see from the first plot in this post, nicely separates the benign samples from the malignant samples.