Introduction

Disclaimer

This is a toy example based on a dataset and some questions posed by one course attendant. WHile the process is similar to what one would do in a real case it has been simplified to make it easier to follow, so that the final results of the analysis should not be taken as definitive

The problem

A phosphoproteomics experiment has been performed as described in figure 1.

A phosphoproteomics experiment

A phosphoproteomics experiment

  • The experiment has analyzed (3 + 3) PDX models of two different subtypes using Phosphopeptide enriched samples.
  • LC-MS analysis of 2 technical duplicates
  • The results set consisted of Normalized abundances of MS signals for ca. 1400 phosphopeptides
  • Goal of the analysis: **search phosphopeptides that allow differentiation of the two tumor groups*
  • This should be made with both Statistical Analysis and visualization.
  • Data have been provided as an excel file: TIO2+PTYR-human-MSS+MSIvsPD.XLSX
  • Groups are defined as:
    • MSS group: Samples M1, M5 and T49,
    • PD group: Samples M42, M43 and M64 with two technical replicates for each sample
  • The first column, “SequenceModification” contains abundance values for the distinct phosphopetides. Other columns can be omitted.

Data exploration

Reading data

Start reading data. Remember spreadsheets from the same workbook can be read separately.

require(readxl)
phData <- read_excel(path= "TIO2+PTYR-human-MSS+MSIvsPD.XLSX", sheet=1)
head(phData)
## # A tibble: 6 x 18
##   SequenceModific… Accession Description Score M1_1_MSS M1_2_MSS M5_1_MSS M5_2_MSS T49_1_MSS
##   <chr>            <chr>     <chr>       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
## 1 LYPELSQYMGLSLNE… O00560    Syntenin-1…  48.1     24.3   44476.       0     6269.     1136.
## 2 VDKVIQAQTAFSANP… O00560    Syntenin-1…  67.0      0     43139.    2102.   50355.      249.
## 3 VIQAQTAFSANPANP… O00560    Syntenin-1…  77.7   3413.   172143.   77323.  307637.    98442.
## 4 HADAEMTGYVVTR[6… O15264    Mitogen-ac…  44.9 220431.   145657.  104288.   75887.   773377.
## 5 HADAEMTGYVVTR[9… O15264    Mitogen-ac…  67.4  18255.     8530.   35956.   44102.    57145.
## 6 STGPGASLGTGYDR[… O15551    Claudin-3 …  63.7 644513.   261938.  187023.  124868.  4487444.
## # ... with 9 more variables: T49_2_MSS <dbl>, M42_1_PD <dbl>, M42_2_PD <dbl>, M43_1_PD <dbl>,
## #   M43_2_PD <dbl>, M64_1_PD <dbl>, M64_2_PD <dbl>, CLASS <chr>, PHOSPHO <chr>
targets <- read_excel(path= "TIO2+PTYR-human-MSS+MSIvsPD.XLSX", sheet=2)
show(targets)
## # A tibble: 12 x 4
##    Sample Sample__1 Individual Phenotype
##    <chr>  <chr>          <dbl> <chr>    
##  1 M1_1   M1                 1 MSS      
##  2 M1_2   M1                 1 MSS      
##  3 M5_1   M5                 2 MSS      
##  4 M5_2   M5                 2 MSS      
##  5 T49_1  T49                3 MSS      
##  6 T49_2  T49                3 MSS      
##  7 M42_1  M42                4 PD       
##  8 M42_2  M42                4 PD       
##  9 M43_1  M43                5 PD       
## 10 M43_2  M43                5 PD       
## 11 M64_1  M64                6 PD       
## 12 M64_2  M64                6 PD

Data visualization

A first look at file shows there are both numerical and character data. For ease of use we separate numerical values into a data matrix.

library(tidyverse)
abundances <- phData %>% select (5:16) 
head(abundances)
## # A tibble: 6 x 12
##   M1_1_MSS M1_2_MSS M5_1_MSS M5_2_MSS T49_1_MSS T49_2_MSS M42_1_PD M42_2_PD M43_1_PD M43_2_PD
##      <dbl>    <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1     24.3   44476.       0     6269.     1136.    21934.       0        0      773.    2137.
## 2      0     43139.    2102.   50355.      249.     3239.    1316.       0        0        0 
## 3   3413.   172143.   77323.  307637.    98442.   192982.   24851.   16548.    5565.       0 
## 4 220431.   145657.  104288.   75887.   773377.   481166. 1027196. 1163747. 4080239. 4885818.
## 5  18255.     8530.   35956.   44102.    57145.    34638.   21231.   49500.  666107.  379314.
## 6 644513.   261938.  187023.  124868.  4487444.  2572575.  535809.  434646.   91362.   65998.
## # ... with 2 more variables: M64_1_PD <dbl>, M64_2_PD <dbl>

Usually it is convenient that rows can be identified by short-but-meaningful names. One may want to use column “accession” for this goal, but it happens not be usable because there are duplicates in this column and rownames cannot have duplicates in R.

This can be solved by cretaing new names with the make.names function.

newRownames <- make.names(phData$Accession, unique=TRUE)
abundances <- as.data.frame(abundances)
rownames(abundances) <- newRownames
head(abundances)
##              M1_1_MSS   M1_2_MSS   M5_1_MSS   M5_2_MSS    T49_1_MSS  T49_2_MSS    M42_1_PD
## O00560       24.29438  44475.964      0.000   6269.141    1135.8169   21933.90       0.000
## O00560.1      0.00000  43138.904   2102.056  50355.051     248.9275    3239.16    1315.904
## O00560.2   3412.60332 172143.040  77323.019 307637.429   98442.2773  192982.37   24851.344
## O15264   220431.17880 145656.887 104287.815  75887.365  773377.4981  481165.54 1027196.292
## O15264.1  18254.77813   8529.755  35955.901  44102.316   57145.1682   34638.01   21231.256
## O15551   644513.31840 261938.025 187023.484 124867.715 4487443.6920 2572575.27  535809.187
##            M42_2_PD     M43_1_PD    M43_2_PD    M64_1_PD     M64_2_PD
## O00560         0.00     772.9056    2136.746    1820.724    1727.9098
## O00560.1       0.00       0.0000       0.000       0.000     892.3565
## O00560.2   16547.95    5565.2821       0.000    3264.563    5901.9577
## O15264   1163747.38 4080239.1820 4885818.113 3093786.793 2759104.5440
## O15264.1   49499.70  666107.0448  379313.615  255792.117  579765.0018
## O15551    434645.89   91361.8781   65997.913  243250.439  206632.6444

Plotting data

It is clear that abundance vary in a wide numeric range:

summary(abundances)
##     M1_1_MSS           M1_2_MSS           M5_1_MSS           M5_2_MSS          T49_1_MSS       
##  Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:    5653   1st Qu.:    5497   1st Qu.:    2573   1st Qu.:    3273   1st Qu.:    9306  
##  Median :   30682   Median :   26980   Median :   20801   Median :   26241   Median :   55641  
##  Mean   :  229841   Mean   :  253151   Mean   :  232967   Mean   :  261067   Mean   :  542449  
##  3rd Qu.:  117373   3rd Qu.:  113004   3rd Qu.:  113958   3rd Qu.:  130132   3rd Qu.:  223103  
##  Max.   :16719906   Max.   :43928481   Max.   :15135169   Max.   :19631820   Max.   :49218872  
##    T49_2_MSS           M42_1_PD           M42_2_PD           M43_1_PD           M43_2_PD       
##  Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0   Min.   :       0  
##  1st Qu.:    8611   1st Qu.:    5341   1st Qu.:    4216   1st Qu.:   19641   1st Qu.:   17299  
##  Median :   46110   Median :   36854   Median :   30533   Median :   67945   Median :   59607  
##  Mean   :  462616   Mean   :  388424   Mean   :  333587   Mean   :  349020   Mean   :  358822  
##  3rd Qu.:  189141   3rd Qu.:  180252   3rd Qu.:  152088   3rd Qu.:  205471   3rd Qu.:  201924  
##  Max.   :29240206   Max.   :48177680   Max.   :42558111   Max.   :35049402   Max.   :63082982  
##     M64_1_PD           M64_2_PD       
##  Min.   :       0   Min.   :       0  
##  1st Qu.:   11038   1st Qu.:    8660  
##  Median :   52249   Median :   47330  
##  Mean   :  470655   Mean   :  484712  
##  3rd Qu.:  209896   3rd Qu.:  206036  
##  Max.   :71750330   Max.   :88912734
library(ggplot2)
ggplot(abundances)+geom_histogram(aes(x=M1_1_MSS),bins=20)

This is better seen with a multiple boxplot, that is one boxplot per column.

Boxplots in “base R”

Multiple boxplots are straightforward in base R. We only need to provide data as a `data.frame

boxplot(abundances)

The plot suggests that it may be useful to transform the data taking logarithms. A quick-and-dirty approach is to take logarithms of the data to which we have added 1 to avoid problems with zeroes.

boxplot(log10(abundances+1), las=2, main="Phosphoproteomics Experiment. Abundance in log 10 scale")

Boxplots in ggplot2

In ggplot2 a boxplot is seen as a plot that relates one continuous and one categorical variable, so that data has to be reshaped into this type of structure (at least one numerical variable and one or more categorical variables) before the plot can be drawn.

This can easily be done with dplyr’s gather function.

logDat <- abundances %>% 
  gather() %>%
  mutate (logvalues= log (value+1)) %>%
  select (logvalues, key)
head(logDat)
##   logvalues      key
## 1  3.230582 M1_1_MSS
## 2  0.000000 M1_1_MSS
## 3  8.135524 M1_1_MSS
## 4 12.303345 M1_1_MSS
## 5  9.812237 M1_1_MSS
## 6 13.376252 M1_1_MSS

Once we have the data in the required structure we can plot it.

library(ggplot2)
ggplot(logDat) + 
  geom_boxplot(aes(x = key, y = logvalues))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Phosphoproteomics Abundances (log 10 scale)")

Notice that the boxplots are drawn based on the alphabetic ordering of the categorical variable and this may not be how we expected to see them!

This can be changed in different ways. Either using a characteristic of the groups to define the order (e.g. order by group median) or simply creating a new factor variable where the desired ordering is imposed.

originalKey <- factor(logDat$key, levels=colnames(abundances))
ggplot(logDat) + 
  geom_boxplot(aes(x = originalKey, y = logvalues))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Phosphoproteomics Abundances (log 10 scale)")

Creating covariates from the labels

In order to use the information on groups (“covariates”) we must extract it from the column names. (Notice that they have been prepared specifically to contain this information!!!)

library(stringr)
covs <- str_split(logDat$key, "_", simplify=TRUE)
colnames(covs)<- c("Sample", "Replicate", "Group")
logDat2 <- cbind(logDat,covs)

This can be used for showing the groups in the plots

ggplot(logDat2) + 
  geom_boxplot(aes(x = originalKey, y = logvalues, fill=Group, colour=Replicate))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Phosphoproteomics Abundances (log 10 scale)")

Multivariate visualization: Principal components analysis

A nice way to plot a dataset is to do a Principal Components Analysis and then plot the first components.

For simplicity we use a function that has already been prepared and encapsulated do draw a scatterplot from the results of a PCA. It is available from github.

source("https://raw.githubusercontent.com/uebvhir/UEB_PCA/master/UEB_plotPCA3.R")
plotPCA3(datos=as.matrix(log10(abundances+1)), labels=colnames(abundances), 
         factor=targets$Phenotype,title ="Phosphoproteomic data",
         scale=FALSE, colores=1:2, size = 3.5, glineas = 2.5)

The plot suggests that MSS and PD groups are differnt but the PD group seems more heterogeneous than the MSS one.

Statistical Analysis

The analysis to select differentially abundant proteins will be based on a package called limma used mainly in microarrays. This package is part of Bioconductor (http://bioconductor.org) a giantic project created to develop open source software for omics data analysis based on R.

Packages from Bioconductor need to be installed with a specific function.

if (!(require(limma))){
  source("http://bioconductor.org/biocLite.R")
  biocLite("limma")
}

Limma provides a sophisticated approach to omics data analysis based on the theory of the linear model. For simplicity we only describe the main ideas of the approach. Much more information is available in the user’s guide.

The limma approach consists of creating

  1. a “design matrix” that describes assignment of samples to groups.
  2. a “contrast” matrix that describes the comparisons to be performed

These two matrices are used in a set of calls to provide a list of potentially differentially expressed features.

library(limma)
targets <- as.data.frame(targets)
groups <- as.factor(targets$Phenotype)
designMat <- model.matrix(~ -1 + groups)
show(designMat)
##    groupsMSS groupsPD
## 1          1        0
## 2          1        0
## 3          1        0
## 4          1        0
## 5          1        0
## 6          1        0
## 7          0        1
## 8          0        1
## 9          0        1
## 10         0        1
## 11         0        1
## 12         0        1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$groups
## [1] "contr.treatment"

Technical replicates can be handled using a function called duplicatecorrelationthat computes the mean correlation between replicates. This correlation is used then in the model fitting process.

if (!require(statmod)) install.packages("statmod")
dupcor <- duplicateCorrelation(abundances, designMat,block=targets$Individual)
dupcor$consensus.correlation
## [1] 0.8770936

Finally a “contrasta matrix” is prepared to describe the comparisons

require(limma)
contMat <- makeContrasts(mainEff=groupsPD-groupsMSS, levels=designMat)
show(contMat)
##            Contrasts
## Levels      mainEff
##   groupsMSS      -1
##   groupsPD        1

Once the design and the contrast matrix are built we proceed as follows:

fit <- lmFit(abundances, designMat, block=targets$Individual,correlation=dupcor$consensus)
fit2 <- contrasts.fit(fit, contMat)
fit2 <- eBayes(fit2)
results<- topTableF(fit2, adjust="BH",  number=nrow(abundances))
head(results)
##             mainEff   AveExpr        F      P.Value adj.P.Val
## O00193.1  802048.92 641359.81 34.87586 0.0001157082 0.1663884
## P35269.1 -435826.87 412752.57 23.39308 0.0005694049 0.2040053
## P16144.3  313748.10 270927.30 22.75343 0.0006322067 0.2040053
## Q5T0Z8.1  225849.08 176503.85 21.01902 0.0008488023 0.2040053
## Q9H1C7    677600.13 446716.73 20.98220 0.0008542838 0.2040053
## P84103     27775.59  18777.87 20.70335 0.0008971889 0.2040053

The results of the analysis is a table where features, here proteins, are ordered from most to least differentially expressed according the results of the comparison performed.

The results can be shown in a volcano Plot that plots the biological (fold change) vs statistical (-log (p-value)) effect.

volcanoplot(fit2, highlight=10, names=rownames(abundances), cex=0.75,
            xlim=c(-1e+06,1e+06))

The dispersion of the points suggests that the data are too variable so it may be good to review normalization and try to apply some type of dat scaling that reduces variability while keeping interesting differences.