This is a toy example based on a dataset and some questions posed by one course attendant. WHile the process is similar to what one would do in a real case it has been simplified to make it easier to follow, so that the final results of the analysis should not be taken as definitive
A phosphoproteomics experiment has been performed as described in figure 1.
TIO2+PTYR-human-MSS+MSIvsPD.XLSX
Start reading data. Remember spreadsheets from the same workbook can be read separately.
require(readxl)
phData <- read_excel(path= "TIO2+PTYR-human-MSS+MSIvsPD.XLSX", sheet=1)
head(phData)
## # A tibble: 6 x 18
## SequenceModific… Accession Description Score M1_1_MSS M1_2_MSS M5_1_MSS M5_2_MSS T49_1_MSS
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 LYPELSQYMGLSLNE… O00560 Syntenin-1… 48.1 24.3 44476. 0 6269. 1136.
## 2 VDKVIQAQTAFSANP… O00560 Syntenin-1… 67.0 0 43139. 2102. 50355. 249.
## 3 VIQAQTAFSANPANP… O00560 Syntenin-1… 77.7 3413. 172143. 77323. 307637. 98442.
## 4 HADAEMTGYVVTR[6… O15264 Mitogen-ac… 44.9 220431. 145657. 104288. 75887. 773377.
## 5 HADAEMTGYVVTR[9… O15264 Mitogen-ac… 67.4 18255. 8530. 35956. 44102. 57145.
## 6 STGPGASLGTGYDR[… O15551 Claudin-3 … 63.7 644513. 261938. 187023. 124868. 4487444.
## # ... with 9 more variables: T49_2_MSS <dbl>, M42_1_PD <dbl>, M42_2_PD <dbl>, M43_1_PD <dbl>,
## # M43_2_PD <dbl>, M64_1_PD <dbl>, M64_2_PD <dbl>, CLASS <chr>, PHOSPHO <chr>
targets <- read_excel(path= "TIO2+PTYR-human-MSS+MSIvsPD.XLSX", sheet=2)
show(targets)
## # A tibble: 12 x 4
## Sample Sample__1 Individual Phenotype
## <chr> <chr> <dbl> <chr>
## 1 M1_1 M1 1 MSS
## 2 M1_2 M1 1 MSS
## 3 M5_1 M5 2 MSS
## 4 M5_2 M5 2 MSS
## 5 T49_1 T49 3 MSS
## 6 T49_2 T49 3 MSS
## 7 M42_1 M42 4 PD
## 8 M42_2 M42 4 PD
## 9 M43_1 M43 5 PD
## 10 M43_2 M43 5 PD
## 11 M64_1 M64 6 PD
## 12 M64_2 M64 6 PD
A first look at file shows there are both numerical and character data. For ease of use we separate numerical values into a data matrix.
library(tidyverse)
abundances <- phData %>% select (5:16)
head(abundances)
## # A tibble: 6 x 12
## M1_1_MSS M1_2_MSS M5_1_MSS M5_2_MSS T49_1_MSS T49_2_MSS M42_1_PD M42_2_PD M43_1_PD M43_2_PD
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24.3 44476. 0 6269. 1136. 21934. 0 0 773. 2137.
## 2 0 43139. 2102. 50355. 249. 3239. 1316. 0 0 0
## 3 3413. 172143. 77323. 307637. 98442. 192982. 24851. 16548. 5565. 0
## 4 220431. 145657. 104288. 75887. 773377. 481166. 1027196. 1163747. 4080239. 4885818.
## 5 18255. 8530. 35956. 44102. 57145. 34638. 21231. 49500. 666107. 379314.
## 6 644513. 261938. 187023. 124868. 4487444. 2572575. 535809. 434646. 91362. 65998.
## # ... with 2 more variables: M64_1_PD <dbl>, M64_2_PD <dbl>
Usually it is convenient that rows can be identified by short-but-meaningful names. One may want to use column “accession” for this goal, but it happens not be usable because there are duplicates in this column and rownames cannot have duplicates in R.
This can be solved by cretaing new names with the make.names
function.
newRownames <- make.names(phData$Accession, unique=TRUE)
abundances <- as.data.frame(abundances)
rownames(abundances) <- newRownames
head(abundances)
## M1_1_MSS M1_2_MSS M5_1_MSS M5_2_MSS T49_1_MSS T49_2_MSS M42_1_PD
## O00560 24.29438 44475.964 0.000 6269.141 1135.8169 21933.90 0.000
## O00560.1 0.00000 43138.904 2102.056 50355.051 248.9275 3239.16 1315.904
## O00560.2 3412.60332 172143.040 77323.019 307637.429 98442.2773 192982.37 24851.344
## O15264 220431.17880 145656.887 104287.815 75887.365 773377.4981 481165.54 1027196.292
## O15264.1 18254.77813 8529.755 35955.901 44102.316 57145.1682 34638.01 21231.256
## O15551 644513.31840 261938.025 187023.484 124867.715 4487443.6920 2572575.27 535809.187
## M42_2_PD M43_1_PD M43_2_PD M64_1_PD M64_2_PD
## O00560 0.00 772.9056 2136.746 1820.724 1727.9098
## O00560.1 0.00 0.0000 0.000 0.000 892.3565
## O00560.2 16547.95 5565.2821 0.000 3264.563 5901.9577
## O15264 1163747.38 4080239.1820 4885818.113 3093786.793 2759104.5440
## O15264.1 49499.70 666107.0448 379313.615 255792.117 579765.0018
## O15551 434645.89 91361.8781 65997.913 243250.439 206632.6444
It is clear that abundance vary in a wide numeric range:
summary(abundances)
## M1_1_MSS M1_2_MSS M5_1_MSS M5_2_MSS T49_1_MSS
## Min. : 0 Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 5653 1st Qu.: 5497 1st Qu.: 2573 1st Qu.: 3273 1st Qu.: 9306
## Median : 30682 Median : 26980 Median : 20801 Median : 26241 Median : 55641
## Mean : 229841 Mean : 253151 Mean : 232967 Mean : 261067 Mean : 542449
## 3rd Qu.: 117373 3rd Qu.: 113004 3rd Qu.: 113958 3rd Qu.: 130132 3rd Qu.: 223103
## Max. :16719906 Max. :43928481 Max. :15135169 Max. :19631820 Max. :49218872
## T49_2_MSS M42_1_PD M42_2_PD M43_1_PD M43_2_PD
## Min. : 0 Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 8611 1st Qu.: 5341 1st Qu.: 4216 1st Qu.: 19641 1st Qu.: 17299
## Median : 46110 Median : 36854 Median : 30533 Median : 67945 Median : 59607
## Mean : 462616 Mean : 388424 Mean : 333587 Mean : 349020 Mean : 358822
## 3rd Qu.: 189141 3rd Qu.: 180252 3rd Qu.: 152088 3rd Qu.: 205471 3rd Qu.: 201924
## Max. :29240206 Max. :48177680 Max. :42558111 Max. :35049402 Max. :63082982
## M64_1_PD M64_2_PD
## Min. : 0 Min. : 0
## 1st Qu.: 11038 1st Qu.: 8660
## Median : 52249 Median : 47330
## Mean : 470655 Mean : 484712
## 3rd Qu.: 209896 3rd Qu.: 206036
## Max. :71750330 Max. :88912734
library(ggplot2)
ggplot(abundances)+geom_histogram(aes(x=M1_1_MSS),bins=20)
This is better seen with a multiple boxplot, that is one boxplot per column.
Multiple boxplots are straightforward in base R. We only need to provide data as a `data.frame
boxplot(abundances)
The plot suggests that it may be useful to transform the data taking logarithms. A quick-and-dirty approach is to take logarithms of the data to which we have added 1 to avoid problems with zeroes.
boxplot(log10(abundances+1), las=2, main="Phosphoproteomics Experiment. Abundance in log 10 scale")
ggplot2
In ggplot2 a boxplot is seen as a plot that relates one continuous and one categorical variable, so that data has to be reshaped into this type of structure (at least one numerical variable and one or more categorical variables) before the plot can be drawn.
This can easily be done with dplyr
’s gather
function.
logDat <- abundances %>%
gather() %>%
mutate (logvalues= log (value+1)) %>%
select (logvalues, key)
head(logDat)
## logvalues key
## 1 3.230582 M1_1_MSS
## 2 0.000000 M1_1_MSS
## 3 8.135524 M1_1_MSS
## 4 12.303345 M1_1_MSS
## 5 9.812237 M1_1_MSS
## 6 13.376252 M1_1_MSS
Once we have the data in the required structure we can plot it.
library(ggplot2)
ggplot(logDat) +
geom_boxplot(aes(x = key, y = logvalues))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Phosphoproteomics Abundances (log 10 scale)")
Notice that the boxplots are drawn based on the alphabetic ordering of the categorical variable and this may not be how we expected to see them!
This can be changed in different ways. Either using a characteristic of the groups to define the order (e.g. order by group median) or simply creating a new factor variable where the desired ordering is imposed.
originalKey <- factor(logDat$key, levels=colnames(abundances))
ggplot(logDat) +
geom_boxplot(aes(x = originalKey, y = logvalues))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Phosphoproteomics Abundances (log 10 scale)")
In order to use the information on groups (“covariates”) we must extract it from the column names. (Notice that they have been prepared specifically to contain this information!!!)
library(stringr)
covs <- str_split(logDat$key, "_", simplify=TRUE)
colnames(covs)<- c("Sample", "Replicate", "Group")
logDat2 <- cbind(logDat,covs)
This can be used for showing the groups in the plots
ggplot(logDat2) +
geom_boxplot(aes(x = originalKey, y = logvalues, fill=Group, colour=Replicate))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ggtitle("Phosphoproteomics Abundances (log 10 scale)")
A nice way to plot a dataset is to do a Principal Components Analysis and then plot the first components.
For simplicity we use a function that has already been prepared and encapsulated do draw a scatterplot from the results of a PCA. It is available from github.
source("https://raw.githubusercontent.com/uebvhir/UEB_PCA/master/UEB_plotPCA3.R")
plotPCA3(datos=as.matrix(log10(abundances+1)), labels=colnames(abundances),
factor=targets$Phenotype,title ="Phosphoproteomic data",
scale=FALSE, colores=1:2, size = 3.5, glineas = 2.5)
The plot suggests that MSS and PD groups are differnt but the PD group seems more heterogeneous than the MSS one.
The analysis to select differentially abundant proteins will be based on a package called limma
used mainly in microarrays. This package is part of Bioconductor (http://bioconductor.org) a giantic project created to develop open source software for omics data analysis based on R.
Packages from Bioconductor need to be installed with a specific function.
if (!(require(limma))){
source("http://bioconductor.org/biocLite.R")
biocLite("limma")
}
Limma provides a sophisticated approach to omics data analysis based on the theory of the linear model. For simplicity we only describe the main ideas of the approach. Much more information is available in the user’s guide.
The limma approach consists of creating
These two matrices are used in a set of calls to provide a list of potentially differentially expressed features.
library(limma)
targets <- as.data.frame(targets)
groups <- as.factor(targets$Phenotype)
designMat <- model.matrix(~ -1 + groups)
show(designMat)
## groupsMSS groupsPD
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 0 1
## 8 0 1
## 9 0 1
## 10 0 1
## 11 0 1
## 12 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$groups
## [1] "contr.treatment"
Technical replicates can be handled using a function called duplicatecorrelation
that computes the mean correlation between replicates. This correlation is used then in the model fitting process.
if (!require(statmod)) install.packages("statmod")
dupcor <- duplicateCorrelation(abundances, designMat,block=targets$Individual)
dupcor$consensus.correlation
## [1] 0.8770936
Finally a “contrasta matrix” is prepared to describe the comparisons
require(limma)
contMat <- makeContrasts(mainEff=groupsPD-groupsMSS, levels=designMat)
show(contMat)
## Contrasts
## Levels mainEff
## groupsMSS -1
## groupsPD 1
Once the design and the contrast matrix are built we proceed as follows:
fit <- lmFit(abundances, designMat, block=targets$Individual,correlation=dupcor$consensus)
fit2 <- contrasts.fit(fit, contMat)
fit2 <- eBayes(fit2)
results<- topTableF(fit2, adjust="BH", number=nrow(abundances))
head(results)
## mainEff AveExpr F P.Value adj.P.Val
## O00193.1 802048.92 641359.81 34.87586 0.0001157082 0.1663884
## P35269.1 -435826.87 412752.57 23.39308 0.0005694049 0.2040053
## P16144.3 313748.10 270927.30 22.75343 0.0006322067 0.2040053
## Q5T0Z8.1 225849.08 176503.85 21.01902 0.0008488023 0.2040053
## Q9H1C7 677600.13 446716.73 20.98220 0.0008542838 0.2040053
## P84103 27775.59 18777.87 20.70335 0.0008971889 0.2040053
The results of the analysis is a table where features, here proteins, are ordered from most to least differentially expressed according the results of the comparison performed.
The results can be shown in a volcano Plot that plots the biological (fold change) vs statistical (-log (p-value)) effect.
volcanoplot(fit2, highlight=10, names=rownames(abundances), cex=0.75,
xlim=c(-1e+06,1e+06))
The dispersion of the points suggests that the data are too variable so it may be good to review normalization and try to apply some type of dat scaling that reduces variability while keeping interesting differences.