Data manipulation
VHIR-UEB-for-019v.01
1 Directory commands
- getwd() tells you what is the path of the R project (.Rproj)
getwd()
## [1] "D:/Dropboxvhir/Nuevo Equipo VHIR10 Dropbox/Estela Ariza Moral/UEB-compartida/Aplicacions i Programari/R_Manual"
- setwd() allows to set/change the directory for the R project (.Rproj)
# This example sets the directory in Desktop folder, R project is saved in "e:/user.name/Desktop/my_project.Rproj"
setwd("e:/user.name/Desktop")
- dir() extracts the file names, by default extract the file names of the current working directory of the R project
# same as dir(getwd()):
dir()
## [1] "data" "diabetes.xls"
## [3] "my_work_space.RData" "R_Manual-DataManipulation.html"
## [5] "R_Manual-DataManipulation.Rmd" "R_Manual-DataManipulation.Rproj"
## [7] "Rcode"
Get the file names of another working directory:
<- "C:/user.name/directory/file/" path
dir(path)
## character(0)
Print the objects in environment:
print(ls())
## [1] "path"
Remove objects from environment:
# removes all objects from memory
rm(list=ls())
print(ls())
# remove specific objects
rm(dat, path)
2 Good to know
- Packages require() and library() both load a R package, but they have a main difference:
- require() will output a warning if a package is not installed and then continue to execute the code. It is usually only used if you are loading packages inside a function so that the function will continue to execute even if a package doesn’t exist.
- library() will output an error and stop the execution of the code.
- To install old versions of packages, try these options in a new R Script:
# option 1
remove.packages("knitr") # remove the actual version
require(remotes)
::install_version(package ="knitr", version ="1.43", repos=getOption("repos"))
remotes
# option 2
remove.packages("kableExtra") # remove the actual version
require(utils)
<- "https://cran.r-project.org/src/contrib/Archive/kableExtra/kableExtra_1.3.4.tar.gz"
packageurl ::install.packages(pkgs = packageurl, repos = NULL,type = "source") utils
p_load() allows to install and load a R package, but first install and load package pacman.
- Example: p_load(ggplot2, dplyr)
identical() returns boolean TRUE/FALSE if two R objects (data frames, vectors, matrices, …) are the same.
- Example: identical(dat1, dat2)
To access for more information for:
- For a given function of a R package, in the console type “?func”.
- Example: ?identical
- For a package:
- Example: help(package=“dplyr”)
- For a given function of a R package, in the console type “?func”.
Load package magrittr to use operator %<>% in R.
apply(), lapply(), sapply():
- apply(): lets us apply a function to the rows or columns of a matrix or data frame.
- Example: apply( sample_matrix, 2, mean) -> use apply() function across column to find mean of each column
- lapply(): lets us apply a function to list objects.
- Example: lapply(names, toupper) -> all strings in list names are in capital letters
- sapply(): lets us apply a function to lists, vectors, data frames and returns an array or matrix object.
- Example: sapply(sample_data, max) -> get the maximum from columns of data set sample_data
- apply(): lets us apply a function to the rows or columns of a matrix or data frame.
Function filter() + combined with operator ! -> be careful. For example, assume you have a factor variable [cat] that takes values “No”, “Yes” and some are NA. If you do dat %>% filter(cat!=“No”) it will exclude those rows where NA in [cat], but maybe you are still interested in keeping those cases even if they have [cat]=NA. In that case, do dat %>% filter(cat!=“No” | is.na(cat)).
- Example: dat %>% filter(grupo_1!=“0” | is.na(grupo_1))
Use a particular function from a package. This is useful when you load two packages that have a function with the same name.
- Example: dplyr::select() -> needs to be specified if in the same project you load package dplyr and MASS (MASS::select()). Both packages have a select() function.
Load built-in dataset into the environment
- Example: data(iris)
Difference between paste() and paste0():
- paste0(): allows concatenating strings but by default does not leave a blank space in between strings. Example: paste0(“ID”, seq(10))
- paste(): allows concatenating strings and by default leaves a blank space in between strings. Example: paste(“ID”, seq(10))
3 Read data
3.1 xls/xlsx
p_load(readxl)
<- read_excel("diabetes.xls") # read_excel("diabetes.xlsx")
dat head(dat)
class(dat)
## [1] "tbl_df" "tbl" "data.frame"
p_load(rio)
# form rio package -> same as rio::import("diabetes.xls")
<- import("diabetes.xls")
dat class(dat)
## [1] "data.frame"
# export
export("diabetes.xls")
3.2 Flat file databases
In flat file databases, records (rows) follow a uniform format, and there are no structures for indexing or recognizing relationships between records (rows).
util package is loaded by default when you load R and it has these functions:
- read.table() : Main function. Reads a file in table format and creates a data frame from it. It offers many arguments to classify the incoming data.
- read.csv() : Wrapper function for read.table(). Used to read comma-separated (CSV) files.
- read.delim() : Wrapper Function used to read tab-separated files. read.delim() is used if the numbers in your file use periods(.) as decimals.
- read.csv2() : read.csv() and read.csv2() are identical. The only difference is that they are set up depending on whether you use periods or commas as decimal points in numbers.
- read.delim2() : read.delim2 is used when the numbers in your file use commas(,) as decimals.
- rio package can also be used for importing and exporting flat files
3.3 R data files
Rdata can be used to save one or multiple objects or the entire work session.
# Saving on object in RData format
save(dat, file = "my_data.RData")
# Save multiple objects
save(data1, data2, file = "my_data.RData")
# To load the data again
load("my_data.RData")
Store the entire work session:
# export
save.image(file = "my_work_space.RData")
# import
load("my_work_space.RData")
Rds used to save one single object.
# export
saveRDS(dat, file="my_data.rds")
# import
readRDS(file="my_data.rds")
4 Basic data frame operations
From now on, we use package dplyr.
p_load(dplyr, magrittr, forcats)
Display structure of data frame:
str(dat)
## 'data.frame': 149 obs. of 11 variables:
## $ numpacie: num 1 2 3 4 5 6 7 8 9 10 ...
## $ mort : chr "Vivo" "Vivo" "Vivo" "Vivo" ...
## $ tempsviu: num 12.4 12.4 9.6 7.2 14.1 14.1 12.4 14.2 12.4 14.5 ...
## $ edat : num 44 49 49 47 43 47 50 36 50 49 ...
## $ bmi : num 34.2 32.6 22 37.9 42.2 33.1 36.5 38.5 41.5 34.1 ...
## $ edatdiag: num 41 48 35 45 42 44 48 33 47 45 ...
## $ tabac : chr "No fumador" "Fumador" "Fumador" "No fumador" ...
## $ sbp : num 132 130 108 128 142 156 140 144 134 102 ...
## $ dbp : num 96 72 58 76 80 94 86 88 78 68 ...
## $ ecg : chr "Normal" "Normal" "Normal" "Frontera" ...
## $ chd : chr "No" "No" "Si" "Si" ...
Display most relevant statistics of all variables in data frame:
summary(dat)
## numpacie mort tempsviu edat
## Min. : 1.00 Length:149 Min. : 0.00 Min. :31.00
## 1st Qu.: 38.00 Class :character 1st Qu.: 7.30 1st Qu.:43.00
## Median : 75.00 Mode :character Median :11.60 Median :50.00
## Mean : 75.01 Mean :10.52 Mean :52.17
## 3rd Qu.:112.00 3rd Qu.:13.90 3rd Qu.:60.00
## Max. :149.00 Max. :16.90 Max. :86.00
## bmi edatdiag tabac sbp
## Min. :18.20 Min. :26.00 Length:149 Min. : 98.0
## 1st Qu.:26.60 1st Qu.:38.00 Class :character 1st Qu.:124.0
## Median :31.20 Median :45.00 Mode :character Median :138.0
## Mean :31.78 Mean :45.99 Mean :139.1
## 3rd Qu.:35.20 3rd Qu.:53.00 3rd Qu.:152.0
## Max. :59.70 Max. :81.00 Max. :222.0
## dbp ecg chd
## Min. : 58.00 Length:149 Length:149
## 1st Qu.: 74.00 Class :character Class :character
## Median : 80.00 Mode :character Mode :character
## Mean : 90.04
## 3rd Qu.: 88.00
## Max. :862.00
# for a particular column
summary(dat$numpacie)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 38.00 75.00 75.01 112.00 149.00
Reset row index:
# useful when you have previously sliced (filtered) the data frame using dat[rows_filtered, ]
rownames(dat) <- NULL
Generate new column that takes the values of the row indices:
# option 1
%>% tibble::rownames_to_column(var="Old_RowIndex") dat
# option 2
%<>% mutate(Old_RowIndex=row_number()) dat
Create empty data frame:
# data.frame(col1_name=character(n), col2_name=character(n), col3_name=character(n))
data.frame(col1=character(2), col2=integer(2), col3=numeric(2)) # n=2 -> empty data frame with two rows
5 Row-wise operations
5.1 Check for uniqueness in a column
Filter rows that appear more than once in column [numpacie]:
# option 1
# dat %>% filter(numpacie %in% names(which(table(dat$numpacie) > 1))) %>% View() -> same but it previous the operation in the View window
%>% filter(numpacie %in% names(which(table(dat$numpacie) > 1))) dat
# To verify that filter is correct:
%>% filter(numpacie=="135")# indeed, numpacie 135 has two records dat
# option 2
%>% arrange(numpacie) %>% group_by(numpacie) %>% filter(n()>1) dat
And only view data frame for columns [numpacie], [mort]:
# dat %>% filter(numpacie %in% names(which(table(dat$numpacie) > 1))) %>% select(numpacie, mort, edat, bmi) %>% View() -> same but it previous the operation in the View window
%>% filter(numpacie %in% names(which(table(dat$numpacie) > 1))) %>% select(numpacie, mort, edat, bmi) dat
5.2 Check uniqueness of two columns
Check if each [numpacie] is assigned to a unique [id]: in this example, numpacie 135 is assigned to id 134 and id 135.
%>% group_by(numpacie) %>% filter(n_distinct(id)>1) dat
# To verify that filter is correct:
%>% filter(numpacie=="135")# indeed, numpacie 135 is assigned to two ids dat
Check if each [id] is assigned to a unique [numpacie]: in this example, id 1 is assigned to numpacie 1, 2. And id 5 is assigned to numpacie 5,6.
%>% group_by(id) %>% filter(n_distinct(numpacie)>1) dat
5.3 Filter rows
5.3.1 Remove NA rows
# option 1
<- dat[apply(dat, 1, function(row) all(is.na(row))), ] dat
# option 2
<- dat[rowSums(is.na(dat))==ncol(dat), ] dat
Now remove rows where all columns are NA or where column ID has NA:
<- dat[ !(is.na(dat$id) | rowSums(is.na(dat))==ncol(dat)), ] dat
5.3.2 Filter firsts rows based on condition
Remove firsts rows until you find row that has character “PAT_ID” regardless of the column where “PAT_ID” is in.
# find row index where "PAT_ID" appears in any column
<- which(apply(dat, 1, function(row) any(row=="PAT_ID")))[1]
row_idx # remove rows before row index where "PAT_ID" appears in any column
<- dat[(row_idx):nrow(dat), ] dat
5.3.3 Filter rows that do not match condition(s)
# option 1
-which(dat$mort == "Vivo" & dat$tabac == "No fumador"),] dat[
# option 2 -> row index is the count of rows that match condition
%>% filter(!(mort == "Vivo" & dat$tabac == "No fumador")) dat
Both options are not identical because option 1 keeps original row index when slicing but option 2 resets the row index.
identical(dat %>% filter(!(mort == "Vivo" & dat$tabac == "No fumador")), dat[-which(dat$mort == "Vivo" & dat$tabac == "No fumador"),])
## [1] FALSE
5.3.4 Filter rows that match condition(s)
Match one condition:
# option 1
which(dat$tabac == "No fumador"),] dat[
# option 2
%>% filter(tabac == "No fumador") dat
# To check that filter is correct:
%>% filter(tabac == "No fumador") %>% nrow()# is 57 dat
## [1] 57
table(dat$tabac)#count is 57 for 'No fumador'
##
## Ex fumador Fumador No fumador
## 41 51 57
Match two conditions:
You could also add more conditions for filtering rows.
# option 1
which(dat$mort == "Vivo" & dat$tabac == "No fumador"),] dat[
# option 2
%>% filter(mort == "Vivo" & dat$tabac == "No fumador") dat
Filter rows that have “veces” or “día” in column [fruta_freq]
%>% filter(grepl("veces|día", fruta_freq)) %>% select(numpacie, fruta_freq) dat
5.3.5 Filter rows that have same value in all columns
Filter those rows that at least one of the columns have ‘No Info’:
# same as: dat %>% filter(if_any(everything(), ~. == "No Info"))
%>% filter(if_any(everything(), ~. %in% "No Info")) dat
Filter those rows that have ‘No Info’ in all columns:
# alternative: dat %>% filter(if_all(everything(), ~. == "No Info"))
%>% filter(if_all(everything(), ~. %in% "No Info")) dat
Filter those rows that have ‘No Info’ in all columns regardless of the values in columns [numpacie], [edat]. In other words, [numpacie] and [edat] can have ‘No Info’ or something else, they will not be taken into account when checking condition:
%>% filter(if_any(.cols=-c(numpacie, edat), ~. %in% "No Info")) dat
Filter rows where at least one numeric column has a value greater than 500. In this example, [dbp] has two rows with values greater than 500:
%>% filter(if_any(.cols=where(is.numeric),~.>500)) dat
5.3.6 Filter rows based on a list
5.3.6.1 Match condition
Only keep rows where numpacie is in list:
%>% filter(numpacie %in% c("8", "19", "20")) dat
Do not keep rows where numpacie is in list:
%>% filter(!numpacie %in% c("8", "19")) dat
5.3.6.2 Match conditions
Do not keep rows where numpacie is in list and keep rows where age is above 40:
%>% filter(!numpacie %in% c("8", "19") & edat > 40) dat
Do not keep rows where numpacie is in list nor age is above 40:
%>% filter(!(edat>40) & !numpacie %in% c("8", "19")) dat
5.3.7 Filter row with last information in a given column
This is useful if you have a data frame in long format, that is, several rows refer to the same individual.
For each individual, keep the last row that has information in column [vivo] for each
%>% group_by(id) %>% filter(row_number() == max(which(complete.cases(vivo)))) %>% ungroup() %>% as.data.frame() dat1
5.4 Collapse rows
Assume you want to collapse multiple rows into one single row:
<- data.frame(
data ID = c("01", "01", "01", "02", "02"),
X = c(NA, 1, 1, NA, NA),
Y = c(NA, 200, NA, 1, NA),
Z = c(NA, NA, NA, 1, 1)
)
%>% group_by(ID) %>% summarise_all(function(x) ifelse(all(is.na(x)), NA, na.omit(x)[1])) data
6 Column-wise operations
Select specific columns:
# alternative: dat %>% select(mort, bmi))
%>% select(c(mort, bmi)) dat
Select first five columns:
%>% select(c(head(names(.),5))) dat
Unselect column(s):
# alternative: dat %>% select(-mort, -bmi))
%>% select(-c(mort, bmi)) dat
Uselect last five columns:
%>% select(-c(tail(names(.),5))) dat
6.1 Select columns based on condition
- Select only those columns that are numeric
- From those numeric columns, select only the ones that have all rows with values above 500. In this case, no column has all rows with values greater than 500.
%>% select(where(is.numeric)) %>% select_if(~all(.>500)) %>% ncol() dat
## [1] 0
- Select only those columns that are numeric
- From those numeric columns, select only the ones that have at least one value greater than 500. In this case, only column [dbp] has some values greater than 500.
%>% select(where(is.numeric)) %>% select_if(~any(.>500)) dat
# verify that filter is correct -> there are two rows with values greater than 500 in column dbp
%>% select(where(is.numeric)) %>% filter(dbp >500) dat
Select columns that contain at least one row with values “Sense especificar”, “N/A”:
%>% select(where(~any(. %in% c("Sense especificar", "N/A")))) dat
Remove columns that have NA in all rows:
%<>% select(-where(~all(is.na(.)))) dat
Select/Remove columns that have a root on the name:
library(dplyr)
# Remove variables with complete in the name
%>% select(-contains("complete"))
dat # Select variables with complete in the name
%>% select(contains("complete")) dat
6.2 Reorder columns
In this example, we set [numpacie], [edat] as first columns and then the remaining columns.
%>% select(numpacie, edat, everything()) dat
6.3 Column types
Display column class of one variable:
class(dat$mort)
## [1] "character"
Character column types of [mort], [tabac], [chd], [ecg] to be factor types:
# option 1
$mort <- as.factor(dat$mort)
dat$tabac <- as.factor(dat$tabac)
dat$chd <- as.factor(dat$chd)
dat$ecg <- as.factor(dat$ecg) dat
# option 2, instead of as.factor, we could also do as.numeric, as.character
%<>% mutate_at(vars(c(mort, tabac, chd, ecg)), as.factor) dat
Set all columns to character type:
%<>% mutate_all(as.character) dat
6.3.1 Change a column type to a specific type
Assume some columns are of type integer64 but you do not know exactly which ones are of type integer64. Replace all columns of type integer64 to nuemric type.
<- sapply(df, function(x) any(class(x) == "integer64"))
int64_columns
# Use dplyr to convert the identified columns to numeric
<- df %>%
df mutate(across(all_of(names(df)[int64_columns]), as.numeric))
6.3.2 Change column types based on condition
Change column types if column names contain string “fecha”:
<- names(dat %>% select(grep("fecha",names(.),value = T)))
var_date %<>% mutate_at(all_of(var_date), list(~as.Date(., "%d-%m-%Y")),.keep = all) dat
Change column types if column names also start with “grupo_”:
<- names(dat %>% select(mort, starts_with("grupo_")))
var_fac %<>% mutate_at(all_of(var_fac), as.factor) dat
6.4 Column names
Display column names: - colnames(): specific function for data frames - names(): can be used for data frames, lists, etc.
# option 1, can also do: names(dat)
colnames(dat)
## [1] "numpacie" "mort" "tempsviu" "edat" "bmi"
## [6] "edatdiag" "tabac" "sbp" "dbp" "ecg"
## [11] "chd" "id" "fruta_freq"
# option 2
%>% names() dat
## [1] "numpacie" "mort" "tempsviu" "edat" "bmi"
## [6] "edatdiag" "tabac" "sbp" "dbp" "ecg"
## [11] "chd" "id" "fruta_freq"
Get column names in between two columns:
# option 1
%>% select(tempsviu:ecg) %>% names() dat
## [1] "tempsviu" "edat" "bmi" "edatdiag" "tabac" "sbp" "dbp"
## [8] "ecg"
# option 2
names(dat %>% select(tempsviu:ecg))
## [1] "tempsviu" "edat" "bmi" "edatdiag" "tabac" "sbp" "dbp"
## [8] "ecg"
6.4.1 Check if a string is a column name
"num" %in% colnames(dat)
## [1] FALSE
"numpacie" %in% colnames(dat)
## [1] TRUE
Check if any column name contains the string “num”:
any(grepl("num", colnames(dat)))
## [1] TRUE
Select those column names that contain “num”:
colnames(dat)[grepl("num", colnames(dat))]
## [1] "numpacie"
6.4.2 Rename a group of columns
Assume you have many variables that start with “group_”. For instance, “group_1_X_Y_X”, “group_2_X_Y_XZZ”, “group_3_X_YXY”,..
Rename columns so that they are as follows:
# option 1: dat only keeps columns that start with "grupo_" and renames them
%<>%
dat select(starts_with("grupo_")) %>%
setNames(sub("grupo_(\\d+)_.*", "grupo_\\1", colnames(.)))
# option 2: keeps all columns and renames columns that start with "grupo_"
<- colnames(dat)[startsWith(colnames(dat), "grupo_")]
grupo_columns
# Rename the selected columns
colnames(dat)[startsWith(colnames(dat), "grupo_")] <- sub("grupo_(\\d+)_.*", "grupo_\\1", grupo_columns)
6.4.3 Modify column names
“Clean column names” so that they only contain “” characters (blank spaces replaced by ””), numbers and letters in lowercase.
p_load(janitor)
<- janitor::clean_names(dat) dat
Only rename selected column(s):
# option 1 -> by referencing column name
colnames(dat)[colnames(dat) == "numpacie"] <- "id_paciente"
colnames(dat)[colnames(dat) %in% c("numpacie", "tempsviu")] <- c("id_paciente", "temps_viu")
# option 2 -> not recommended, "numpacie" might not always be the first column
colnames(dat)[1] <- "id_paciente"
colnames(dat)[c(1,3)] <- c("id_paciente", "temps_viu")
# option 3
%>% colnames() dat
## [1] "numpacie" "mort" "tempsviu" "edat" "bmi"
## [6] "edatdiag" "tabac" "sbp" "dbp" "ecg"
## [11] "chd" "id" "fruta_freq"
%>% rename("id_paciente"="numpacie", "temps_viu"="tempsviu") %>% colnames() dat
## [1] "id_paciente" "mort" "temps_viu" "edat" "bmi"
## [6] "edatdiag" "tabac" "sbp" "dbp" "ecg"
## [11] "chd" "id" "fruta_freq"
6.4.3.1 Based on a parameter
= "A"
str_suffix %>% mutate(!!glue::glue("grup_{str_suffix}") = "A")
dat
= "hosp"
str_suffix %>% group_by(id) %>% dplyr::summarize(!!glue::glue("num_{str_suffix}") := n()) dat
6.4.3.2 Row as column names
# extract first row
<- as.character(dat[1,])
col_names # remove first row from dat
<- dat[-1, ]
dat # set first row as column names
colnames(dat) <- col_names
6.4.3.3 Add prefix/suffix
For this section, you need to revise regular expressions. For basic notions, check this website.
6.4.3.3.1 Suffix
To all columns, add them a suffix:
%>% rename_with(~paste0(., "_dataset1")) dat
# based on a parameter
<- "basal"
time_str %>% rename_with(~paste0(., glue::glue("_{time_str}"))) dat
To selected columns, add them a suffix:
%>% rename_at(vars(-numpacie), ~paste0(., "_dataset1")) dat
# based on a parameter
<- "basal"
time_str %>% rename_at(vars(-numpacie), ~paste0(., glue::glue("_{time_str}"))) dat
Remove “_digit” where digit could be 1, 50, 100, …:
%>% colnames() dat
## [1] "numpacie" "mort" "tempsviu" "edat" "bmi"
## [6] "edatdiag" "tabac" "sbp" "dbp" "ecg"
## [11] "chd" "id" "fruta_freq"
# first we add suffix to all column names
%<>% rename_with(~paste0(., "_", seq_along(.)))
dat %>% colnames() dat
## [1] "numpacie_1" "mort_2" "tempsviu_3" "edat_4"
## [5] "bmi_5" "edatdiag_6" "tabac_7" "sbp_8"
## [9] "dbp_9" "ecg_10" "chd_11" "id_12"
## [13] "fruta_freq_13"
p_load(stringr)# for str_remove()
%>% colnames() dat
## [1] "numpacie_1" "mort_2" "tempsviu_3" "edat_4"
## [5] "bmi_5" "edatdiag_6" "tabac_7" "sbp_8"
## [9] "dbp_9" "ecg_10" "chd_11" "id_12"
## [13] "fruta_freq_13"
# then we remove "_digit"
%<>% rename_with(~str_remove(.,"_\\d+$"), everything())
dat %>% colnames() dat
## [1] "numpacie" "mort" "tempsviu" "edat" "bmi"
## [6] "edatdiag" "tabac" "sbp" "dbp" "ecg"
## [11] "chd" "id" "fruta_freq"
6.4.3.3.2 Prefix
Same workaround as when adding a suffix.
To all columns, add them a prefix:
%>% rename_with(~paste0("var_",.))
dat
# based on a parameter
<- "basal"
time_str %>% rename_with(~paste0(glue::glue("_{time_str}", .))) dat
6.5 Column labels
Need to load Hmisc library.
Display current column labels:
p_load(Hmisc)
::label(dat) Hmisc
## numpacie mort tempsviu edat bmi edatdiag tabac
## "" "" "" "" "" "" ""
## sbp dbp ecg chd id fruta_freq
## "" "" "" "" "" ""
6.5.1 Modify column labels
Workaround is very similar to modifying the column names.
# option 1 -> by referencing column name
::label(dat[,"numpacie"]) <- "ID Paciente"
Hmisc
::label(dat,self = F)[names(dat) %in% c("numpacie", "tempsviu")] <- c("ID Paciente", "Tiempo Vivo") Hmisc
# option 2 -> not recommended, "numpacie" might not always be the first column
::label(dat)[1] <- "ID"
Hmisc
::label(dat,self = F)[c(1,3)] <- c("ID Paciente", "Tiempo Vivo") Hmisc
Modify all column labels:
# labels_cols <- sapply(dat1, Hmisc::label) if you want to have same labels as dat1 in dat
<- c("ID", "muerte", "tiempo vivo", "edad", "BMI", "edad diag", "tabaco", "SBP", "DBP", "ECG", "CHD")
labels_cols ::label(dat) <- as.list(labels_cols) Hmisc
6.6 Create Variables
Assume you have a this data set:
id, time, alive_v1, alive_v2, alive_v3,
ID1, t0, 1, NA, NA,
ID1, t1, NA, 1, NA,
ID1, t2, NA, NA, 0
You could just simply have:
id, time, alive,
ID1, t0, 1,
ID1, t1, 1,
ID1, t2, 0
%<>% mutate( alive = coalesce(alive_v1, alive_v2, alive_v3)) %>% dplyr::select(-c(alive_v1, alive_v2, alive_v3)) dat
6.6.1 Factor variables
Factor variable with more than 2 levels, you should indicate the order of the levels by specifying levels=c():
# option 1
%<>% mutate(fruta_freq=factor(case_when(edat<30 ~ "Nunca", edat <=30 & edat < 40 ~ "A veces",
dat >=40 & edat<50 ~ "Bastantes veces",
edat>=50 & edat<60 ~ "Cada dos días",
edat>= 60 & edat < 70 ~ "Varias al día",
edat >= 70 ~ "Siempre"), levels=c("Nunca", "A veces", "Bastantes veces", "Cada dos días", "Varias al día", "Siempre"))) edat
# option 2
$fruta_freq <- case_when(dat$edat<30 ~ "Nunca", dat$edat <=30 & dat$edat < 40 ~ "A veces",
dat$edat>=40 & dat$edat<50 ~ "Bastantes veces", dat$edat>=50 & dat$edat<60 ~ "Cada dos días", dat$edat >= 60 & dat$edat < 70 ~ "Varias al día", dat$edat >= 70 ~ "Siempre") dat
Based on another factor variable (reduce levels of a factor variable into a new factor variable with less number of levels):
In this example: - Set strings in [tabac] to lower case - If [tabac] starts with “no” or “ex”, level of new variable is “No” - If [tabac] starts with “fumador”, level of new variable is “Si”
p_load(stringr)
%<>% mutate(fumador=factor(case_when(str_detect(tolower(tabac), "^no") | str_detect(tolower(tabac), "^ex") ~ "No",
dat str_detect(tolower(tabac), "^fumador") ~ "Si"),
levels=c("No", "Si")))
6.6.2 Numeric variables
Generate variable that is the difference between two dates: - First create two Date columns - Then create new column that is the difference in days between these Date columns
%>% mutate(start_date=sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 149),
dat end_date=sample(seq(as.Date('2000/01/02'), as.Date('2000/12/31'), by="day"), 149),
duration=as.numeric(difftime(end_date, start_date)))
Assume you have 20 factor variables with levels 1/0 and they all start with “V”. You would like to count the number of 1’s that each row has for all these “V” variables. Imagine that each “V” variable refers to illness type 1, illness type 2, …, illness type 20. In this example, counting the number of 1’s for each row implies knowing how many illnesses each row (patient) has.
In the example, patient with ID1 has 5 illnesses since there are five 1’s in the binary variables that start with “V”.
# na.rm=TRUE -> do the counting even if there is a NA in one of the "V" columns
as.data.frame(matrix(sample(c(0, 1, NA), 149*20, replace = TRUE), ncol=20)) %>% mutate(id=paste0("ID", seq(1:149))) %>% select(id, everything()) %>% mutate(num_illnesses=rowSums(select(., c(starts_with("V"))), na.rm=TRUE)) %>% head()
6.7 Modify Variables
Replace values in rows that contain “Sin especificar”:
In this example, the code replaces “Sin especificar” that is present in any column of the data set
<- replace(dat, dat == "Sin especificar", NA)
dat # could also do: replace(dat, dat == "Sin especificar" | "N/A", NA)
6.7.1 Replace values in column that match condition(s)
Replace value in [edat] for cases where [numpacie] is 1, otherwise [edat] remains unchanged:
# option 1
$edat[dat$numpacie==1] <- 10 dat
# option 2
## alternative: dat[which(dat$numpacie==1),]$edat
which(dat$numpacie==1),"edat"] <- 10 dat[
# option 3
## before
%>% filter(numpacie==1) %>% select(numpacie, edat) dat
## after
%>% mutate(edat = case_when(numpacie==1 ~ 10, numpacie!=1 ~ edat)) %>% filter(numpacie==1) %>% select(numpacie, edat) dat
By multiple conditions:
# option 1
## before
$edat[dat$tabac=="No fumador" & dat$ecg=="Normal"] dat
## [1] 10 47 49 54 40 38 44 45 35 44 47 78 71 66 42 67 86 42 60 81 60 45 61 82 50
## [26] 71 69 48 58 52 52 59 49 60 61 54 52 75
## after
$edat[dat$tabac=="No fumador" & dat$ecg=="Normal"] <- 10
dat$edat[dat$tabac=="No fumador" & dat$ecg=="Normal"] dat
## [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
## [26] 10 10 10 10 10 10 10 10 10 10 10 10 10
# option 3
## before
%>% filter(tabac=="No fumador" & ecg=="Normal") %>% select(numpacie, edat) dat
## after
%>% mutate(edat = case_when(tabac=="No fumador" & ecg=="Normal" ~ 10,
dat !(tabac=="No fumador" & ecg=="Normal") | (is.na(tabac) | is.na(ecg)) ~ edat)) %>%
filter(tabac=="No fumador" & ecg=="Normal") %>% select(numpacie, edat)
6.7.2 Forward fill
This is useful when you have clinical variables, such as gender, that does not change over time but you only have them with information for time 0. The example below replaces the NA by the the value of gender that is not NA.
Toy example:
id, time, gender
01, 0, male
01, 1, NA
01, 2, NA
Now it would be:
id, time, gender
01, 0, male
01, 1, male
01, 2, male
# option 1
%<>% group_by(numpacie) %>% mutate(gender=gender[!is.na(gender)]) %>% ungroup() %>% as.data.frame() dat
# option 2 -> useful if you want to do it for more than one variable at the same time
%>% group_by(numpacie) %>% fill(c("gender"), .direction = "down") %>% ungroup() %>% as.data.frame() dat
6.7.3 Modify factor variables
Rename levels of a factor variable. In this example, previous levels are “Vivo” and “Muerto” and they are replaced by “alive” and “dead” respectively.
table(dat$mort)
##
## Muerto No Info Vivo
## 24 5 120
%>% mutate(mort = plyr::revalue(mort, c("Vivo" = "alive", "Muerto" = "dead"))) %>% pull(mort) %>% table() dat
## .
## alive dead No Info
## 120 24 5
Collapse levels
Collapse levels above 1 to level 1
# option 1
$x1 <- fct_collapse(dat$x1, "1"=levels(x1)[-1]) dat
# option 2
%>% mutate(x1=fct_collapse(x1, "1"=levels(x1)[-1])) dat
Collapse string levels
# option 1
$tabac <- fct_collapse(dat$tabac, "No fumador"= c("Ex fumador", "No fumador"))
dat
## could also just simply create new variable instead of replacing original variable
$fumador <- fct_collapse(dat$tabac, "No fumador"= c("Ex fumador", "No fumador")) dat
# option 2
%>% mutate(tabac=fct_collapse(tabac, "No fumador"= c("Ex fumador", "No fumador")))
dat
## could also just simply create new variable instead of replacing original variable
%>% mutate(fumador=fct_collapse(tabac, "No fumador"= c("Ex fumador", "No fumador"))) dat
Change levels for all factors in a data frame
# Change all values of factors variables to NA or the same value
for (i in 1:ncol(dat)) {
if (is.factor(dat[[i]])) {
levels(dat[[i]])[levels(dat[[i]]) %in% c("Unknown")] <- NA
} }
6.7.4 Modify columns using regular expressions
In this example, replaces rows in column [hour] that are not digits by NA:
$hour <- sample(c(round(runif(20, min=0, max=23.59), 2), "N/A", "09/09/9999"), 149, replace=TRUE)
dat
# first check if 9999 is in column
any(grepl("9999", dat$hour))
# check if string contains letters such as N, A from N/A
any(grepl("[a-zA-Z]+", dat$hour))
# now replace any values that are not digit with NA
$hour <- ifelse(grepl("\\d", dat$hour), dat$hour, NA) dat
7 Row-wise and column-wise operations
- Select [numpacie] column and columns with column names that start with “grupo_”
- For columns with column names that start with “grupo_”, filter rows that have values different than “0”.
For columns with column names that start with “grupo_”, filter rows that have values different than “0” in all of the columns that start with “grupo_”:
%>% select(numpacie, starts_with("grupo_")) %>% filter_at(vars(starts_with("grupo_")), all_vars(.!="0")) dat
For columns with column names that start with “grupo_”, filter rows that have at least a value different than “0” in one of the columns that start with “grupo_”:
%>% select(numpacie, starts_with("grupo_")) %>% filter_at(vars(starts_with("grupo_")), any_vars(.!="0")) dat
7.1 Group by
7.1.1 Access grouped data
<- dat %>% group_by(id)
dat_grups <- group_split(dat_grups)
ls_grups # access first group
1]] ls_grups[[
Now assume you want to access the specific group for ID “1141959”:
<- ls_grups %>% map(~ filter(.x, id_cas == "1141959"))
res !sapply(res, function(df) nrow(df) == 0)] res[
8 Compare data frames
Check which IDs in dataset 1 are not present in dataset 2, assuming that ID codification is the same in both data sets:
setdiff(dat1$id,dat2$id)
%>% filter(id %in% dat2$id) %>% nrow() dat1
8.1 Compare count of IDs in each data frame
# Calculate counts of IDs in each dataset
<- table(dat1$id_cas)
counts_dat1 <- table(dat2$id_cas)
counts_dat2
# Find all unique IDs across both datasets
<- union(names(counts_dat1), names(counts_dat2))
all_ids
# Initialize a data frame to store the results
<- data.frame(ID = all_ids, Count_in_dat1 = 0, Count_in_dat2 = 0)
result_df
# Fill in the counts for each dataset
$Count_in_dat1[result_df$ID %in% names(counts_dat1)] <- counts_dat1[result_df$ID %in% names(counts_dat1)]
result_df$Count_in_dat2[result_df$ID %in% names(counts_dat2)] <- counts_dat2[result_df$ID %in% names(counts_dat2)] result_df
8.2 Check for common column names
intersect(names(df1), names(df2))
9 Join data frames
Can join data frames by using merge() or join() functions from dplyr package. In tidyverse, a “merge” is called a “join” which is the term more commonly used in relational databases.. This manual works with join():
This is are most common types of joins:
inner_join() corresponds to merge(X, Y) – only retain rows that match between X and Y.
left_join() corresponds to merge(X, Y, all.x = T) – retain all rows from X, whether or not they match in Y.
right_join() corresponds to merge(X, Y, all.y = T) – retain all rows from Y, whether or not they match in X.
full_join() corresponds to merge(X, Y, all = T) – retain all rows from both X and Y, whether or not they match.
# option 1
<- full_join(dat1, dat2) dat
# option 2 using pipe
<- dat1 %>% full_join(dat2) dat
10 Missing data
<- as.data.frame(matrix(sample(c(0, 1, NA), 149*20, replace = TRUE), ncol=20)) %>% mutate(id=paste0("ID", seq(1:149))) %>% select(id, everything())
dat2 head(dat2)
Check if data frame has at least one NA:
any(is.na(dat2))
## [1] TRUE
Count number of missings for each column:
# option 1
sapply(dat2, function(x) sum(is.na(x)))
## id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 0 51 56 44 55 54 49 55 51 49 49 50 39 50 41 48 38 57 44 56
## V20
## 58
# option 2
colSums(is.na(dat2))
## id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
## 0 51 56 44 55 54 49 55 51 49 49 50 39 50 41 48 38 57 44 56
## V20
## 58
Count number of missings for each row:
rowSums(is.na(dat2))
## [1] 5 2 6 2 8 6 7 6 8 5 5 7 9 3 5 8 7 9 4 3 5 11 9 4 6
## [26] 10 4 10 6 7 8 4 10 5 8 9 3 11 10 6 5 2 2 10 7 5 8 8 13 7
## [51] 8 7 6 6 7 4 6 4 10 11 6 7 7 6 5 6 7 5 7 7 9 3 9 6 11
## [76] 9 6 4 4 5 12 7 6 6 8 5 4 4 9 6 6 6 5 7 11 5 9 7 7 12
## [101] 6 4 6 6 9 5 6 7 7 12 8 5 9 7 3 7 8 5 6 9 5 6 7 6 9
## [126] 7 8 5 7 10 6 7 8 6 8 5 5 9 6 9 6 5 7 7 3 4 8 7 5
# alternative -> rowSums(is.na(dat2) | dat2 == "" | dat2 == " ")
# create column that counts NAs per row
# alternative -> dat2 %>% mutate(row_count_na=rowSums(is.na(select(., -id))))
%>% mutate(row_count_na=rowSums(is.na(select(., everything())))) dat2
11 Generate matrix
Slicing:
- Select a row: mat[2, ] -> select row 2
- Select a column: mat[,2] -> select second column
- Select an element: mat[2,2] -> select element in row 2 and column 2
# option 1
matrix(c(1,1, 2,2), nrow=2, ncol=2)
## [,1] [,2]
## [1,] 1 2
## [2,] 1 2
as.data.frame(matrix(c(1,1, 2,2), nrow=2, ncol=2))
# option 2 -> matrix(c(col1_vector, col2_vector), 2)
matrix(c(c(1,1), c(2,2)), ncol=2)
## [,1] [,2]
## [1,] 1 2
## [2,] 1 2
# option 3 -> column wise: cbind(col1_vector, col2_vector)
cbind(c(1,1), c(2,2))
## [,1] [,2]
## [1,] 1 2
## [2,] 1 2
# option 4 -> row wise: rbind(row1_vector, row2_vector)
rbind(c(1,2), c(1,2))
## [,1] [,2]
## [1,] 1 2
## [2,] 1 2
12 Plots
12.1 GGPLOT
ggplot using parameters:
<- "bsa"
var_out <- "contraste"
var_comp
ggplot(dat, aes(!!sym(var_out))) + geom_bar(aes(fill = !!sym(var_comp)), position = "dodge")
Add two ggplots in same figure:
# first plot is the density function of var_num grouped by var_fact
<- ggplot(dat, aes(x = var_num, colour = var_fac)) + geom_density()
plot1
# boxplot of var_num grouped by var_fact
<- ggplot(dat %>% filter(estudi=="Normal") %>% filter(complete.cases(t1_n_siv_ec, edat_cat)), aes(x = edat_cat, y = contraste, fill = edat_cat)) +
plot2 stat_boxplot(geom = "errorbar",
width = 0.25) +
geom_point(aes(fill = edat_cat), shape = 21,
position = position_jitterdodge()) +
geom_boxplot(alpha = 0.3)
library(gridExtra)
# Arrange the plots side by side
grid.arrange(plot1, plot2, ncol=2)
Add several ggplots within a loop in same figure:
# Initialize lblank list
<- list()
p
# loop for plot individual graphs and asve in a list
for (i in 1:length(var_long)) {
<- dat_3 %>% filter(Var == var_long[i])
datc
cat(" \n#####", var_long[i], " \n")
### GRAFIC
<- ggplot(datc, aes_string(x = var_event, y = var_valor)) +
p[[i]]
geom_line(aes_string(group = var_id), alpha = 0.5) +
stat_summary(aes(group = 1), geom = "line", fun.y = mean,
shape = 17, size = 2, col = "red") +
theme_classic() +
labs( title = paste("Evolución global", var_long[i]),
x = NULL, y = var_valor, color = NULL)
print(p[[i]])
cat("\n")
}
# Arrange in a unique plot
do.call(grid.arrange,p)
Export graphs to pptx
#open eoffice library
library(eoffice)
#export ggplot to pptx file, option append is possible
topptx(plot2, filename = file.path( "correla.pptx"))
13 Tables
13.1 Transposed descriptive table
Descriptive table similar to the one obtained with the desc_numeric()
function but taking into account several numerical variables at the same time.
<- 90
complete_obs library(glmnet)
<- names(dat %>% select(rechazo_agudo:dehiscencia_de_sutura_bronquial,-retiro_de_grapas))
varsurv
<- NAperc(dat[,varsurv],maxNA = 100 - complete_obs)$var
vardel <- varsurv[!varsurv %in% vardel]
var_surv
<- names(dat %>% select(tiempo_isquemia_pulmon_derecho,tiempo_isquemia_pulmon_izquierdo,dias_uci,dias_intubacion,retiro_de_grapas,dias_ingreso_hosp,tiempo))
vart# tiempo_ecmo_pre solo tiene 1 valor mayor a 0
<- NAperc(dat[,vart],maxNA = 100 - complete_obs)$var
vardel <- vart[!vart %in% vardel]
var_t
<- var_surv[order(var_surv)]
var_surv_ord
<- list()
sumar <- list()
results <- matrix()
resul for (i in seq_along(var_t)) {
<- list()
sumar[[i]] <- data.frame()
results[[i]] <- c()
rows for (j in seq_along(var_surv_ord)) {
<- summary.quanti(data = dat,x = var_t[i], group = var_surv_ord[j], show.all = F,show.n = F,var.tidy = F, method = "non-param",
sumar[[i]][[j]] nround = 1)
#sumar[[i]][[j]]$summary$variable <- " "
#colnames(sumar[[i]][[j]]$summary)[1] <- c(label(dat[var_surv_ord[j]]))
<- t(sumar[[i]][[j]]$summary[,-1])
sumar[[i]][[j]] <- as.data.frame(sumar[[i]][[j]])
sumar[[i]][[j]] <- tibble::rownames_to_column(sumar[[i]][[j]], " ")
sumar[[i]][[j]] <- c(rows, rep(label(dat[var_surv_ord[j]]),nrow(sumar[[1]][[j]])))
rows <- rbind(results[[i]],sumar[[i]][[j]])
results[[i]]
}
<- cbind(resul,results[[i]])
resul
}
<- resul %>% select(-c(` `,resul))
r <- resul$` `
n
<- cbind(n,r)
final_re colnames(final_re)[1] <- " "
<- final_re %>% kable(longtable = T,escape = F, row.names = F, caption = "Summary of results for time variables",align = "c" ) %>%
kk kable_styling(latex_options = c("striped","hold_position", "repeat_header"), full_width = F, fixed_thead = T) %>%
row_spec(0, color = "white", background = "#993489") %>%
column_spec(which(names(final_re) == " "), width_max = "9em",bold = T) %>%
pack_rows(index = table(rows)) %>%
add_footnote(escape = F, notation = "symbol" )%>%
row_spec(which(final_re$` ` == "p.value"), bold = T, align = "right")
print(kk)
14 Printing tricks
Use the knitr chunk option results = "asis"
to print the html table cat(tab_model()$knitr)
More elaborated code:
<- list()
pl
for (i in seq_along(var_long)) {
cat(" \n####", var_long[i], " \n")
### GRAFIC
print(ggplot(dat3_long, aes_string(x = var_event, y = var_long[i])) +
geom_line(aes_string(group = var_id), alpha = 0.5) +
stat_summary(aes(group = 1), geom = "line", fun.y = mean,
shape = 17, size = 3) +
theme_bw() +
labs( title = paste("Evolución global", var_long[i]),
x = NULL, y = var_long[i], color = NULL))
cat("\n")
### MODEL
<- lmeControl(opt = 'optim')
ctrl <- as.formula(paste0(var_long[i], "~", var_event))
frml
############################################################### REVISAR AMB CALMA EFECTES ALEATORIS EN CADA ESTUDI ########################################
<- lme(frml, random = ~ 1 | id_sujeto, data = dat3_long,
model1 method = "REML", control = ctrl)
<- lme(frml, random = ~ Tiempo | id_sujeto, data = dat3_long,
model2 method = "REML", control = ctrl)
if (na.omit(anova(model1, model2)$"p-value") < 0.05) {
<- model2
model else {
} <- model1
model
}
cat(tab_model(model)$knitr,"\n--------\n") # THIS TABLE
cat("\n")
print(plot_model(model,colors = "#8d35cc",show.values = T) )
cat("\n")
}
Simple code:
# need this to use tab_model()
library(sjPlot)
library(sjmisc)
library(sjlabelled)
data(mtcars)
<- lm(mpg ~ cyl + hp + wt, data = mtcars)
m.mtcars cat(tab_model(m.mtcars)$knitr)
mpg | |||
---|---|---|---|
Predictors | Estimates | CI | p |
(Intercept) | 38.75 | 35.09 – 42.41 | <0.001 |
cyl | -0.94 | -2.07 – 0.19 | 0.098 |
hp | -0.02 | -0.04 – 0.01 | 0.140 |
wt | -3.17 | -4.68 – -1.65 | <0.001 |
Observations | 32 | ||
R2 / R2 adjusted | 0.843 / 0.826 |