Data manipulation


VHIR-UEB-for-019v.01

1 Directory commands

  • getwd() tells you what is the path of the R project (.Rproj)
getwd()
## [1] "D:/Dropboxvhir/Nuevo Equipo VHIR10 Dropbox/Estela Ariza Moral/UEB-compartida/Aplicacions i Programari/R_Manual"
  • setwd() allows to set/change the directory for the R project (.Rproj)
# This example sets the directory in Desktop folder, R project is saved in "e:/user.name/Desktop/my_project.Rproj"
setwd("e:/user.name/Desktop")
  • dir() extracts the file names, by default extract the file names of the current working directory of the R project
# same as dir(getwd()):
dir()
## [1] "data"                            "diabetes.xls"                   
## [3] "my_work_space.RData"             "R_Manual-DataManipulation.html" 
## [5] "R_Manual-DataManipulation.Rmd"   "R_Manual-DataManipulation.Rproj"
## [7] "Rcode"

Get the file names of another working directory:

path <- "C:/user.name/directory/file/"
dir(path)
## character(0)

Print the objects in environment:

print(ls())
## [1] "path"

Remove objects from environment:

# removes all objects from memory
rm(list=ls())
print(ls())
# remove specific objects
rm(dat, path)

2 Good to know

  • Packages require() and library() both load a R package, but they have a main difference:
    • require() will output a warning if a package is not installed and then continue to execute the code. It is usually only used if you are loading packages inside a function so that the function will continue to execute even if a package doesn’t exist.
    • library() will output an error and stop the execution of the code.
  • To install old versions of packages, try these options in a new R Script:
# option 1

remove.packages("knitr") # remove the actual version
require(remotes)
remotes::install_version(package ="knitr", version ="1.43", repos=getOption("repos"))


# option 2

remove.packages("kableExtra") # remove the actual version
require(utils)
packageurl <- "https://cran.r-project.org/src/contrib/Archive/kableExtra/kableExtra_1.3.4.tar.gz"
utils::install.packages(pkgs = packageurl, repos = NULL,type = "source")
  • p_load() allows to install and load a R package, but first install and load package pacman.

    • Example: p_load(ggplot2, dplyr)
  • identical() returns boolean TRUE/FALSE if two R objects (data frames, vectors, matrices, …) are the same.

    • Example: identical(dat1, dat2)
  • To access for more information for:

    • For a given function of a R package, in the console type “?func”.
      • Example: ?identical
    • For a package:
      • Example: help(package=“dplyr”)
  • Load package magrittr to use operator %<>% in R.

  • apply(), lapply(), sapply():

    • apply(): lets us apply a function to the rows or columns of a matrix or data frame.
      • Example: apply( sample_matrix, 2, mean) -> use apply() function across column to find mean of each column
    • lapply(): lets us apply a function to list objects.
      • Example: lapply(names, toupper) -> all strings in list names are in capital letters
    • sapply(): lets us apply a function to lists, vectors, data frames and returns an array or matrix object.
      • Example: sapply(sample_data, max) -> get the maximum from columns of data set sample_data
  • Function filter() + combined with operator ! -> be careful. For example, assume you have a factor variable [cat] that takes values “No”, “Yes” and some are NA. If you do dat %>% filter(cat!=“No”) it will exclude those rows where NA in [cat], but maybe you are still interested in keeping those cases even if they have [cat]=NA. In that case, do dat %>% filter(cat!=“No” | is.na(cat)).

    • Example: dat %>% filter(grupo_1!=“0” | is.na(grupo_1))
  • Use a particular function from a package. This is useful when you load two packages that have a function with the same name.

    • Example: dplyr::select() -> needs to be specified if in the same project you load package dplyr and MASS (MASS::select()). Both packages have a select() function.
  • Load built-in dataset into the environment

    • Example: data(iris)
  • Difference between paste() and paste0():

    • paste0(): allows concatenating strings but by default does not leave a blank space in between strings. Example: paste0(“ID”, seq(10))
    • paste(): allows concatenating strings and by default leaves a blank space in between strings. Example: paste(“ID”, seq(10))

3 Read data

3.1 xls/xlsx

p_load(readxl)
dat <- read_excel("diabetes.xls") # read_excel("diabetes.xlsx")
head(dat)
class(dat)
## [1] "tbl_df"     "tbl"        "data.frame"
p_load(rio)
# form rio package -> same as rio::import("diabetes.xls")
dat <- import("diabetes.xls")
class(dat)
## [1] "data.frame"
# export 
export("diabetes.xls")

3.2 Flat file databases

In flat file databases, records (rows) follow a uniform format, and there are no structures for indexing or recognizing relationships between records (rows).

util package is loaded by default when you load R and it has these functions:

  • read.table() : Main function. Reads a file in table format and creates a data frame from it. It offers many arguments to classify the incoming data.
  • read.csv() : Wrapper function for read.table(). Used to read comma-separated (CSV) files.
  • read.delim() : Wrapper Function used to read tab-separated files. read.delim() is used if the numbers in your file use periods(.) as decimals.
  • read.csv2() : read.csv() and read.csv2() are identical. The only difference is that they are set up depending on whether you use periods or commas as decimal points in numbers.
  • read.delim2() : read.delim2 is used when the numbers in your file use commas(,) as decimals.
  • rio package can also be used for importing and exporting flat files

3.3 R data files

Rdata can be used to save one or multiple objects or the entire work session.

# Saving on object in RData format
save(dat, file = "my_data.RData")
# Save multiple objects
save(data1, data2, file = "my_data.RData")
# To load the data again
load("my_data.RData")

Store the entire work session:

# export
save.image(file = "my_work_space.RData")
# import
load("my_work_space.RData")

Rds used to save one single object.

# export
saveRDS(dat, file="my_data.rds")
# import
readRDS(file="my_data.rds")

4 Basic data frame operations

From now on, we use package dplyr.

p_load(dplyr, magrittr, forcats)

Display structure of data frame:

str(dat)
## 'data.frame':    149 obs. of  11 variables:
##  $ numpacie: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ mort    : chr  "Vivo" "Vivo" "Vivo" "Vivo" ...
##  $ tempsviu: num  12.4 12.4 9.6 7.2 14.1 14.1 12.4 14.2 12.4 14.5 ...
##  $ edat    : num  44 49 49 47 43 47 50 36 50 49 ...
##  $ bmi     : num  34.2 32.6 22 37.9 42.2 33.1 36.5 38.5 41.5 34.1 ...
##  $ edatdiag: num  41 48 35 45 42 44 48 33 47 45 ...
##  $ tabac   : chr  "No fumador" "Fumador" "Fumador" "No fumador" ...
##  $ sbp     : num  132 130 108 128 142 156 140 144 134 102 ...
##  $ dbp     : num  96 72 58 76 80 94 86 88 78 68 ...
##  $ ecg     : chr  "Normal" "Normal" "Normal" "Frontera" ...
##  $ chd     : chr  "No" "No" "Si" "Si" ...

Display most relevant statistics of all variables in data frame:

summary(dat)
##     numpacie          mort              tempsviu          edat      
##  Min.   :  1.00   Length:149         Min.   : 0.00   Min.   :31.00  
##  1st Qu.: 38.00   Class :character   1st Qu.: 7.30   1st Qu.:43.00  
##  Median : 75.00   Mode  :character   Median :11.60   Median :50.00  
##  Mean   : 75.01                      Mean   :10.52   Mean   :52.17  
##  3rd Qu.:112.00                      3rd Qu.:13.90   3rd Qu.:60.00  
##  Max.   :149.00                      Max.   :16.90   Max.   :86.00  
##       bmi           edatdiag        tabac                sbp       
##  Min.   :18.20   Min.   :26.00   Length:149         Min.   : 98.0  
##  1st Qu.:26.60   1st Qu.:38.00   Class :character   1st Qu.:124.0  
##  Median :31.20   Median :45.00   Mode  :character   Median :138.0  
##  Mean   :31.78   Mean   :45.99                      Mean   :139.1  
##  3rd Qu.:35.20   3rd Qu.:53.00                      3rd Qu.:152.0  
##  Max.   :59.70   Max.   :81.00                      Max.   :222.0  
##       dbp             ecg                chd           
##  Min.   : 58.00   Length:149         Length:149        
##  1st Qu.: 74.00   Class :character   Class :character  
##  Median : 80.00   Mode  :character   Mode  :character  
##  Mean   : 90.04                                        
##  3rd Qu.: 88.00                                        
##  Max.   :862.00
# for a particular column
summary(dat$numpacie)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   38.00   75.00   75.01  112.00  149.00

Reset row index:

# useful when you have previously sliced (filtered) the data frame using dat[rows_filtered, ]
rownames(dat) <- NULL

Generate new column that takes the values of the row indices:

# option 1
dat %>% tibble::rownames_to_column(var="Old_RowIndex")
# option 2
dat %<>% mutate(Old_RowIndex=row_number())

Create empty data frame:

# data.frame(col1_name=character(n), col2_name=character(n), col3_name=character(n))
data.frame(col1=character(2), col2=integer(2), col3=numeric(2)) # n=2 -> empty data frame with two rows

5 Row-wise operations

5.1 Check for uniqueness in a column

Filter rows that appear more than once in column [numpacie]:

# option 1
# dat %>% filter(numpacie %in% names(which(table(dat$numpacie) > 1))) %>% View() -> same but it previous the operation in the View window
dat %>% filter(numpacie %in% names(which(table(dat$numpacie) > 1)))
# To verify that filter is correct:
dat %>% filter(numpacie=="135")# indeed, numpacie 135 has two records
# option 2
dat %>% arrange(numpacie) %>% group_by(numpacie) %>% filter(n()>1)

And only view data frame for columns [numpacie], [mort]:

# dat %>% filter(numpacie %in% names(which(table(dat$numpacie) > 1))) %>% select(numpacie, mort, edat, bmi) %>% View() -> same but it previous the operation in the View window
dat %>% filter(numpacie %in% names(which(table(dat$numpacie) > 1))) %>% select(numpacie, mort, edat, bmi)

5.2 Check uniqueness of two columns

Check if each [numpacie] is assigned to a unique [id]: in this example, numpacie 135 is assigned to id 134 and id 135.

dat %>% group_by(numpacie) %>% filter(n_distinct(id)>1)
# To verify that filter is correct:
dat %>% filter(numpacie=="135")# indeed, numpacie 135 is assigned to two ids

Check if each [id] is assigned to a unique [numpacie]: in this example, id 1 is assigned to numpacie 1, 2. And id 5 is assigned to numpacie 5,6.

dat %>% group_by(id) %>% filter(n_distinct(numpacie)>1)

5.3 Filter rows

5.3.1 Remove NA rows

# option 1
dat <- dat[apply(dat, 1, function(row) all(is.na(row))), ]
# option 2
dat <- dat[rowSums(is.na(dat))==ncol(dat), ]

Now remove rows where all columns are NA or where column ID has NA:

dat <- dat[ !(is.na(dat$id) | rowSums(is.na(dat))==ncol(dat)), ]

5.3.2 Filter firsts rows based on condition

Remove firsts rows until you find row that has character “PAT_ID” regardless of the column where “PAT_ID” is in.

# find row index where "PAT_ID" appears in any column
row_idx <- which(apply(dat, 1, function(row) any(row=="PAT_ID")))[1]
# remove rows before row index where "PAT_ID" appears in any column
dat <- dat[(row_idx):nrow(dat), ]

5.3.3 Filter rows that do not match condition(s)

# option 1
dat[-which(dat$mort == "Vivo" & dat$tabac == "No fumador"),]
# option 2 -> row index is the count of rows that match condition
dat %>% filter(!(mort == "Vivo" & dat$tabac == "No fumador"))

Both options are not identical because option 1 keeps original row index when slicing but option 2 resets the row index.

identical(dat %>% filter(!(mort == "Vivo" & dat$tabac == "No fumador")), dat[-which(dat$mort == "Vivo" & dat$tabac == "No fumador"),])
## [1] FALSE

5.3.4 Filter rows that match condition(s)

Match one condition:

# option 1
dat[which(dat$tabac == "No fumador"),]
# option 2
dat %>% filter(tabac == "No fumador")
# To check that filter is correct:
dat %>% filter(tabac == "No fumador") %>% nrow()# is 57
## [1] 57
table(dat$tabac)#count is 57 for 'No fumador'
## 
## Ex fumador    Fumador No fumador 
##         41         51         57

Match two conditions:

You could also add more conditions for filtering rows.

# option 1
dat[which(dat$mort == "Vivo" & dat$tabac == "No fumador"),]
# option 2
dat %>% filter(mort == "Vivo" & dat$tabac == "No fumador")

Filter rows that have “veces” or “día” in column [fruta_freq]

dat %>% filter(grepl("veces|día", fruta_freq)) %>% select(numpacie, fruta_freq)

5.3.5 Filter rows that have same value in all columns

Filter those rows that at least one of the columns have ‘No Info’:

# same as: dat %>% filter(if_any(everything(), ~. == "No Info"))
dat %>% filter(if_any(everything(), ~. %in% "No Info"))

Filter those rows that have ‘No Info’ in all columns:

# alternative: dat %>% filter(if_all(everything(), ~. == "No Info"))
dat %>% filter(if_all(everything(), ~. %in% "No Info"))

Filter those rows that have ‘No Info’ in all columns regardless of the values in columns [numpacie], [edat]. In other words, [numpacie] and [edat] can have ‘No Info’ or something else, they will not be taken into account when checking condition:

dat %>% filter(if_any(.cols=-c(numpacie, edat), ~. %in% "No Info"))

Filter rows where at least one numeric column has a value greater than 500. In this example, [dbp] has two rows with values greater than 500:

dat %>% filter(if_any(.cols=where(is.numeric),~.>500))

5.3.6 Filter rows based on a list

5.3.6.1 Match condition

Only keep rows where numpacie is in list:

dat %>% filter(numpacie %in% c("8", "19", "20"))

Do not keep rows where numpacie is in list:

dat %>% filter(!numpacie %in% c("8", "19"))

5.3.6.2 Match conditions

Do not keep rows where numpacie is in list and keep rows where age is above 40:

dat %>% filter(!numpacie %in% c("8", "19") & edat > 40)

Do not keep rows where numpacie is in list nor age is above 40:

dat %>% filter(!(edat>40) & !numpacie %in% c("8", "19"))

5.3.7 Filter row with last information in a given column

This is useful if you have a data frame in long format, that is, several rows refer to the same individual.

For each individual, keep the last row that has information in column [vivo] for each

dat1 %>% group_by(id) %>% filter(row_number() == max(which(complete.cases(vivo)))) %>% ungroup() %>% as.data.frame()

5.4 Collapse rows

Assume you want to collapse multiple rows into one single row:

data <- data.frame(
  ID = c("01", "01", "01", "02", "02"),
  X = c(NA, 1, 1, NA, NA),
  Y = c(NA, 200, NA, 1, NA),
  Z = c(NA, NA, NA, 1, 1)
)

data %>% group_by(ID) %>% summarise_all(function(x) ifelse(all(is.na(x)), NA, na.omit(x)[1]))

6 Column-wise operations

Select specific columns:

# alternative: dat %>% select(mort, bmi))
dat %>% select(c(mort, bmi))

Select first five columns:

dat %>% select(c(head(names(.),5)))

Unselect column(s):

# alternative: dat %>% select(-mort, -bmi))
dat %>% select(-c(mort, bmi))

Uselect last five columns:

dat %>% select(-c(tail(names(.),5)))

6.1 Select columns based on condition

  • Select only those columns that are numeric
  • From those numeric columns, select only the ones that have all rows with values above 500. In this case, no column has all rows with values greater than 500.
dat %>% select(where(is.numeric)) %>% select_if(~all(.>500)) %>% ncol()
## [1] 0
  • Select only those columns that are numeric
  • From those numeric columns, select only the ones that have at least one value greater than 500. In this case, only column [dbp] has some values greater than 500.
dat %>% select(where(is.numeric)) %>% select_if(~any(.>500))
# verify that filter is correct -> there are two rows with values greater than 500 in column dbp
dat %>% select(where(is.numeric)) %>% filter(dbp >500)

Select columns that contain at least one row with values “Sense especificar”, “N/A”:

dat %>% select(where(~any(. %in% c("Sense especificar", "N/A"))))

Remove columns that have NA in all rows:

dat %<>% select(-where(~all(is.na(.))))

Select/Remove columns that have a root on the name:

library(dplyr)
# Remove  variables with complete in the name
dat %>% select(-contains("complete"))
# Select  variables with complete in the name
dat %>% select(contains("complete"))

6.2 Reorder columns

In this example, we set [numpacie], [edat] as first columns and then the remaining columns.

dat %>% select(numpacie, edat, everything())

6.3 Column types

Display column class of one variable:

class(dat$mort)
## [1] "character"

Character column types of [mort], [tabac], [chd], [ecg] to be factor types:

# option 1
dat$mort <- as.factor(dat$mort)
dat$tabac <- as.factor(dat$tabac)
dat$chd <- as.factor(dat$chd)
dat$ecg <- as.factor(dat$ecg)
# option 2, instead of as.factor, we could also do as.numeric, as.character
dat %<>% mutate_at(vars(c(mort, tabac, chd, ecg)), as.factor)

Set all columns to character type:

dat %<>% mutate_all(as.character)

6.3.1 Change a column type to a specific type

Assume some columns are of type integer64 but you do not know exactly which ones are of type integer64. Replace all columns of type integer64 to nuemric type.

int64_columns <- sapply(df, function(x) any(class(x) == "integer64"))

# Use dplyr to convert the identified columns to numeric
df <- df %>%
  mutate(across(all_of(names(df)[int64_columns]), as.numeric))

6.3.2 Change column types based on condition

Change column types if column names contain string “fecha”:

var_date <- names(dat %>% select(grep("fecha",names(.),value = T)))
dat %<>%  mutate_at(all_of(var_date), list(~as.Date(., "%d-%m-%Y")),.keep = all)

Change column types if column names also start with “grupo_”:

var_fac <- names(dat %>% select(mort, starts_with("grupo_")))
dat %<>% mutate_at(all_of(var_fac), as.factor)

6.4 Column names

Display column names: - colnames(): specific function for data frames - names(): can be used for data frames, lists, etc.

# option 1, can also do: names(dat)
colnames(dat)
##  [1] "numpacie"   "mort"       "tempsviu"   "edat"       "bmi"       
##  [6] "edatdiag"   "tabac"      "sbp"        "dbp"        "ecg"       
## [11] "chd"        "id"         "fruta_freq"
# option 2
dat %>% names()
##  [1] "numpacie"   "mort"       "tempsviu"   "edat"       "bmi"       
##  [6] "edatdiag"   "tabac"      "sbp"        "dbp"        "ecg"       
## [11] "chd"        "id"         "fruta_freq"

Get column names in between two columns:

# option 1
dat %>% select(tempsviu:ecg) %>% names()
## [1] "tempsviu" "edat"     "bmi"      "edatdiag" "tabac"    "sbp"      "dbp"     
## [8] "ecg"
# option 2
names(dat %>% select(tempsviu:ecg))
## [1] "tempsviu" "edat"     "bmi"      "edatdiag" "tabac"    "sbp"      "dbp"     
## [8] "ecg"

6.4.1 Check if a string is a column name

"num" %in% colnames(dat)
## [1] FALSE
"numpacie" %in% colnames(dat)
## [1] TRUE

Check if any column name contains the string “num”:

any(grepl("num", colnames(dat)))
## [1] TRUE

Select those column names that contain “num”:

colnames(dat)[grepl("num", colnames(dat))]
## [1] "numpacie"

6.4.2 Rename a group of columns

Assume you have many variables that start with “group_”. For instance, “group_1_X_Y_X”, “group_2_X_Y_XZZ”, “group_3_X_YXY”,..

Rename columns so that they are as follows:

# option 1: dat only keeps columns that start with "grupo_" and renames them
dat %<>%
    select(starts_with("grupo_")) %>%
    setNames(sub("grupo_(\\d+)_.*", "grupo_\\1", colnames(.)))
# option 2: keeps all columns and renames columns that start with "grupo_"
grupo_columns <- colnames(dat)[startsWith(colnames(dat), "grupo_")]

# Rename the selected columns
colnames(dat)[startsWith(colnames(dat), "grupo_")] <- sub("grupo_(\\d+)_.*", "grupo_\\1", grupo_columns)

6.4.3 Modify column names

“Clean column names” so that they only contain “” characters (blank spaces replaced by ””), numbers and letters in lowercase.

p_load(janitor)
dat <- janitor::clean_names(dat)

Only rename selected column(s):

# option 1 -> by referencing column name
colnames(dat)[colnames(dat) == "numpacie"] <- "id_paciente" 

colnames(dat)[colnames(dat) %in% c("numpacie", "tempsviu")] <- c("id_paciente", "temps_viu")
# option 2 -> not recommended, "numpacie" might not always be the first column
colnames(dat)[1] <- "id_paciente" 

colnames(dat)[c(1,3)] <- c("id_paciente", "temps_viu")
# option 3
dat %>% colnames()
##  [1] "numpacie"   "mort"       "tempsviu"   "edat"       "bmi"       
##  [6] "edatdiag"   "tabac"      "sbp"        "dbp"        "ecg"       
## [11] "chd"        "id"         "fruta_freq"
dat %>% rename("id_paciente"="numpacie", "temps_viu"="tempsviu") %>% colnames()
##  [1] "id_paciente" "mort"        "temps_viu"   "edat"        "bmi"        
##  [6] "edatdiag"    "tabac"       "sbp"         "dbp"         "ecg"        
## [11] "chd"         "id"          "fruta_freq"

6.4.3.1 Based on a parameter

str_suffix = "A"
dat %>% mutate(!!glue::glue("grup_{str_suffix}") = "A")

str_suffix = "hosp"
dat %>% group_by(id) %>% dplyr::summarize(!!glue::glue("num_{str_suffix}") := n())

6.4.3.2 Row as column names

# extract first row
col_names <- as.character(dat[1,])
# remove first row from dat
dat <- dat[-1, ]
# set first row as column names
colnames(dat) <- col_names

6.4.3.3 Add prefix/suffix

For this section, you need to revise regular expressions. For basic notions, check this website.

6.4.3.3.1 Suffix

To all columns, add them a suffix:

dat %>% rename_with(~paste0(., "_dataset1"))
# based on a parameter
time_str <- "basal"
dat %>% rename_with(~paste0(., glue::glue("_{time_str}")))

To selected columns, add them a suffix:

dat %>% rename_at(vars(-numpacie), ~paste0(., "_dataset1"))
# based on a parameter
time_str <- "basal"
dat %>% rename_at(vars(-numpacie), ~paste0(., glue::glue("_{time_str}")))

Remove “_digit” where digit could be 1, 50, 100, …:

dat %>% colnames()
##  [1] "numpacie"   "mort"       "tempsviu"   "edat"       "bmi"       
##  [6] "edatdiag"   "tabac"      "sbp"        "dbp"        "ecg"       
## [11] "chd"        "id"         "fruta_freq"
# first we add suffix to all column names
dat %<>% rename_with(~paste0(., "_", seq_along(.)))
dat %>% colnames()
##  [1] "numpacie_1"    "mort_2"        "tempsviu_3"    "edat_4"       
##  [5] "bmi_5"         "edatdiag_6"    "tabac_7"       "sbp_8"        
##  [9] "dbp_9"         "ecg_10"        "chd_11"        "id_12"        
## [13] "fruta_freq_13"
p_load(stringr)# for str_remove()
dat %>% colnames()
##  [1] "numpacie_1"    "mort_2"        "tempsviu_3"    "edat_4"       
##  [5] "bmi_5"         "edatdiag_6"    "tabac_7"       "sbp_8"        
##  [9] "dbp_9"         "ecg_10"        "chd_11"        "id_12"        
## [13] "fruta_freq_13"
# then we remove "_digit"
dat %<>% rename_with(~str_remove(.,"_\\d+$"), everything())
dat %>% colnames()
##  [1] "numpacie"   "mort"       "tempsviu"   "edat"       "bmi"       
##  [6] "edatdiag"   "tabac"      "sbp"        "dbp"        "ecg"       
## [11] "chd"        "id"         "fruta_freq"
6.4.3.3.2 Prefix

Same workaround as when adding a suffix.

To all columns, add them a prefix:

dat %>% rename_with(~paste0("var_",.))

# based on a parameter
time_str <- "basal"
dat %>% rename_with(~paste0(glue::glue("_{time_str}", .)))

6.5 Column labels

Need to load Hmisc library.

Display current column labels:

p_load(Hmisc)
Hmisc::label(dat)
##   numpacie       mort   tempsviu       edat        bmi   edatdiag      tabac 
##         ""         ""         ""         ""         ""         ""         "" 
##        sbp        dbp        ecg        chd         id fruta_freq 
##         ""         ""         ""         ""         ""         ""

6.5.1 Modify column labels

Workaround is very similar to modifying the column names.

# option 1 -> by referencing column name
Hmisc::label(dat[,"numpacie"]) <- "ID Paciente" 

Hmisc::label(dat,self = F)[names(dat) %in% c("numpacie", "tempsviu")] <- c("ID Paciente", "Tiempo Vivo")
# option 2 -> not recommended, "numpacie" might not always be the first column
Hmisc::label(dat)[1] <- "ID" 

Hmisc::label(dat,self = F)[c(1,3)] <- c("ID Paciente", "Tiempo Vivo")

Modify all column labels:

# labels_cols <- sapply(dat1, Hmisc::label) if you want to have same labels as dat1 in dat
labels_cols <- c("ID", "muerte", "tiempo vivo", "edad", "BMI", "edad diag", "tabaco", "SBP", "DBP", "ECG", "CHD")
Hmisc::label(dat) <- as.list(labels_cols)

6.6 Create Variables

Assume you have a this data set:

  id, time, alive_v1, alive_v2, alive_v3,
  ID1, t0, 1, NA, NA, 
  ID1, t1, NA, 1, NA,
  ID1, t2, NA, NA, 0

You could just simply have:

  id, time, alive,
  ID1, t0, 1,
  ID1, t1, 1,
  ID1, t2, 0
dat %<>% mutate( alive = coalesce(alive_v1, alive_v2, alive_v3)) %>% dplyr::select(-c(alive_v1, alive_v2, alive_v3))

6.6.1 Factor variables

Factor variable with more than 2 levels, you should indicate the order of the levels by specifying levels=c():

# option 1
dat %<>% mutate(fruta_freq=factor(case_when(edat<30 ~ "Nunca", edat <=30 & edat < 40 ~ "A veces", 
                                    edat>=40 & edat<50 ~ "Bastantes veces", 
                                    edat>=50 & edat<60 ~ "Cada dos días", 
                                    edat >= 60 & edat < 70 ~ "Varias al día", 
                                    edat >= 70 ~ "Siempre"), levels=c("Nunca", "A veces", "Bastantes veces", "Cada dos días", "Varias al día", "Siempre")))
# option 2
dat$fruta_freq <- case_when(dat$edat<30 ~ "Nunca", dat$edat <=30 & dat$edat < 40 ~ "A veces", 
          dat$edat>=40 & dat$edat<50 ~ "Bastantes veces", dat$edat>=50 & dat$edat<60 ~ "Cada dos días", dat$edat >= 60 & dat$edat < 70 ~ "Varias al día", dat$edat >= 70 ~ "Siempre")

Based on another factor variable (reduce levels of a factor variable into a new factor variable with less number of levels):

In this example: - Set strings in [tabac] to lower case - If [tabac] starts with “no” or “ex”, level of new variable is “No” - If [tabac] starts with “fumador”, level of new variable is “Si”

p_load(stringr)
dat %<>% mutate(fumador=factor(case_when(str_detect(tolower(tabac), "^no") | str_detect(tolower(tabac), "^ex") ~ "No",
                                         str_detect(tolower(tabac), "^fumador") ~ "Si"), 
                               levels=c("No", "Si")))

6.6.2 Numeric variables

Generate variable that is the difference between two dates: - First create two Date columns - Then create new column that is the difference in days between these Date columns

dat %>% mutate(start_date=sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 149), 
               end_date=sample(seq(as.Date('2000/01/02'), as.Date('2000/12/31'), by="day"), 149), 
               duration=as.numeric(difftime(end_date, start_date)))

Assume you have 20 factor variables with levels 1/0 and they all start with “V”. You would like to count the number of 1’s that each row has for all these “V” variables. Imagine that each “V” variable refers to illness type 1, illness type 2, …, illness type 20. In this example, counting the number of 1’s for each row implies knowing how many illnesses each row (patient) has.

In the example, patient with ID1 has 5 illnesses since there are five 1’s in the binary variables that start with “V”.

# na.rm=TRUE -> do the counting even if there is a NA in one of the "V" columns
as.data.frame(matrix(sample(c(0, 1, NA), 149*20, replace = TRUE), ncol=20)) %>% mutate(id=paste0("ID", seq(1:149))) %>% select(id, everything()) %>% mutate(num_illnesses=rowSums(select(., c(starts_with("V"))), na.rm=TRUE)) %>% head()

6.7 Modify Variables

Replace values in rows that contain “Sin especificar”:

In this example, the code replaces “Sin especificar” that is present in any column of the data set

dat <- replace(dat, dat == "Sin especificar", NA)
# could also do: replace(dat, dat == "Sin especificar" | "N/A", NA)

6.7.1 Replace values in column that match condition(s)

Replace value in [edat] for cases where [numpacie] is 1, otherwise [edat] remains unchanged:

# option 1
dat$edat[dat$numpacie==1] <- 10
# option 2
## alternative: dat[which(dat$numpacie==1),]$edat
dat[which(dat$numpacie==1),"edat"] <- 10
# option 3
## before
dat %>% filter(numpacie==1) %>% select(numpacie, edat)
## after
dat %>% mutate(edat = case_when(numpacie==1 ~ 10, numpacie!=1 ~ edat))  %>% filter(numpacie==1) %>% select(numpacie, edat)

By multiple conditions:

# option 1
## before
dat$edat[dat$tabac=="No fumador" & dat$ecg=="Normal"]
##  [1] 10 47 49 54 40 38 44 45 35 44 47 78 71 66 42 67 86 42 60 81 60 45 61 82 50
## [26] 71 69 48 58 52 52 59 49 60 61 54 52 75
## after
dat$edat[dat$tabac=="No fumador" & dat$ecg=="Normal"] <- 10
dat$edat[dat$tabac=="No fumador" & dat$ecg=="Normal"]
##  [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
## [26] 10 10 10 10 10 10 10 10 10 10 10 10 10
# option 3
## before
dat %>% filter(tabac=="No fumador" & ecg=="Normal") %>% select(numpacie, edat)
## after
dat %>% mutate(edat = case_when(tabac=="No fumador" & ecg=="Normal" ~ 10, 
                                !(tabac=="No fumador" & ecg=="Normal") | (is.na(tabac) | is.na(ecg)) ~ edat))  %>% 
  filter(tabac=="No fumador" & ecg=="Normal") %>% select(numpacie, edat)

6.7.2 Forward fill

This is useful when you have clinical variables, such as gender, that does not change over time but you only have them with information for time 0. The example below replaces the NA by the the value of gender that is not NA.

Toy example:

  id, time, gender
  01, 0, male
  01, 1, NA
  01, 2, NA

Now it would be:

  id, time, gender
  01, 0, male
  01, 1, male
  01, 2, male
# option 1
dat %<>% group_by(numpacie) %>% mutate(gender=gender[!is.na(gender)]) %>% ungroup() %>% as.data.frame()
# option 2 -> useful if you want to do it for more than one variable at the same time
dat %>% group_by(numpacie) %>% fill(c("gender"), .direction = "down") %>% ungroup() %>% as.data.frame()

6.7.3 Modify factor variables

Rename levels of a factor variable. In this example, previous levels are “Vivo” and “Muerto” and they are replaced by “alive” and “dead” respectively.

table(dat$mort)
## 
##  Muerto No Info    Vivo 
##      24       5     120
dat %>%  mutate(mort = plyr::revalue(mort, c("Vivo" = "alive", "Muerto" = "dead"))) %>% pull(mort) %>% table()
## .
##   alive    dead No Info 
##     120      24       5

Collapse levels

Collapse levels above 1 to level 1

# option 1
dat$x1 <- fct_collapse(dat$x1, "1"=levels(x1)[-1])
# option 2
dat %>% mutate(x1=fct_collapse(x1, "1"=levels(x1)[-1]))

Collapse string levels

# option 1
dat$tabac <- fct_collapse(dat$tabac, "No fumador"= c("Ex fumador", "No fumador"))

## could also just simply create new variable instead of replacing original variable
dat$fumador <- fct_collapse(dat$tabac, "No fumador"= c("Ex fumador", "No fumador"))
# option 2
dat %>% mutate(tabac=fct_collapse(tabac, "No fumador"= c("Ex fumador", "No fumador")))

## could also just simply create new variable instead of replacing original variable
dat %>% mutate(fumador=fct_collapse(tabac, "No fumador"= c("Ex fumador", "No fumador")))

Change levels for all factors in a data frame

# Change all values of  factors variables to NA or the same value 

for (i in 1:ncol(dat)) {
  if (is.factor(dat[[i]])) {
    levels(dat[[i]])[levels(dat[[i]]) %in% c("Unknown")] <- NA
  }
}

6.7.4 Modify columns using regular expressions

In this example, replaces rows in column [hour] that are not digits by NA:

dat$hour <- sample(c(round(runif(20, min=0, max=23.59), 2), "N/A", "09/09/9999"), 149, replace=TRUE)

# first check if 9999 is in column
any(grepl("9999", dat$hour))
# check if string contains letters such as N, A from N/A
any(grepl("[a-zA-Z]+", dat$hour))

# now replace any values that are not digit with NA
dat$hour <- ifelse(grepl("\\d", dat$hour), dat$hour, NA)

7 Row-wise and column-wise operations

  • Select [numpacie] column and columns with column names that start with “grupo_”
  • For columns with column names that start with “grupo_”, filter rows that have values different than “0”.

For columns with column names that start with “grupo_”, filter rows that have values different than “0” in all of the columns that start with “grupo_”:

dat %>% select(numpacie, starts_with("grupo_")) %>% filter_at(vars(starts_with("grupo_")), all_vars(.!="0"))

For columns with column names that start with “grupo_”, filter rows that have at least a value different than “0” in one of the columns that start with “grupo_”:

dat %>% select(numpacie, starts_with("grupo_")) %>% filter_at(vars(starts_with("grupo_")), any_vars(.!="0"))

7.1 Group by

7.1.1 Access grouped data

dat_grups <- dat %>% group_by(id)
ls_grups <- group_split(dat_grups)
# access first group
ls_grups[[1]]

Now assume you want to access the specific group for ID “1141959”:

res <- ls_grups %>% map(~ filter(.x, id_cas == "1141959"))
res[!sapply(res, function(df) nrow(df) == 0)]

8 Compare data frames

Check which IDs in dataset 1 are not present in dataset 2, assuming that ID codification is the same in both data sets:

setdiff(dat1$id,dat2$id)
dat1 %>% filter(id %in% dat2$id) %>% nrow()

8.1 Compare count of IDs in each data frame

# Calculate counts of IDs in each dataset
counts_dat1 <- table(dat1$id_cas)
counts_dat2 <- table(dat2$id_cas)

# Find all unique IDs across both datasets
all_ids <- union(names(counts_dat1), names(counts_dat2))

# Initialize a data frame to store the results
result_df <- data.frame(ID = all_ids, Count_in_dat1 = 0, Count_in_dat2 = 0)

# Fill in the counts for each dataset
result_df$Count_in_dat1[result_df$ID %in% names(counts_dat1)] <- counts_dat1[result_df$ID %in% names(counts_dat1)]
result_df$Count_in_dat2[result_df$ID %in% names(counts_dat2)] <- counts_dat2[result_df$ID %in% names(counts_dat2)]

8.2 Check for common column names

intersect(names(df1), names(df2))

9 Join data frames

Can join data frames by using merge() or join() functions from dplyr package. In tidyverse, a “merge” is called a “join” which is the term more commonly used in relational databases.. This manual works with join():

This is are most common types of joins:

  • inner_join() corresponds to merge(X, Y) – only retain rows that match between X and Y.

  • left_join() corresponds to merge(X, Y, all.x = T) – retain all rows from X, whether or not they match in Y.

  • right_join() corresponds to merge(X, Y, all.y = T) – retain all rows from Y, whether or not they match in X.

  • full_join() corresponds to merge(X, Y, all = T) – retain all rows from both X and Y, whether or not they match.

# option 1
dat <- full_join(dat1, dat2)
# option 2 using pipe
dat <- dat1 %>% full_join(dat2)

10 Missing data

dat2 <- as.data.frame(matrix(sample(c(0, 1, NA), 149*20, replace = TRUE), ncol=20)) %>% mutate(id=paste0("ID", seq(1:149))) %>% select(id, everything())
head(dat2)

Check if data frame has at least one NA:

any(is.na(dat2))
## [1] TRUE

Count number of missings for each column:

# option 1
sapply(dat2, function(x) sum(is.na(x)))
##  id  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 
##   0  51  56  44  55  54  49  55  51  49  49  50  39  50  41  48  38  57  44  56 
## V20 
##  58
# option 2
colSums(is.na(dat2))
##  id  V1  V2  V3  V4  V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 
##   0  51  56  44  55  54  49  55  51  49  49  50  39  50  41  48  38  57  44  56 
## V20 
##  58

Count number of missings for each row:

rowSums(is.na(dat2))
##   [1]  5  2  6  2  8  6  7  6  8  5  5  7  9  3  5  8  7  9  4  3  5 11  9  4  6
##  [26] 10  4 10  6  7  8  4 10  5  8  9  3 11 10  6  5  2  2 10  7  5  8  8 13  7
##  [51]  8  7  6  6  7  4  6  4 10 11  6  7  7  6  5  6  7  5  7  7  9  3  9  6 11
##  [76]  9  6  4  4  5 12  7  6  6  8  5  4  4  9  6  6  6  5  7 11  5  9  7  7 12
## [101]  6  4  6  6  9  5  6  7  7 12  8  5  9  7  3  7  8  5  6  9  5  6  7  6  9
## [126]  7  8  5  7 10  6  7  8  6  8  5  5  9  6  9  6  5  7  7  3  4  8  7  5
# alternative -> rowSums(is.na(dat2) | dat2 == "" | dat2 == " ")

# create column that counts NAs per row
# alternative -> dat2 %>% mutate(row_count_na=rowSums(is.na(select(., -id))))
dat2 %>% mutate(row_count_na=rowSums(is.na(select(., everything()))))

11 Generate matrix

Slicing:

  • Select a row: mat[2, ] -> select row 2
  • Select a column: mat[,2] -> select second column
  • Select an element: mat[2,2] -> select element in row 2 and column 2
# option 1
matrix(c(1,1, 2,2), nrow=2, ncol=2)
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    2
as.data.frame(matrix(c(1,1, 2,2), nrow=2, ncol=2))
# option 2 -> matrix(c(col1_vector, col2_vector), 2)
matrix(c(c(1,1), c(2,2)), ncol=2)
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    2
# option 3 -> column wise: cbind(col1_vector, col2_vector)
cbind(c(1,1), c(2,2))
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    2
# option 4 -> row wise: rbind(row1_vector, row2_vector)
rbind(c(1,2), c(1,2))
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    2

12 Plots

12.1 GGPLOT

ggplot using parameters:

var_out <- "bsa"
var_comp <-  "contraste"

ggplot(dat, aes(!!sym(var_out))) + geom_bar(aes(fill = !!sym(var_comp)), position = "dodge")

Add two ggplots in same figure:

# first plot is the density function of var_num grouped by var_fact
plot1 <- ggplot(dat, aes(x = var_num, colour = var_fac)) + geom_density()

# boxplot of var_num grouped by var_fact
plot2 <- ggplot(dat %>% filter(estudi=="Normal") %>% filter(complete.cases(t1_n_siv_ec, edat_cat)), aes(x = edat_cat, y = contraste, fill = edat_cat)) + 
  stat_boxplot(geom = "errorbar",
               width = 0.25) + 
  geom_point(aes(fill = edat_cat), shape = 21,
             position = position_jitterdodge()) +
  geom_boxplot(alpha = 0.3)

library(gridExtra)
# Arrange the plots side by side
grid.arrange(plot1, plot2, ncol=2)

Add several ggplots within a loop in same figure:

# Initialize lblank list
p <- list()

# loop for plot individual graphs and asve in a list
for (i in 1:length(var_long)) {

  datc<- dat_3 %>% filter(Var == var_long[i])

  cat(" \n#####", var_long[i], " \n")

  ### GRAFIC

  p[[i]] <- ggplot(datc, aes_string(x = var_event, y = var_valor)) +

          geom_line(aes_string(group = var_id), alpha = 0.5) +

          stat_summary(aes(group = 1), geom = "line", fun.y = mean,

                       shape = 17, size = 2, col = "red") +

          theme_classic() +

          labs( title = paste("Evolución global", var_long[i]),

                x = NULL,   y = var_valor,    color = NULL)

  print(p[[i]])

  cat("\n")

}
 
# Arrange in a unique plot 
do.call(grid.arrange,p)

Export graphs to pptx

#open eoffice library
library(eoffice)

#export ggplot to pptx file, option append is possible
 
topptx(plot2, filename = file.path( "correla.pptx"))

13 Tables

13.1 Transposed descriptive table

Descriptive table similar to the one obtained with the desc_numeric() function but taking into account several numerical variables at the same time.

complete_obs <- 90
library(glmnet)

varsurv <- names(dat %>% select(rechazo_agudo:dehiscencia_de_sutura_bronquial,-retiro_de_grapas))

vardel <- NAperc(dat[,varsurv],maxNA = 100 - complete_obs)$var
var_surv <- varsurv[!varsurv %in% vardel]

vart<- names(dat %>% select(tiempo_isquemia_pulmon_derecho,tiempo_isquemia_pulmon_izquierdo,dias_uci,dias_intubacion,retiro_de_grapas,dias_ingreso_hosp,tiempo)) 
# tiempo_ecmo_pre solo tiene 1 valor mayor a 0 

vardel <- NAperc(dat[,vart],maxNA = 100 - complete_obs)$var
var_t <- vart[!vart %in% vardel]

var_surv_ord <- var_surv[order(var_surv)]

sumar <- list()
results <- list()
resul <- matrix()
for (i in seq_along(var_t)) {
  sumar[[i]] <- list()
  results[[i]] <- data.frame()
  rows <- c()
  for (j in seq_along(var_surv_ord))  {
    sumar[[i]][[j]] <- summary.quanti(data = dat,x = var_t[i], group = var_surv_ord[j], show.all = F,show.n = F,var.tidy = F, method = "non-param",
                                 nround = 1)
    #sumar[[i]][[j]]$summary$variable <- " "
    #colnames(sumar[[i]][[j]]$summary)[1]  <- c(label(dat[var_surv_ord[j]]))
    sumar[[i]][[j]] <- t(sumar[[i]][[j]]$summary[,-1])
    sumar[[i]][[j]] <- as.data.frame(sumar[[i]][[j]])
    sumar[[i]][[j]] <- tibble::rownames_to_column(sumar[[i]][[j]], " ")
    rows <- c(rows, rep(label(dat[var_surv_ord[j]]),nrow(sumar[[1]][[j]])))
    results[[i]] <- rbind(results[[i]],sumar[[i]][[j]])
  }
  
  resul <- cbind(resul,results[[i]])
}

r <- resul %>% select(-c(` `,resul))
n <- resul$` `

final_re <- cbind(n,r)
colnames(final_re)[1] <- " "

kk <- final_re %>% kable(longtable = T,escape = F, row.names = F, caption = "Summary of results for time variables",align = "c" )  %>%
  kable_styling(latex_options = c("striped","hold_position", "repeat_header"), full_width = F, fixed_thead = T) %>%
  row_spec(0, color = "white", background = "#993489") %>%
  column_spec(which(names(final_re) == " "), width_max = "9em",bold = T) %>%
  pack_rows(index  = table(rows)) %>%
  add_footnote(escape = F, notation = "symbol" )%>%
  row_spec(which(final_re$` ` == "p.value"), bold = T, align = "right")

print(kk)

14 Printing tricks

Use the knitr chunk option results = "asis" to print the html table cat(tab_model()$knitr)

More elaborated code:

pl <- list()

for (i in seq_along(var_long)) {
  cat(" \n####", var_long[i], " \n")
  
  ### GRAFIC
  print(ggplot(dat3_long, aes_string(x = var_event, y = var_long[i])) +
          geom_line(aes_string(group = var_id), alpha = 0.5) +
          stat_summary(aes(group = 1), geom = "line", fun.y = mean,
                       shape = 17, size = 3) +
          theme_bw() +
          labs( title = paste("Evolución global", var_long[i]),
                x = NULL,   y = var_long[i],    color = NULL))
  
  cat("\n")
  ### MODEL 
  
  ctrl <- lmeControl(opt = 'optim')
  frml <- as.formula(paste0(var_long[i], "~", var_event))
  
  ###############################################################    REVISAR AMB CALMA EFECTES ALEATORIS EN CADA ESTUDI ########################################
  model1 <- lme(frml, random = ~ 1 | id_sujeto, data = dat3_long,
                method = "REML", control = ctrl)
  
  model2 <- lme(frml, random = ~ Tiempo | id_sujeto, data = dat3_long,
                method = "REML", control = ctrl)
  
  if (na.omit(anova(model1, model2)$"p-value") < 0.05) {
    model <- model2
  } else {
    model <- model1
  } 
  
  
  cat(tab_model(model)$knitr,"\n--------\n") # THIS TABLE
  cat("\n")
  print(plot_model(model,colors = "#8d35cc",show.values = T) )
  cat("\n")
}

Simple code:

# need this to use tab_model()
library(sjPlot) 
library(sjmisc)
library(sjlabelled)

data(mtcars)
m.mtcars <- lm(mpg ~ cyl + hp + wt, data = mtcars)
cat(tab_model(m.mtcars)$knitr)
  mpg
Predictors Estimates CI p
(Intercept) 38.75 35.09 – 42.41 <0.001
cyl -0.94 -2.07 – 0.19 0.098
hp -0.02 -0.04 – 0.01 0.140
wt -3.17 -4.68 – -1.65 <0.001
Observations 32
R2 / R2 adjusted 0.843 / 0.826