There are many R functions for manipulating data, and we have already seen some examples in previous sessions (arithmetic operators). Numeric vectors can be manipulated with classical arithmetic operations such as addition, subtraction, multiplication and division. However, these examples are minimal compared with R’s great potential for executing both simple and complex mathematical operations.
For practical reasons, we cannot review all the arithmetic operations and functions available in base R. Nevertheless, we will review the most common and relevant ones for extracting useful information from our data.
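As a reminder, arithmetic in R is element-wise over whole vectors; a minimal sketch with made-up values:
## Element-wise arithmetic on a numeric vector (values are made up)
kcal <- c(120, 250, 80)
kcal * 4.184       # the operator applies to every element
kcal - 10
(kcal + 50) / 2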
Let’s load into the R environment the BEDCA dataset created and written during the last session. Before reading the file, remember to set your environment to the proper working directory, or declare the full path needed to access and read the BEDCA_dataset.csv file:
## Reading the file for practice
mydata <- read.table("BEDCA_dataset.csv", header = TRUE, sep = "\t", row.names = "food_id")
## Evaluating attributes on my object
class(mydata)
dim(mydata)
## Inspect the recorded variables for every observation in the data frame
colnames(mydata)
Example # 1
One of the variables in the dataset records the total energy content in kcal. So, let’s evaluate the attributes of this particular vector first:
## Attribute evaluation
mode(mydata$Energy_.kcal.)
length(mydata$Energy_.kcal.)
is.na(mydata$Energy_.kcal.)
In nutrition, it is sometimes useful to express the total energy content in kJ (kilojoules) to conform to the SI unit of energy. Then, assuming 1 kcal = 4.184 kJ, create a new column in the dataset with the standardized measure of energy:
mydata$Energy.kj <- mydata$Energy_.kcal. * 4.184
Example # 2
Now, imagine you’re interested in knowing the carbohydrate-to-protein ratio for every entry in the dataset:
mydata$Carb2Prot <- mydata$Sugars_total_.g. / mydata$Protein_.g.
Example # 3
If you take a look at the dataset, several variables are described as different variants of “Fatty acids”. Moreover, some variables describe “total” amounts discriminated by level of saturation. However, no single variable reports the global fatty acid content regardless of saturation or class, so we need to create a new variable with this information. This can be done in an elegant manner:
## Detecting the variables to sum up
startsWith(colnames(mydata), "Fatty_acid_total")
## Index the positions of the matching columns
mylabels <- which(startsWith(colnames(mydata), "Fatty_acid_total"))
## Executing the sum of all variables selected
mydata$Total.FA <- rowSums(mydata[, mylabels], na.rm = TRUE)
The above is just a small sample of how mathematical functions can be applied to a dataset to extract particular information and to create new variables of interest. Below is a list of other functions that may be useful for different purposes. Try them on any of the variables present in the BEDCA dataset.
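For instance, a few of the most common base R mathematical functions applied to the energy variable (a non-exhaustive sketch):
## Common base R mathematical functions
sqrt(mydata$Energy.kj)              # square root
log(mydata$Energy.kj + 1)           # natural logarithm (the offset avoids log(0))
round(mydata$Energy.kj, digits = 1) # rounding
sum(mydata$Energy.kj, na.rm = TRUE) # total sum
cumsum(na.omit(mydata$Energy.kj))   # cumulative sum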
In addition to the strictly mathematical functions above, there are several others of equal interest for manipulating your data in different ways; they relate more to element indexing and ordering:
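For instance, a brief sketch on the energy variable:
## Indexing and ordering helpers from base R
sort(mydata$Energy.kj)        # values in increasing order
order(mydata$Energy.kj)       # positions that would sort the vector
rank(mydata$Energy.kj)        # rank of each element
which.max(mydata$Energy.kj)   # position of the maximum value
rev(sort(mydata$Energy.kj))   # values in decreasing order
unique(mydata$food_group)     # distinct values of a vector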
Try the functions listed above on our BEDCA dataset to gain experience with their implementation.
Statistical analysis is probably R’s most valuable asset. In addition to the multiple functions compiled in the base installation, the enormous number of functions implemented in CRAN packages gives access to a wide array of mathematical and statistical methods of analysis, making R stand out among similar platforms for data analysis.
The graph below presents a wide selection of statistical methods implemented in base R, covering the main analyses according to your study design (normally/non-normally distributed data, categorical data, two-group or multi-group comparisons, correlation or regression).
IMPORTANT: Before reviewing the different statistical approaches and functions, it is essential to define the concept of a formula, a recurring argument requested by the statistical methods and functions implemented in R.
Formulae are key elements in R statistics, and the notation is the same for (almost) all functions implemented in R. A formula typically takes the form y ~ model, where y is the analysed response and model is a set of terms (usually, but not exclusively, categorical) for which parameters are to be estimated. These terms are separated by arithmetic symbols, each with a particular meaning. For instance:
a + b: expresses the additive effects of a and b (the effect of each variable on the outcome is independent of the other)
a : b: indicates an interactive effect between a and b (the effect of a depends on the level of b)
a * b: expresses both additive and interactive effects (equivalent to the a + b + a : b notation)
^ n: includes all interactions up to level n. For instance, (a + b + c) ^ 2 is equivalent to a + b + c + a : b + a : c + b : c.
Special cases. For analyses of variance with aov(), the formula notation accepts a particular syntax to define random effects (covariate control). For instance, y ~ a + Error(b) means the additive effects of the fixed term a and the random term b. A similar control of random effects adopts the notation y ~ a + (1 | b) in more advanced analysis models (e.g., linear mixed models).
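As a quick illustration, where y, a, b and the data frame mydf are hypothetical placeholders:
## Formula notation in practice (y, a, b and mydf are placeholders)
fit1 <- aov(y ~ a + b, data = mydf)          # additive effects only
fit2 <- aov(y ~ a * b, data = mydf)          # additive plus interactive effects
fit3 <- aov(y ~ a + Error(b), data = mydf)   # b declared as a random term
## The equivalent random effect in a linear mixed model (lme4 package):
## fit4 <- lme4::lmer(y ~ a + (1 | b), data = mydf)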
Here’s a selection of statistical functions installed with base R and related to measures of central tendency, dispersion and proportions:
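For example, a short, non-exhaustive sketch on the energy variable created earlier:
## Central tendency, dispersion and proportions with base R
mean(mydata$Energy.kj, na.rm = TRUE)      # arithmetic mean
median(mydata$Energy.kj, na.rm = TRUE)    # median
var(mydata$Energy.kj, na.rm = TRUE)       # variance
sd(mydata$Energy.kj, na.rm = TRUE)        # standard deviation
quantile(mydata$Energy.kj, na.rm = TRUE)  # quartiles (adjustable via probs =)
range(mydata$Energy.kj, na.rm = TRUE)     # minimum and maximum
prop.table(table(mydata$food_group))      # proportion of entries per food group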
BONUS TRACK: for some reason, base R does not contain a function to compute the standard error of the mean (SEM), a statistical parameter frequently used in data science. But there is no problem: as an early step in your future as an R developer, you can build any new function from scratch and apply it to your data:
sem <- function(x) {
  sd(x) / sqrt(length(x))  # standard deviation divided by the square root of n
}
Now, your environment should contain a new function, sem, that you can call as:
sem(x)
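For instance, applied to the energy variable (NAs are removed first, since the sketch above does not handle them):
## SEM of the energy content; na.omit() drops missing values beforehand
sem(na.omit(mydata$Energy.kj))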
Aside from the basic statistical functions above, there is also a large inventory of tests and analyses for contrasting hypotheses and assessing associations and correlations. Most of them return an object of a class with the same name (e.g., aov returns an object of class “aov”, glm one of class “glm”). The functions used to extract the results behave according to the class of the object they receive. These functions are called generics, and the most widely used are listed below (see the sketch after the list):
print: returns a brief summary of the outcome
summary: returns a detailed summary of the object generated
coef: returns the estimated coefficients
residuals: returns the computed residuals
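A minimal sketch of these generics in action, assuming the Energy.kj column created earlier:
## Fit a one-way ANOVA, then query the resulting "aov" object with generics
fit <- aov(Energy.kj ~ food_group, data = mydata)
class(fit)            # "aov" "lm"
print(fit)            # brief summary of the outcome
summary(fit)          # detailed ANOVA table
head(coef(fit))       # estimated coefficients
head(residuals(fit))  # computed residuals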
Now, let’s try to generate objects from the following functions and explore the results obtained. For this aim, we will create subsets of the mydata data frame (955 obs. x 52 var.) containing two and three levels of the categorical variable “food_group”, as follows:
# Subset containing two levels from "food_group"
mydatax2var <- subset(mydata, food_group == "Fats_and_oils" | food_group == "Fruits_and_fruit_products")
# Subset containing three levels from "food_group"
mydatax3var <- subset(mydata, food_group == "Fats_and_oils" | food_group == "Fruits_and_fruit_products" | food_group == "Milk_and_milk_products")
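A quick sanity check on the new subsets (the group labels come from the calls above):
## Confirm the observations retained per food group
table(mydatax2var$food_group)
table(mydatax3var$food_group)
dim(mydatax3var)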
Normality tests
# Example
shapiro.test(mydata$Energy.kj)
Two-group comparisons under normality assumption
# Example
t.test(mydatax2var$Energy.kj ~ as.factor(mydatax2var$food_group))
pairwise.t.test(mydatax3var$Energy.kj, as.factor(mydatax3var$food_group), p.adjust.method = "fdr", paired = FALSE)
Multiple-group comparisons under normality assumption
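A one-way ANOVA is the classical choice here; a minimal sketch on the three-level subset, with Tukey’s HSD as a post-hoc test:
# Example
fit.aov <- aov(Energy.kj ~ food_group, data = mydatax3var)
summary(fit.aov)   # global test across the three groups
TukeyHSD(fit.aov)  # post-hoc pairwise comparisons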
Non-parametric two-group comparisons
# Example
wilcox.test(mydatax2var$Energy.kj ~ mydatax2var$food_group)
pairwise.wilcox.test(mydatax3var$Energy.kj, mydatax3var$food_group, exact = FALSE, p.adjust.method = "fdr")
Non-parametric multiple-group comparisons
# Example
kruskal.test(mydatax3var$Energy.kj, mydatax3var$food_group)
Alternative methods to assess group differences (e.g., homogeneity of variances)
# Example: Bartlett's test of homogeneity of variances across groups
bartlett.test(Energy.kj ~ food_group, data = mydatax3var)
Methods assessing associations and relationships
# Example
cor.test(mydata$Energy.kj, mydata$Total.FA, method = "pearson")
cor.test(mydata$Energy.kj, mydata$Total.FA, method = "spearman", exact = F)
cor.test(mydata$Energy.kj, mydata$Total.FA, method = "kendall", exact = F)
Assessment of discrete and categorical variables
fisher.test(x, y): computes Fisher’s exact test of independence of rows and columns in a contingency table. If x is a 2 x 2 matrix, then y (a factor) is not needed. Additional control arguments include alternative and simulate.p.value (the latter enables a Monte Carlo simulation of the p-value). Because Fisher’s test relies on computing factorials, it may fail for large sample sizes and becomes computationally inefficient.
chisq.test(x, y): performs the chi-squared goodness-of-fit or independence test. It takes the same inputs and supports similar control arguments as fisher.test (e.g., simulate.p.value). By default, chisq.test applies Yates’s continuity correction to 2 x 2 tables to prevent overestimating statistical significance with small data (some expected cell count below 5). If the Monte Carlo simulation is activated, Yates’s correction is not applied.
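A minimal sketch with a hypothetical 2 x 2 table of counts:
## Hypothetical 2 x 2 contingency table (made-up counts)
tab <- matrix(c(12, 5, 7, 20), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))
fisher.test(tab)  # exact test of independence
chisq.test(tab)   # Yates's correction applied by default on 2 x 2 tables
chisq.test(tab, simulate.p.value = TRUE, B = 2000)  # Monte Carlo p-value, no correction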