Microcredencial “agRo-al” - Session 7


Basic arithmetic on numeric objects

R provides a wealth of functions to manipulate data, and we have already seen some examples in previous sessions (arithmetic operators). Numeric vectors can be manipulated with classical arithmetic expressions such as addition, subtraction, multiplication, and division. However, these examples are minimal compared to R’s great potential for executing simple and complex mathematical operations.

For practical reasons, we cannot review all the arithmetic operations and functions available in base R. Nevertheless, we will review the most common and useful ones for extracting relevant information from our data.

Let’s load into the R environment the BEDCA dataset we created and wrote during the last session. Before reading the file, remember to set the proper working directory or properly declare the path to access and read the BEDCA_dataset.csv file:

## Reading the file for practice
mydata <- read.table("BEDCA_dataset.csv", header = T, sep = "\t", row.names = "food_id")

## Evaluating attributes on my object
class(mydata)
dim(mydata)

## Inspect the recorded variables for every observation in the data frame
colnames(mydata)

Example # 1

One of the variables contained in the dataset records the total energy content in kcal. So, let’s evaluate the attributes of this particular vector first:

## Attribute evaluation
mode(mydata$Energy_.kcal.)

length(mydata$Energy_.kcal.)

is.na(mydata$Energy_.kcal.)

In nutrition, it is sometimes useful to express the total energy content in kJ (kilojoules) to conform to the SI unit of energy. Then, assuming 1 kcal = 4.184 kJ, create a new column in the dataset with the standardized measure of energy:

mydata$Energy.kj <- mydata$Energy_.kcal. * 4.184

Example # 2

Now, imagine you are interested in knowing the carbohydrate-to-protein ratio for every entry in the dataset:

mydata$Carb2Prot <- mydata$Sugars_total_.g. / mydata$Protein_.g.

Example # 3

If you take a look at the dataset, several variables describe different variants of “Fatty acids”. Moreover, some variables report the “total” amount broken down by the level of saturation. However, no single variable reports the overall fatty acid content, regardless of saturation or class. So, we need to create a new variable with this information. This can be done in an elegant manner:

## Detecting the variables to sum up
startsWith(colnames(mydata), "Fatty_acid_total")

## Indexing the positions in the vector of column names
mylabels <- which(startsWith(colnames(mydata), "Fatty_acid_total"))

## Executing the sum of all variables selected
mydata$Total.FA <- rowSums(mydata[,mylabels], na.rm = T)

The above is a small example of how to apply mathematical functions to our dataset to extract particular information and create new variables of interest. Below is a list of other functions that may be useful for different purposes; try them on any of the variables present in the BEDCA dataset (a short sketch follows the list).

  • sum(x): sum of the x’s elements.
  • prod(x): product of the x’s elements.
  • log(x, base): computes the logarithm of x’s elements with a given base.
  • log10(x): base 10 logarithm of the x’s elements.
  • log2(x): base 2 logarithm of the x’s elements.
  • sqrt(x): square-root of the x’s elements.
  • abs(x): computes the absolute value of x’s elements.
  • cos(x), sin(x), and tan(x): trigonometric functions for computing cosine, sine, and tangent of x’s elements.
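
For instance, a quick sketch applied to the energy column created earlier:

## Total energy across all foods (na.rm = TRUE guards against missing values)
sum(mydata$Energy.kj, na.rm = TRUE)

## Element-wise transformations of the energy vector
log10(mydata$Energy.kj)
sqrt(mydata$Energy.kj)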

Other simple functions of interest

In addition to the strictly mathematical functions mentioned above, there are several others of equal interest for manipulating your data in different ways. They relate more to element indexing when exploring data from a mathematical point of view:

  • max(x): maximum or the largest of the x’s elements.
  • min(x): minimum or the smallest of the x’s elements.
  • which.max(x): returns the index of the greatest x’s element.
  • which.min(x): returns the index of the smallest x’s element.
  • round(x, n): rounds the elements of x to n decimal places.
  • rev(x): reverses the x’s elements.
  • sort(x): sorts the x’s elements in increasing order; for decreasing order, use sort(x, decreasing = TRUE) or rev(sort(x)).
  • rank(x): ranks the numbers (in increasing order) in the x vector.
  • rank(-x): ranks the numbers (in decreasing order) in the x vector.
  • which(x == y): returns the indices of the x’s elements for which the logical comparison is TRUE; any comparison operator can be used.
  • match(x, y): returns a vector of the same length as x giving the positions of the first matches of its elements in y (NA where unmatched).
  • na.omit(x): removes the observations with missing data (NA).

Try the functions listed above on our BEDCA dataset to gain experience with their implementation. For instance:
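
## Which food shows the highest energy content, and what is that value?
which.max(mydata$Energy.kj)
max(mydata$Energy.kj, na.rm = TRUE)

## Round the carbohydrate-to-protein ratio to two decimal places
round(mydata$Carb2Prot, 2)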


Statistical functions

Statistical analysis is probably R’s most valuable asset. In addition to the many functions compiled in the base installation, the huge number of functions implemented in CRAN packages gives access to a wide array of mathematical and statistical methods of analysis, making R superior to other similar platforms for data analysis.

The graph below summarizes a wide selection of statistical methods implemented in base R, covering the possible analyses according to your study design (normally/non-normally distributed data, categorical data, two-group or multi-group comparisons, correlation or regression).

[Figure: decision chart of statistical methods in base R, organized by study design]


IMPORTANT: Before reviewing different statistical approaches and functions, it is essential to define the concept of formula, a recurrent argument requested by the statistical methods and functions implemented in R.

Formulae are key elements in R statistics. The notation used is the same for (almost) all functions implemented in R. A formula typically adopts the form y ~ model, where y is the analysed response and model is a set of terms (preferentially categorical, but not exclusively) for which some parameters are to be estimated. These terms are separated by arithmetic symbols, each with a particular meaning. For instance:

  • a + b: expresses the additive effects of a and b (the effect of each variable on the outcome is independent of the other)

  • a : b: indicates an interactive effect between the a and b variables (the effect of a depends on the level of b)

  • a * b: expresses both additive and interactive effects (equivalent to the a + b + a : b notation)

  • ^ n: includes all interactions up to level n. For instance, (a + b + c) ^ 2 is equivalent to a + b + c + a : b + a : c + b : c.

Special cases: for analyses of variance, aov(), the formula notation accepts a particular syntax to define random effects (covariate control). For instance, y ~ a + Error(b) means the additive effects of the fixed term a and the random term b. The equivalent control of random effects adopts the following notation in more advanced models (e.g., linear mixed models): y ~ a + (1 | b).
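
As a quick illustration (a minimal sketch reusing column names from our dataset), note that a formula is an R object in its own right, which can be stored and passed to any function expecting one:

## A formula is a first-class R object: build it once, reuse it later
myformula <- Energy.kj ~ food_group
class(myformula)      # "formula"
all.vars(myformula)   # variables involved: "Energy.kj" "food_group"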


Here’s a selection of statistical functions included in base R and related to measures of central tendency, dispersion, and proportions:

  • mean(x): mean or average of the x’s elements.
  • median(x): median of the x’s elements.
  • sd(x): computes the standard deviation of the x’s elements.
  • var(x): calculates the variance across the x’s elements.
  • scale(x): centers and scales the data; the center and scale arguments can be modified accordingly. In summary, it computes standard scores (z-scores).
  • quantile(x): computes the 0th, 25th, 50th, 75th, and 100th percentiles of the x elements distribution.
  • confint(x): calculates confidence intervals for the parameters of a fitted model x (e.g., an lm object).
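
For instance, applied to the energy column created earlier (na.rm = TRUE guards against missing values):

## Central tendency, dispersion, and quartiles of the energy content
mean(mydata$Energy.kj, na.rm = TRUE)
sd(mydata$Energy.kj, na.rm = TRUE)
quantile(mydata$Energy.kj, na.rm = TRUE)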

BONUS TRACK: for whatever reason, base R does not include a function to compute the standard error of the mean (SEM), a statistical parameter frequently used in data science. No problem: as an early step in your future as an R developer, you can build any function you need from scratch and apply it to your data:

## SEM: standard deviation divided by the square root of the sample size
sem <- function(x) {
  sd(x) / sqrt(length(x))
}

Now, your environment should contain a new function that you can call as:

sem(x)
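
For instance, applied to the energy column (note that sem(), as written, returns NA if x contains missing values; extending it with NA handling is left as an exercise):

sem(mydata$Energy.kj)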


Aside from the basic statistical functions stated above, there is also a large inventory of tests and analyses for contrasting hypotheses and looking for associations and correlations. Most of them return an object of the same class as the function name (e.g., aov() returns an object of class “aov”, glm() one of class “glm”). The functions we use to extract the results act specifically according to the object class. These functions are called generics, and the most widely used are:

  • print: returns a brief summary of the outcome

  • summary: returns a detailed summary of the object generated

  • coef: returns the estimated coefficients

  • residuals: returns the computed residuals

Now, let’s generate objects with the following functions and explore the results obtained. To this end, we will create subsets of data containing two and three levels of the categorical variable (the “food_group” column) from the mydata data frame (955 obs. x 52 var.) as follows:

# Subset containing two levels from "food_group"
mydatax2var <-  subset(mydata, food_group == "Fats_and_oils" | food_group == "Fruits_and_fruit_products")

# Subset containing three levels from "food_group"
mydatax3var <-  subset(mydata, food_group == "Fats_and_oils" | food_group == "Fruits_and_fruit_products" | food_group == "Milk_and_milk_products")
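
A quick sanity check (a small sketch to confirm each subset retains the expected groups):

## Count the observations per food group in each subset
table(mydatax2var$food_group)
table(mydatax3var$food_group)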

Normality tests

  • shapiro.test(x): computes the Shapiro-Wilk normality test on the x vector of elements.
  • ks.test(x, y): performs the Kolmogorov-Smirnov test comparing the x elements against the y cumulative distribution (e.g., “pnorm” to test for normality).
# Example
shapiro.test(mydata$Energy.kj)
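
A Kolmogorov-Smirnov call is analogous; note that standardizing x first (our own choice here, not required by the function) makes the comparison against the standard normal “pnorm” meaningful:

ks.test(as.numeric(scale(mydata$Energy.kj)), "pnorm")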

Two-group comparisons under normality assumption

  • t.test(x, y): assuming normally distributed x elements, it computes Student’s t-test on the x elements across a y grouping variable. Other central arguments to set up are alternative, paired, and var.equal (see the help page for more details).
  • pairwise.t.test(x, y): assuming normality of the x elements, it calculates pairwise comparisons of the x elements between all possible group-level (y) pairs. It is also important to define the p.adjust.method, paired, and alternative arguments.
# Example
t.test(mydatax2var$Energy.kj ~ as.factor(mydatax2var$food_group))

pairwise.t.test(mydatax3var$Energy.kj, as.factor(mydatax3var$food_group), p.adjust.method = "fdr", paired = F)

Multiple-group comparisons under normality assumption

  • aov(x ~ y): performs an analysis of variance of the x elements across the y grouping variable, based on a linear model. The formula (x ~ y) can adopt any optional model, including covariates (x ~ y + a * z). The data argument must be defined; see the example below.
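
For instance, a minimal sketch using the three-level subset created above:

# Example
myaov <- aov(Energy.kj ~ food_group, data = mydatax3var)
summary(myaov)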

Non-parametric two-group comparisons

  • wilcox.test(x, y): assuming a non-parametric x distribution, it computes a rank-based comparison of the x elements between y groups, also known as the Mann-Whitney test. Other arguments to define are alternative, paired (choosing between the unpaired Rank Sum and the paired Signed Rank tests), and exact (set it to FALSE if ties are present in x).
  • pairwise.wilcox.test(): when x is non-parametrically distributed, it computes the Wilcoxon test for each pairwise comparison of the x elements between all possible group-level (y) pairs. It is also important to define the p.adjust.method, paired, alternative, and exact arguments.
# Example
wilcox.test(mydatax2var$Energy.kj ~ mydatax2var$food_group)

pairwise.wilcox.test(mydatax3var$Energy.kj, mydatax3var$food_group, exact = F, p.adjust.method = "fdr")

Non-parametric multiple-group comparisons

  • kruskal.test(x, y): performs the Kruskal-Wallis rank sum test to detect differences in x across more than two groups (y). If the formula interface is used, the data argument is also required.
# Example
kruskal.test(mydatax3var$Energy.kj, mydatax3var$food_group)

Alternative methods to measure group differences

  • bartlett.test(x ~ y): computes Bartlett’s test of homogeneity of the x variances across y groups. As the x and y arguments adopt a formula form, data needs to be declared.
# Example
bartlett.test(Energy.kj ~ food_group, data = mydatax3var)

Methods assessing associations and relationships

  • cor.test(x, y): performs correlation analyses between paired samples. The alternative argument can be declared at convenience. The method argument can be “pearson” (r), “spearman” (rho), or “kendall” (tau); the last two, intended for non-parametric data, also demand the exact argument to be declared for ties control.
# Example
cor.test(mydata$Energy.kj, mydata$Total.FA, method = "pearson")

cor.test(mydata$Energy.kj, mydata$Total.FA, method = "spearman", exact = F)

cor.test(mydata$Energy.kj, mydata$Total.FA, method = "kendall", exact = F)

Assessment of discrete and categorical variables

  • fisher.test(x, y): computes Fisher’s exact test of independence between rows and columns in a contingency table. If x is already a contingency table (e.g., a 2 x 2 matrix), then y (a factor) is not needed. Additional arguments to control include alternative and simulate.p.value (to activate Monte Carlo simulation). As Fisher’s test works by computing factorials, it may not behave well for large sample sizes and is computationally less efficient.

  • chisq.test(x, y): performs Pearson’s chi-squared test (goodness-of-fit or independence). It uses the same inputs and controls similar arguments to fisher.test (e.g., simulate.p.value). By default, chisq.test applies Yates’s continuity correction to 2 x 2 tables to prevent overestimating statistical significance with small counts. If Monte Carlo simulation is activated, Yates’s correction is omitted.
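
As an illustrative sketch (the high/low energy split below is arbitrary, invented only to build a contingency table from our data):

# Example
## Dichotomize energy content at the median and cross it with food group
energy_level <- ifelse(mydatax2var$Energy.kj > median(mydatax2var$Energy.kj, na.rm = TRUE), "high", "low")
mytable <- table(factor(mydatax2var$food_group), energy_level)  # factor() drops unused levels
fisher.test(mytable)
chisq.test(mytable)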