Microcredencial “agRo-al” - Session 5

Reading external datasets

Reading external data produced by third-party resources is regular practice when analyzing small, medium, or large-scale datasets in R. Two major considerations must be taken into account:

☑ Get the information in a format that matches the structure of R objects

☑ Access the pre-formatted information from a pre-defined working directory

Regarding the second consideration, to read and write files in R we need to know our current working directory. To find the complete path to our working directory, we can use the following R function:

getwd() ## get working directory

The resulting path is where R looks for a file when only its name is given, so your data should be located there. The working directory can be changed with the function:

setwd() ## set working directory

Pass the path of your choice, between quotes, as the function argument. To centralize all the practical file exchange work from now on, we will create a folder named “agroal” in your respective OS. That is:

for Windows users

C:/Users/your_user/agroal

for Linux/Ubuntu users

/home/your_user/agroal

for Mac users

/Users/your_user/agroal

Once created, this “agroal” folder will be the local repository for downloading, loading, writing, and reading all files related to these R training sessions. Accordingly, before starting every hands-on session from now on, you must point your R environment to this working directory as follows:

setwd("C:/Users/your_user/agroal") ## For Windows users

setwd("/home/your_user/agroal") ## For Linux users

setwd("/Users/your_user/agroal") ## For Mac users
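The folder can also be created from within R. The lines below are a minimal sketch of that idea, not a required step. To keep the example harmless on any OS, it assumes a hypothetical “agroal” folder inside R's temporary directory; replace agroal_dir with the real path for your OS.

```r
## Remember the current working directory so we can return to it later
old_wd <- getwd()

## Hypothetical location used only for this demonstration
agroal_dir <- file.path(tempdir(), "agroal")

## Create the folder only if it does not exist yet
if (!dir.exists(agroal_dir)) {
  dir.create(agroal_dir, recursive = TRUE)
}

## Move into the folder and confirm where we are
setwd(agroal_dir)
basename(getwd())  ## "agroal"

## Go back to the original working directory
setwd(old_wd)
```

Using file.path() to build the path avoids typing the path separator by hand, which differs between operating systems.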

With the working directory settled, we now turn to the data file itself. R can read and load datasets stored in ASCII ¹ text. For reading external datasets, R has the read.table() function, which returns a data frame object (see Session #3). The read.table() function has several variants that differ in the default arguments used to read the external data file. Take a look in the RStudio console and explore the arguments of each of the following related functions:

read.table()

read.csv()

read.csv2()

read.delim()

read.delim2()
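To illustrate how these variants differ only in their defaults, here is a small sketch. The toy file and its contents (crop, yield) are invented for the example; the file is written to a temporary location, uses semicolons as field delimiters and commas as decimal marks.

```r
## Hypothetical toy file: semicolon-separated, comma as decimal mark
tmp <- tempfile(fileext = ".csv")
writeLines(c("crop;yield", "wheat;3,5", "maize;9,2"), tmp)

## read.csv2() already assumes header = TRUE, sep = ";" and dec = ","
a <- read.csv2(tmp)

## read.table() reads the same file once those arguments are set explicitly
b <- read.table(tmp, header = TRUE, sep = ";", dec = ",")

## Compare both data frames and inspect the structure of one of them
all.equal(a, b)
str(a)
```

In the same way, read.csv() presets a comma as delimiter and a point as decimal mark, while read.delim() and read.delim2() preset a tab as delimiter.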

GEEK NOTE: R is also able to read, for instance, Excel files. As this functionality is very useful for more advanced use of R, we will explore it later on (Session #6).

For practical reasons, we will consider read.table() the main function for reading our pre-formatted external data. It creates a data frame, and so is the main way to read data in tabular form. Of the many arguments that read.table() accepts (check them with the help command, ?read.table), the most critical ones to set are:

file: the name of the file (including path if needed) between quotes “”
header: logical value (TRUE or FALSE) indicating whether the first line of the file contains the variable names
sep: character used as field delimiter (semicolon ‘;’ | comma ‘,’ | colon ‘:’ | tab ‘\t’ | white space ‘ ’)
dec: character used for the decimal point
col.names: a character vector with the variable names (if no header is present, then by default: V1, V2, V3…)
row.names: a character vector with the names of the observations, or the name (between quotes “”) or number of the column that contains them
as.is: logical (TRUE or FALSE) argument to control the conversion of character variables into factors (if TRUE, character variables are kept as characters and not converted).
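The arguments listed above can be tried out on a small file before working with a real dataset. The sketch below assumes an invented, tab-separated toy file without a header line; the column names (food_id, food_name, energy) and the values are hypothetical.

```r
## Hypothetical toy file: tab-separated, no header line
tmp <- tempfile(fileext = ".txt")
writeLines(c("f01\tapple\t52.0", "f02\tbanana\t89.5"), tmp)

toy <- read.table(tmp,
                  header = FALSE,                 ## the file has no header line
                  sep = "\t",                     ## fields are separated by tabs
                  dec = ".",                      ## point as decimal mark
                  col.names = c("food_id", "food_name", "energy"),
                  row.names = "food_id",          ## use this column as row names
                  as.is = TRUE)                   ## keep characters as characters

rownames(toy)       ## "f01" "f02"
sapply(toy, class)  ## class of each remaining column
```

Because food_id is used for the row names, it no longer appears as a regular column in the resulting data frame.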

Before testing the read.table() function, we need to download a valid dataset to read. For practical reasons, we will download the BEDCA dataset (NutrienTrackeR library ²) and save it in our respective agroal working directory using a specific R function, as follows:

## Declare the URL of the file to download
url <- "https://github.com/agRo-al/agro-al.github.io/raw/refs/heads/main/BEDCA_dataset"

## Set the name and location where you want to save it (agroal working directory) 
file_path <- "/home/your_user/agroal/BEDCA_dataset.csv" ## Path according to your OS

## Execute the download.file() function, passing the above set arguments
download.file(url, file_path)
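The result of a download can also be checked from R itself. download.file() invisibly returns 0 on success, and the saved file can be tested with file.exists() and file.size(). The helper below, downloaded_ok(), is a hypothetical name introduced only for this sketch.

```r
## Hypothetical helper: TRUE if the file exists and is not empty
downloaded_ok <- function(path) {
  file.exists(path) && file.size(path) > 0
}

## Check the dataset by name, relative to the current working directory
downloaded_ok("BEDCA_dataset.csv")
```

A result of FALSE suggests that the file was saved somewhere else or that the download failed, in which case the path and the URL are the first things to review.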

Verify that the download was successful by checking that the file exists in the working folder. We can now call the read.table() function with the following code:

mydata <- read.table("BEDCA_dataset.csv", 
                     header = TRUE,
                     sep = "\t",
                     row.names = "food_id",
                     as.is = TRUE
                     )

Once the dataset is successfully read by R, it is time to explore some attributes of the variables contained in the data frame and of the object as a whole. You are free to use any of the functions already seen in previous sessions.

R datasets for training purposes

A library containing several datasets for training purposes is installed alongside R. This library, called datasets, comprises more than 90 different datasets containing pre-formatted information from research in the life, social, and engineering sciences. To look into the details of the datasets library (content and description), call the following function:

library(help = "datasets")

The main documentation of this library will then pop up in a new tab of the RStudio text editor panel, where you can inspect all the datasets available for use in training sessions from now on. If you search the internet and third-party training resources, you will find that two of those datasets are widely used, namely mtcars and iris. You can explore these or any other datasets stored in the datasets library by typing their respective names in the console:

mtcars

iris

You now have enough knowledge to explore the attributes of these datasets and to check the main types of variables (factor, numeric, character, etc.) they contain. You can also assess distributions and basic statistics using single variables or combinations of them (see Session #4 content). Take the next 10 minutes to play with them.
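As a starting point for this exploration, the lines below sketch a few checks on the iris dataset. The choice of functions is an assumption about what previous sessions covered; any equivalent functions you already know will do.

```r
## Overall structure: number of observations, variables and their types
str(iris)

## Class of every variable in the data frame
var_classes <- sapply(iris, class)
var_classes

## Basic statistics of a single numeric variable
summary(iris$Sepal.Length)

## Counts per level of a factor
table(iris$Species)

## A numeric variable combined with a factor: mean sepal length per species
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
mean_by_species
```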

IMPORTANT: if you do not have your own dataset for the final evaluation of this training course, the content of the datasets library is the preferred source of information to examine your acquired skills by the end of all sessions.

Writing files

Just as you can read external files and convert them into R objects, you can also write R objects to files and save them on your system. Typically, you will be interested in saving R objects such as data frames and matrices, but it is also possible to save vectors, character strings, etc. For this purpose, the write.table() function writes to a file an R object generated through different operations and processing. The main arguments that control the write.table() function are:

x: the name of the R object to be written
file: the name (and path) of the file, between quotes, as you want it to be named and stored in the OS.
quote: a logical value. When TRUE, character variables and factors are written within double quotes (“”); numeric variables are never quoted. Set it to FALSE to write all values without quotes.
sep: the field separator used in the file, stated between quotes. A regular TAB should be declared as “\t”.
row.names: logical value, indicating whether the row names should be written in the file.
col.names: logical value, indicating whether the column names should be written in the file.
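The arguments above can be seen in action with a short sketch. It assumes an invented toy data frame (crop, yield) and a hypothetical file name, toy_output.txt, saved in R's temporary directory rather than in the working folder.

```r
## Hypothetical toy data frame
toy <- data.frame(crop = c("wheat", "maize"), yield = c(3.5, 9.2))

## Hypothetical output file inside R's temporary directory
out_file <- file.path(tempdir(), "toy_output.txt")

write.table(x = toy,
            file = out_file,
            quote = FALSE,      ## do not wrap text values in quotes
            sep = "\t",         ## tab as field separator
            row.names = FALSE,  ## skip the row names
            col.names = TRUE)   ## write the column names as first line

## Inspect the raw lines that were written
readLines(out_file)
## "crop\tyield" "wheat\t3.5" "maize\t9.2"

## Read the file back and compare it with the original object
back <- read.table(out_file, header = TRUE, sep = "\t")
all.equal(toy, back)
```

Reading the file back with matching header and sep values is a quick way to confirm that what was written is what you intended.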

Putting the acquired skills into practice

Using any of the available datasets loaded into the R environment (BEDCA_dataset.csv, mtcars, or iris), create a new R object by selecting two categorical (factor) and two numeric variables from those contained in the selected dataset. Once it is created, save this new object in your working folder and name it “myoutput.csv”. Then open your OS file explorer and check whether what you saved matches what you expected.