7 Data in R
7.1 Data Types
R can process a wide array of data types, but a key point to understand is that, because it needs to handle different data types in different ways, it stores them differently too.
There are 5 main data types:
- doubles/numerics: standard numbers e.g. 3.14
- integers: whole numbers without decimal places e.g. 1 but not 1.0 (written as 1L to specify integer status)
- complex: numbers with an imaginary part. These you can pretty much ignore.
- logical: boolean values of TRUE and FALSE, which are encoded as 1 and 0 respectively
- character: strings of text e.g. word or this is a sentence. When specifying these in R they need to be enclosed in quotation marks like "word" or 'word'.
We can find the type of data something is stored as in R with the typeof() function, but for most purposes it is better to know the class of the data, since that is the usual way R will communicate it to you. To do this we use the class() function:
class(3.14)
## [1] "numeric"
class(1L)
## [1] "integer"
class(1+1i)
## [1] "complex"
class(TRUE)
## [1] "logical"
class('banana')
## [1] "character"
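To see where typeof() and class() agree and where they differ, the two can be compared side by side. This is a small sketch; the names values, types, and classes are made up for illustration.

```r
# A list lets us hold one value of each basic type without coercion
values <- list(3.14, 1L, 1+1i, TRUE, "banana")

# Ask for the storage type and the class of each value
types <- sapply(values, typeof)
classes <- sapply(values, class)

# Show the two answers next to each other
data.frame(typeof = types, class = classes)
```

The only row where the two answers differ is the first: 3.14 is stored as a "double" but reported as class "numeric".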
7.2 Type Coercion
Data types/classes are important because we need to handle different types of data differently. For example, we can add two numeric values together, or a numeric and an integer, but we can’t add a numeric and a character together. 10 + "apple" is nonsense, and R treats it that way. This enforced strictness is important, but it has some drawbacks to be aware of. The most important one is that all data in a single vector must be the same type. If you have a mix of values then everything will be converted to a single common type, following this order (each type is converted into the one to its right):
logical -> integer -> numeric -> complex -> character
A vector in R is essentially just an ordered list of things, with the special condition that everything in the vector must be the same basic data type. We can create a vector of values using the c() function:
my_vec <- c(2,6,3)
my_vec
## [1] 2 6 3
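One way to watch coercion happen is to combine just two types at a time and ask for the class of the result. The sketch below does this for four pairs; the name mixed_classes is made up for illustration.

```r
# Combine two different types in one vector and record the resulting class
mixed_classes <- c(
  class(c(TRUE, 1L)),     # logical + integer
  class(c(1L, 2.5)),      # integer + numeric
  class(c(2.5, 1+0i)),    # numeric + complex
  class(c(1+0i, "a"))     # complex + character
)
mixed_classes
```

In each pair the result takes the class of the second type, so nothing is lost: a logical can always be written as an integer, and anything can be written as a character string.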
Given what we’ve learned so far, what do you think the following will produce?
vec1 <- c(2,6,'3')
vec2 <- c("apple", 2.1, TRUE)
vec3 <- c(2, 2.0, 2L)
You can try to force coercion against this flow using the as.*() functions. Not everything is possible, but it is useful to remember for cases where data is read in incorrectly by R (like numerics as a character string).
as.numeric(c('0','2','4'))
## [1] 0 2 4
as.logical(1)
## [1] TRUE
as.logical(0)
## [1] FALSE
as.logical(-0.5)
## [1] TRUE
as.logical("house")
## [1] NA
As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame.
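A few more conversions are worth trying, since they show what happens when a value cannot be converted. This is only a sketch with made-up values; suppressWarnings() is used here just to keep the output quiet, as R would normally warn that NAs were introduced.

```r
# Text that looks like a number converts; text that does not becomes NA
mixed_strings <- c("1", "2.5", "three")
converted <- suppressWarnings(as.numeric(mixed_strings))
converted

# is.na() tells us which elements failed to convert
is.na(converted)

# Converting to integer drops the decimal part rather than rounding
truncated <- as.integer(3.9)
truncated

# Only a few spellings of true/false are recognised as logical values
parsed <- as.logical(c("TRUE", "T", "yes"))
parsed
```

Checking for NA values after a conversion is a quick way to find entries that were not in the format you expected.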
7.3 Data Structures
Now that we understand data types it is time to move on to the data structures that R uses to store data. The three data structures we will cover in this course are vectors, data frames, and lists. There are other data structures (like matrices and arrays) that we won’t cover, but similar principles apply.
- Vectors are a one-dimensional sequence of data elements. Every element in a vector must be the same data type or it will undergo type coercion.
- Lists are a collection of elements. Each element can be any type of R object (vector, data frame, a single value, even another list).
- Data frames are a two-dimensional table of data elements. Each column is a vector (so must be the same data type), while each row is a list (so can contain different data types).
7.3.1 Vectors
We’ve already covered how to create a basic vector, so now we will cover how to manipulate one.
The c() function can also append things to an existing vector:
ab_vector <- c('a', 'b')
ab_vector
## [1] "a" "b"
concat_example <- c(ab_vector, 'SWC')
concat_example
## [1] "a" "b" "SWC"
You can also create vectors of a series of numbers using more efficient methods. The : operator creates a vector of numbers from the first number to the second number in steps of 1.
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
1.1:9.9
## [1] 1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1
The seq() function lets you create a sequence of numbers with a specified step value:
seq(from = 1, to = 10, by = 0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [29] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1
## [43] 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5
## [57] 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9
## [71] 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3
## [85] 9.4 9.5 9.6 9.7 9.8 9.9 10.0
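There are a few other ways to build sequences that can be handy. The sketch below uses the length.out argument of seq() to ask for a fixed number of values instead of a step size, the : operator counting downwards, and seq_len(); the variable names are made up for illustration.

```r
# Five evenly spaced values between 0 and 1
quarter_steps <- seq(from = 0, to = 1, length.out = 5)
quarter_steps

# The : operator counts down when the first number is larger
countdown <- 10:1
countdown

# seq_len() gives the whole numbers from 1 up to the number supplied
seq_len(5)
```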
7.3.1.1 Vector Subsetting
To subset a vector we use what is known as square bracket notation []. The individual elements in a vector are ordered, so we can call for specific elements directly by placing the index inside [].
my_vec <- c(1,3,5,6,10)
my_vec[3]
## [1] 5
my_vec[c(2,4)]
## [1] 3 6
Instead of asking for specific elements of a vector by index you can ask R to return any values that meet specific criteria. We do this by placing a logical/boolean test in [] in place of an index.
my_vec <- 1:10
my_vec[my_vec > 8] # Return values > 8
## [1] 9 10
my_vec[my_vec %% 2 == 0] # Return even numbers only
## [1] 2 4 6 8 10
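It can help to look at what the test inside the brackets produces on its own. The sketch below, using a made-up vector called vals, shows the logical vector behind a test, the positions of its TRUE values, a test that combines two conditions, and negative indices, which drop elements instead of keeping them.

```r
vals <- seq(10, 100, by = 10)

# The test on its own gives one TRUE/FALSE per element
is_big <- vals > 80
is_big

# which() reports the positions of the TRUE values
big_positions <- which(is_big)
big_positions

# Conditions can be combined with & (and) or | (or)
middle <- vals[vals > 20 & vals < 60]
middle

# Negative indices remove the elements at those positions
last_three <- vals[-(1:7)]
last_three
```

R keeps every element whose position lines up with a TRUE, which is why the test and the vector being subset should be the same length.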
In addition to asking for elements of a vector with the square bracket notation, we can ask a few other questions about vectors:
my_vec <- seq(0, 100, 0.1)
## Find out how long the vector is
length(my_vec)
## [1] 1001
## Show only the start of a vector
head(my_vec)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5
## Show only the end of a vector
tail(my_vec)
## [1] 99.5 99.6 99.7 99.8 99.9 100.0
Finally, you can give names to elements in your vector and subset by those:
name_vec <- 5:9
names(name_vec) <- c("a", "b", "c", "d", "e")
name_vec
## a b c d e
## 5 6 7 8 9
name_vec["a"]
## a
## 5
name_vec[c("a", "b")]
## a b
## 5 6
###################
### Challenge 9 ###
###################
# Given the following lines of code:
# x <- 1:5
# names(x) <- letters[1:5]
# x
# Find at least five different commands to come up with the following subset:
# b c d
# 2 3 4
# Fictional bonus points for anyone who figures out the %in% operator!
7.4 Lists
While everything in a vector has to be the same data type, a list is a really useful data structure to know since you can fill it with anything.
list_example <- list(1, "a", TRUE, 1+4i)
list_example
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
another_list <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
another_list
## $title
## [1] "Research Bazaar"
##
## $numbers
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $data
## [1] TRUE
To subset a list we still use square bracket notation, but the syntax here can be confusing at first. Standard [] subsetting will return the specified element of a list as a list of one element rather than extracting the element itself. For example:
another_list[1]
## $title
## [1] "Research Bazaar"
To extract the actual element of the list we need to use double bracket notation [[]] instead. Alternatively, in lists with named elements like this one you can call a specific list element by name with the $ operator.
another_list[[1]]
## [1] "Research Bazaar"
another_list$title
## [1] "Research Bazaar"
Once you have extracted an element of a list with double square bracket notation you can further subset it as you normally would with single bracket notation, e.g. [[]][].
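As a sketch of how the different brackets behave, here is a small made-up list called shopping (it is not part of the workshop data). The first two lines show the difference between [] and [[]], and the last two chain an extraction with an ordinary vector subset.

```r
shopping <- list(fruit = c("apple", "pear", "plum"),
                 counts = c(3, 5, 2))

# Single brackets keep the list wrapper; double brackets remove it
class(shopping["fruit"])
class(shopping[["fruit"]])

# Extract the vector, then take its second element
second_fruit <- shopping[["fruit"]][2]
second_fruit

# The same idea with $ and a logical test
large_counts <- shopping$counts[shopping$counts > 2]
large_counts
```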
####################
### Challenge 10 ###
####################
# Using the following code:
# challenge_list <- list(words = c("alpha", "beta", "gamma"),
# numbers = 1:10,
# letter = letters)
# challenge_list
# Extract the following things:
# - The word "gamma"
# - The letters "a", "e", "i", "o", and "u"
# - The numbers less than or equal to 3
# More fictional bonus points if you use different methods!
7.5 Data Frames
Data frames are two-dimensional data structures and will probably be the most common one you use in your own analysis. Most functions for loading data into R from file (like read.csv()) will turn it into a data.frame by default.
Let’s start by making a toy dataset in your data/ directory, called feline.csv. Copy the following lines of data, open a new text file in RStudio with File > New File > Text File, paste the data, and save it to the appropriate directory.
coat,weight,likes_string
calico,2.1,TRUE
black,5.0,FALSE
tabby,3.2,TRUE
We can load this into R via the following:
cats <- read.csv(file = "data/feline.csv")
cats
## coat weight likes_string
## 1 calico 2.1 TRUE
## 2 black 5.0 FALSE
## 3 tabby 3.2 TRUE
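If you would rather create the file from within R, something like the following sketch should work. It writes the same four lines to a temporary folder rather than to data/, so nothing in your project is touched; the names csv_lines, csv_path, and cats_from_temp are made up for illustration.

```r
# The same contents as feline.csv, one string per line
csv_lines <- c("coat,weight,likes_string",
               "calico,2.1,TRUE",
               "black,5.0,FALSE",
               "tabby,3.2,TRUE")

# Write the lines to a file in R's temporary directory
csv_path <- file.path(tempdir(), "feline.csv")
writeLines(csv_lines, csv_path)

# Read the file back in as a data frame
cats_from_temp <- read.csv(file = csv_path)
dim(cats_from_temp)   # number of rows, then number of columns
```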
Each column in a data frame is a vector (same data type), and each row is a list (different data types). We can look at the structure of a data frame using the str() function.
str(cats)
## 'data.frame': 3 obs. of 3 variables:
## $ coat : Factor w/ 3 levels "black","calico",..: 2 1 3
## $ weight : num 2.1 5 3.2
## $ likes_string: logi TRUE FALSE TRUE
We can begin exploring our dataset right away, pulling out columns and rows or combinations thereof. To extract a single column from the data you use the $ operator with the syntax data_name$column_name.
cats$weight
## [1] 2.1 5.0 3.2
Since a column is a vector we can further subset this with []:
## Just the first element of the weight column
cats$weight[1]
## [1] 2.1
## Just the second element of the weight column
cats$weight[2]
## [1] 5
## Add the two previous values together
cats$weight[1] + cats$weight[2]
## [1] 7.1
If we want to subset the full cats dataset then we need to specify the elements we want to extract in two dimensions (rows and columns, in that order). This uses the square bracket syntax [row_id, column_id]. If you want to subset in one dimension only and keep everything in the other (e.g. the first row of every column) then you leave that dimension empty in the square brackets, e.g. [row_id, ]. For example:
## Extract the first row
cats[1, ]
## coat weight likes_string
## 1 calico 2.1 TRUE
## Extract the second column
cats[ , 2]
## [1] 2.1 5.0 3.2
## Extract the value for the second row in the third column
cats[2, 3]
## [1] FALSE
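The row and column positions can also be filled with logical tests or column names, just like with vectors. The sketch below builds a stand-in data frame called cats_demo directly in R (a made-up name, so it does not depend on the CSV file) and shows a few combinations.

```r
# A stand-in for the cats data, built by hand
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE))

# Rows where the weight is above 3, all columns
heavy <- cats_demo[cats_demo$weight > 3, ]
heavy

# A column selected by name instead of by position
cats_demo[ , "weight"]

# Coats of the cats that like string: a logical column picks the rows
string_fans <- cats_demo[cats_demo$likes_string, "coat"]
string_fans

# Several rows and several columns at once
cats_demo[2:3, c("coat", "weight")]
```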
To highlight the difference between vectors and lists, let’s try to add a new row of data to the cats data frame.
garfield <- c("marmalade", 20, FALSE)
garfield
## [1] "marmalade" "20" "FALSE"
If we create the new row as a vector then type coercion kicks in and we no longer have the data in the correct format! However, if we use a list:
garfield <- list("marmalade", 20, FALSE)
garfield
## [[1]]
## [1] "marmalade"
##
## [[2]]
## [1] 20
##
## [[3]]
## [1] FALSE
To add a new row to a data frame we can use the rbind() (row bind) function.
cats2 <- rbind(cats, garfield)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "marmalade"): invalid factor
## level, NA generated
But why didn’t this work?
7.6 Factors
Another important data structure is called a factor. Factors usually look like character data but are stored as integers with a look-up table. They are important for representing categorical information for statistical analysis. Let’s take a closer look at the coat column using str():
str(cats$coat)
## Factor w/ 3 levels "black","calico",..: 2 1 3
Factors make use of a look-up table to convert the numbers back to characters. In this case every 1 refers to "black", every 2 to "calico", and every 3 to "tabby". This means that you can’t add new data that doesn’t match the existing factor levels because R doesn’t understand how to handle the data. It only knows what values correspond to 1, 2, and 3. "marmalade" could be anything else! To get past this we need to tell R that we want an extra factor level called "marmalade", and we do this with the levels() command.
# Existing levels
levels(cats$coat)
## [1] "black" "calico" "tabby"
# Let's add a new level
levels(cats$coat) <- c(levels(cats$coat), "marmalade")
# Now let's see the new levels
levels(cats$coat)
## [1] "black" "calico" "tabby" "marmalade"
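To see the whole sequence in one place, here is a sketch using a stand-in data frame called cats_demo (a made-up name), with the coat column created as a factor by hand so the example behaves the same regardless of how your copy of R reads files. It shows the integer codes behind the labels, then adds the level before binding the new row.

```r
# A stand-in for the cats data with coat stored as a factor
cats_demo <- data.frame(coat = factor(c("calico", "black", "tabby")),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE))

# The integer codes stored behind the labels (levels are sorted alphabetically)
coat_codes <- as.integer(cats_demo$coat)
coat_codes

# Register the new level first...
levels(cats_demo$coat) <- c(levels(cats_demo$coat), "marmalade")

# ...then the new row can be bound without producing an NA
cats_demo2 <- rbind(cats_demo, list("marmalade", 20, FALSE))
cats_demo2
```

Using a list for the new row keeps the weight numeric and the likes_string value logical, as discussed earlier.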
While factors are essential for statistical modelling they can be a nuisance in other instances. Versions of R before 4.0.0 load all character data as factors by default when reading a file with functions like read.csv() (newer versions do not), but we can tell R not to.
####################
### Challenge 11 ###
####################
# Look through the help file for the read.csv() command to find an argument to stop character data from being loaded as factors. Hint: Characters are sometimes referred to as strings.
# Reload the cats data frame from file without factors
# Add the new row of Garfield data to the data frame
####################
### Challenge 12 ###
####################
# Create a list of length two containing a character vector for each of the sections in this part of the workshop:
# - Data types
# - Data structures
# Populate each character vector with the names of the data types and data structures we've seen so far.