1  Data Structure and R Programming

Data types, operators, variables

Two basic types of objects: (1) data & (2) functions

1.1 Data type

  • Boolean/Logical: Yes or No, Head or Tail, True or False

  • Integers: Whole numbers \(\mathbb{Z}\), e.g., 1, 2, 3, -1, -2, -3

  • Characters: Text strings, e.g., “Hello”, “World.”

  • Floats: Real numbers with a fractional part, e.g., 3.14, \(\pi\), \(e\). R stores them as double-precision numbers (type "double").

  • Missing data: NA in R, which stands for “Not Available.” It is used to represent missing or undefined values in a dataset.

day <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
weather <- c("Raining", "Sunny", NA, "Windy", "Snowing")
data.frame(rbind(day, weather))
             X1      X2        X3       X4      X5
day      Monday Tuesday Wednesday Thursday  Friday
weather Raining   Sunny      <NA>    Windy Snowing
  • Other more complex types
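
As a quick check of the types listed above, the following sketch inspects a few values with class() and typeof(). The values are illustrative only and are not from the original notes.

```r
# Inspect the class and the underlying storage type of a few values
class(TRUE)      # "logical"
class(2L)        # "integer"   (the L suffix requests an integer)
class(2)         # "numeric"   (stored as a double)
typeof(2)        # "double"
class("Hello")   # "character"
class(NA)        # "logical"   (a plain NA is logical by default)

# is.na() flags missing values element by element
is.na(c("Raining", "Sunny", NA))   # FALSE FALSE  TRUE
```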

1.1.1 To change data type

You may change the data type using the following functions, but some information may be lost in the conversion. Do this with caution!

x <- pi
print(x)
[1] 3.141593
x_int <- as.integer(x)
print(x_int)
[1] 3

Some of the conversion functions:

  • as.integer(): Convert to integer.
  • as.numeric(): Convert to numeric (float).
  • as.character(): Convert to character.
  • as.logical(): Convert to logical (boolean).
  • as.Date(): Convert to date.
  • as.factor(): Convert to factor (categorical variable).
  • as.list(): Convert to list.
  • as.matrix(): Convert to matrix.
  • as.data.frame(): Convert to data frame.
  • as.vector(): Convert to vector.
  • as.complex(): Convert to complex number.
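
The sketch below shows a few common ways a conversion can lose or change information. It uses only base R, and the example values are made up for illustration.

```r
# Truncation: as.integer() drops the fractional part (it does not round)
as.integer(3.9)    # 3
as.integer(-3.9)   # -3

# Parsing text: strings that are not numbers become NA (R also emits a warning,
# suppressed here to keep the output clean)
num <- suppressWarnings(as.numeric(c("1.5", "abc")))
num                # 1.5  NA

# Numbers to logical: zero is FALSE, any other number is TRUE
as.logical(c(0, 1, 2))   # FALSE  TRUE  TRUE

# Factor to number: convert to character first,
# otherwise you get the internal level codes
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1   (level codes)
as.numeric(as.character(f))   # 10 20 10
```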

1.2 Operators

  • Unary: With only one argument. E.g., -x (negation), !x (logical negation).

  • Binary: With two arguments. E.g., x + y (addition), x - y (subtraction), x * y (multiplication), x / y (division).

1.2.1 Comparison Operator

Comparing two objects. E.g., x == y (equal), x != y (not equal), x < y (less than), x > y (greater than), x <= y (less than or equal to), x >= y (greater than or equal to).
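
A short sketch of the comparison operators applied to vectors. The vectors x and y are illustrative. Comparisons work element by element and return logical values.

```r
x <- c(1, 5, 10)
y <- c(1, 7, 3)

x == y   # TRUE FALSE FALSE
x != y   # FALSE  TRUE  TRUE
x < y    # FALSE  TRUE FALSE
x >= y   # TRUE FALSE  TRUE

# A single value is recycled against every element
x > 4    # FALSE  TRUE  TRUE

# Floating-point caveat: exact equality can fail for doubles,
# so all.equal() is often safer than ==
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```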

1.2.2 Logical Operator

Logical operators are used to combine or manipulate logical values (TRUE or FALSE). E.g., x & y (logical AND), x | y (logical OR), !x (logical NOT).

Note that & and | are vectorized and operate element by element, whereas && and || are not: they expect a single logical value on each side and are intended for conditions such as those in if statements.

x <- c(TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, FALSE)
x | y  # [1]  TRUE FALSE FALSE
x || y # Error in R 4.3.0 and later (length > 1); older versions used only the first element of each
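
To make the distinction concrete, here is a minimal sketch with illustrative values. It uses & for element-wise work, any() and all() to reduce a logical vector to a single value, and && for a scalar condition that short-circuits.

```r
x <- c(TRUE, FALSE, FALSE)
y <- c(TRUE, TRUE, FALSE)

x & y    # TRUE FALSE FALSE   (element-wise AND)
any(x)   # TRUE               (is at least one element TRUE?)
all(x)   # FALSE              (are all elements TRUE?)

# && works on single values and stops as soon as the result is known,
# so the right-hand side is never evaluated when v is NULL
v <- NULL
status <- if (!is.null(v) && v[1] > 0) "positive" else "empty or non-positive"
status   # "empty or non-positive"
```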

1.3 Indexing

Indexing is a way to access or modify specific elements in a data structure. In R, indexing can be done using square brackets [] for vectors, matrices, and data frames, or the $ operator for lists and data frames. Note that indices start from 1 in R, which is different from languages such as Python, where they start from 0.
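
The following sketch illustrates the common indexing forms on a vector, a matrix, and a data frame. All object names and values are made up for illustration. Note that the first element is at position 1.

```r
v <- c(10, 20, 30, 40, 50)
v[1]         # 10            (first element)
v[c(2, 4)]   # 20 40         (several positions at once)
v[-1]        # 20 30 40 50   (a negative index drops that element)
v[v > 25]    # 30 40 50      (logical indexing)
v[2] <- 99   # modify an element in place
v            # 10 99 30 40 50

m <- matrix(1:6, nrow = 2)
m[2, 3]      # 6     (row 2, column 3)
m[, 1]       # 1 2   (entire first column)

df <- data.frame(day = c("Mon", "Tues"), temp = c(20, 30),
                 stringsAsFactors = FALSE)
df$temp        # 20 30
df[2, "day"]   # "Tues"
```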

1.4 Naming

In R, you can assign names to objects using the names() function. This is useful for making your code more readable and for accessing specific elements in a data structure.

A good practice is to use _ (underscore) to separate words in variable names, e.g., my_variable. This makes the code more readable and easier to understand.

# Assign names to a vector
temp <- c(20, 30, 27, 31, 45)
names(temp) <- c("Mon", "Tues", "Wed", "Thurs", "Fri")
print(temp)
  Mon  Tues   Wed Thurs   Fri 
   20    30    27    31    45 
rownames(temp) <- "Day1" # error: a plain vector has no dimensions, so it cannot have row names
temp_mat <- matrix(c(20, 30, 27, 31, 45), nrow = 1, ncol = 5)
colnames(temp_mat) <- c("Mon", "Tues", "Wed", "Thurs", "Fri")
rownames(temp_mat) <- "Day1" # works: a matrix has rows
print(temp_mat)
     Mon Tues Wed Thurs Fri
Day1  20   30  27    31  45
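
Once names are assigned, elements can be accessed by name rather than by position. The sketch below rebuilds the same temperatures so that it runs on its own; the dimnames argument is simply another way to set row and column names when the matrix is created.

```r
temp <- c(Mon = 20, Tues = 30, Wed = 27, Thurs = 31, Fri = 45)

temp["Wed"]              # a named element: Wed 27
temp[["Wed"]]            # 27 (the name is dropped)
names(temp)[temp > 30]   # "Thurs" "Fri"

temp_mat <- matrix(temp, nrow = 1,
                   dimnames = list("Day1", names(temp)))
temp_mat["Day1", "Fri"]  # 45
```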

1.5 Array and Matrix

One may define an array or a matrix in R using the array() or matrix() functions, respectively. An array is a multi-dimensional data structure, while a matrix is a two-dimensional array.

# Create a 1-dimensional array
array_1d <- array(1:10, dim = 10)
array_1d
 [1]  1  2  3  4  5  6  7  8  9 10
# Create a 2-dimensional array
array_2d <- array(1:12, dim = c(4, 3))
array_2d
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
# Create a 3-dimensional array
array_3d <- array(1:24, dim = c(4, 3, 2))
array_3d
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
# Create a matrix
my_matrix <- matrix(1:12, nrow = 4, ncol = 3)
my_matrix
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

Note here, the matrix is a special case of an array, where the number of dimensions is exactly 2.

is.matrix(array_2d)   # TRUE
is.matrix(my_matrix)  # TRUE

is.array(array_2d)    # TRUE
is.array(my_matrix)   # TRUE
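
As a small follow-up sketch, dim() reports the dimensions, and elements are addressed with one index per dimension. The array is re-created here so the example is self-contained.

```r
array_3d <- array(1:24, dim = c(4, 3, 2))

dim(array_3d)         # 4 3 2
array_3d[2, 3, 1]     # 10  (row 2, column 3, first layer)
array_3d[2, 3, 2]     # 22  (same position, second layer)

# Taking one layer of a 3-dimensional array gives a matrix
layer2 <- array_3d[, , 2]
is.matrix(layer2)     # TRUE
is.matrix(array_3d)   # FALSE (three dimensions, not two)

# A 1-dimensional array is an array but not a matrix
array_1d <- array(1:10, dim = 10)
is.array(array_1d)    # TRUE
is.matrix(array_1d)   # FALSE
```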

1.6 Key and Value Pair

A key-value pair is a data structure that consists of a key and its corresponding value. In R, this can be implemented using named vectors, lists, or data frames; lists and data frames are the most common choices. A value can be extracted by providing the corresponding key.

key1 <- "Tues"
value1 <- 32
key2 <- "Wed"
value2 <- 28

list_temp <- list()
list_temp[[ key1 ]] <- value1
list_temp[[ key2 ]] <- value2

print(list_temp)
$Tues
[1] 32

$Wed
[1] 28
## Now providing a key - Tues
### First way
list_temp[["Tues"]]
[1] 32
### Second way
list_temp$Tues
[1] 32
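
A few further operations on a key-value list, sketched with base R only. The list is rebuilt here so the example runs on its own, and the key "Fri" is an illustrative addition.

```r
list_temp <- list(Tues = 32, Wed = 28)

names(list_temp)              # "Tues" "Wed"   (all keys)
"Fri" %in% names(list_temp)   # FALSE          (does a key exist?)
list_temp[["Fri"]]            # NULL           (a missing key gives NULL in a list)

list_temp[["Fri"]] <- 45      # add a new pair
list_temp[["Tues"]] <- NULL   # remove a pair
names(list_temp)              # "Wed" "Fri"

# When the key is stored in a variable, use [[ ]];
# $ would look for a key literally named "key"
key <- "Wed"
list_temp[[key]]              # 28
list_temp$key                 # NULL
```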

1.7 Data Frame

A data frame is a two-dimensional, tabular data structure in R in which each column can hold a different type of variable (numeric, character, factor, etc.). It is similar to a spreadsheet or an SQL table.

iris <- datasets::iris
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
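
Besides using a built-in dataset, a data frame can be built column by column. The sketch below uses made-up weather values; stringsAsFactors = FALSE is written out explicitly so the result is the same in R versions before 4.0.

```r
weather_df <- data.frame(
  day     = c("Monday", "Tuesday", "Wednesday"),
  temp    = c(20, 30, 27),
  raining = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Each column keeps its own type
col_types <- sapply(weather_df, class)
col_types
#         day        temp     raining
# "character"   "numeric"   "logical"

dim(weather_df)    # 3 3   (rows, columns)
weather_df$temp    # 20 30 27

# Keep only the rows where temp exceeds 25
warm <- weather_df[weather_df$temp > 25, ]
warm$day           # "Tuesday" "Wednesday"
```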

1.8 Tidyverse

The tidyverse is a collection of open source packages for the R programming language introduced by Hadley Wickham and his team that “share an underlying design philosophy, grammar, and data structures” of tidy data. Characteristic features of tidyverse packages include extensive use of non-standard evaluation and encouraging piping.

## Load all tidyverse packages
library(tidyverse)

## Or load specific packages in the tidy family
library(dplyr) # Data manipulation
library(ggplot2) # Data visualization
library(readr) # Data import
library(tibble) # Tidy data frames
library(tidyr) # Data tidying
# ...

1.9 Pipe

The pipe operator |> (native since R version 4.1.0) or %>% (from the magrittr package) is a powerful tool in R that allows you to chain together multiple operations in a clear and concise way. It takes the output of the expression on its left and passes it as the first argument to the function on its right.

For example, we can write

set.seed(777)
x <- rnorm(5)

## Without using pipe
print(round(mean(x), 2))
[1] 0.37
## Using pipe
x |> 
  mean() |> # applying the mean function
  round(2) |> # round to 2 decimal places
  print()
[1] 0.37

Without the pipe, applying several functions to the same object produces nested calls that must be read from the inside out, which makes the code harder to follow and maintain. With the pipe, the operations appear in the order in which they are applied, so the code is easier to understand and modify.
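
The first-argument rule can be checked directly. This is a minimal sketch with illustrative numbers, assuming R 4.1.0 or later (needed for both |> and the \(v) shorthand for anonymous functions).

```r
x <- c(1.234, 5.678, 9.1011)

# These two calls are equivalent: the left-hand side becomes the first argument
nested <- round(x, 1)
piped  <- x |> round(1)
identical(nested, piped)   # TRUE

# When a step is not a simple function call on the first argument,
# wrap it in an anonymous function and call it
label <- x |> (\(v) paste("mean:", round(mean(v), 2)))()
label                      # "mean: 5.34"
```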

1.9.1 Some rules

|> should always have a space before it and should typically be the last thing on a line. This simplifies adding new steps, reorganizing existing ones, and modifying elements within each step.

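Note that most packages in the tidyverse family are designed to work with the pipe operator, so you can use it with any of them. The main exception is ggplot2, which adds plot layers with + rather than the pipe, although a data frame can still be piped into ggplot().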


Some of the materials are adapted from CMU Stat36-350.