You may change the data type using the following functions, but the chance is that some of the information will be missing. Do this with caution!
x <- piprint(x)
[1] 3.141593
x_int <-as.integer(x)print(x_int)
[1] 3
Some of the conversion functions:
as.integer(): Convert to integer.
as.numeric(): Convert to numeric (float).
as.character(): Convert to character.
as.logical(): Convert to logical (boolean).
as.Date(): Convert to date.
as.factor(): Convert to factor (categorical variable).
as.list(): Convert to list.
as.matrix(): Convert to matrix.
as.data.frame(): Convert to data frame.
as.vector(): Convert to vector.
as.complex(): Convert to complex number.
1.2 Operators
Unary: With only one argument. E.g., -x (negation), !x (logical negation).
Binary: With two arguments. E.g., x + y (addition), x - y (subtraction), x * y (multiplication), x / y (division).
1.2.1 Comparison Operator
Comparing two objects. E.g., x == y (equal), x != y (not equal), x < y (less than), x > y (greater than), x <= y (less than or equal to), x >= y (greater than or equal to).
1.2.2 Logical Operator
Logical operators are used to combine or manipulate logical values (TRUE or FALSE). E.g., x & y (logical AND), x | y (logical OR), !x (logical NOT).
We shall note that the logical operators in R are vectorized, x | y and x || y are different. The former is vectorized, while the latter is not.
x <-c(TRUE, FALSE, FALSE)y <-c(TRUE, FALSE, FALSE)x | y # [1] TRUE FALSE FALSEx || y # This will return an error
1.3 Indexing
Indexing is a way to access or modify specific elements in a data structure. In R, indexing can be done using square brackets [] for vectors and matrices, or the $ operator for data frames. Note that the index starts from 0 in R, which is different from some other programming languages like Python.
1.4 Naming
In R, you can assign names to objects using the names() function. This is useful for making your code more readable and for accessing specific elements in a data structure.
A good practice is to use _ (underscore) to separate words in variable names, e.g., my_variable. This makes the code more readable and easier to understand.
# Assign names to a vectortemp <-c(20, 30, 27, 31, 45)names(temp) <-c("Mon", "Tues", "Wed", "Thurs", "Fri")print(temp)
One may define an array or a matrix in R using the array() or matrix() functions, respectively. An array is a multi-dimensional data structure, while a matrix is a two-dimensional array.
# Create a 1-dimensional arrayarray_1d <-array(1:10, dim =10)array_1d
[1] 1 2 3 4 5 6 7 8 9 10
# Create a 2-dimensional arrayarray_2d <-array(1:12, dim =c(4, 3))array_2d
Key-Value Pair is a data structure that consists of a key and its corresponding value. In R, this can be implemented using named vectors, lists, or data frames. Usually, the most commonly used case is in the lists and data frames. The values can be extra by providing the corresonding key
## Now providing a key - Tues### First waylist_temp[["Tues"]]
[1] 32
### Second waylist_temp$Tues
[1] 32
1.7 Data Frame
Dataframe is a two-dimensional, tabular data structure in R that can hold different types of variables (numeric, character, factor, etc.) in each column. It is similar to a spreadsheet or SQL table.
The apply() function is the basic model of the family of apply functions in R, which includes specific functions like lapply(), sapply(), tapply(), mapply(), vapply(), rapply(), bapply(), eapply(), and others. These functions are used to apply a function to elements of a data structure (like a vector, list, or data frame) in a (sometimes) more efficient and concise way than using loops.
x <-cbind(x1 =3, x2 =c(4:1, 2:5))dimnames(x)[[1]] <- letters[1:8]print(x)
x1 x2
a 3 4
b 3 3
c 3 2
d 3 1
e 3 2
f 3 3
g 3 4
h 3 5
apply(x, MARGIN =2, mean) #apply the mean function to their "columns"
x1 x2
3 3
col.sums <-apply(x, MARGIN =2, sum) #apply the sum function to their "columns"row.sums <-apply(x, MARGIN =1, sum) #apply the sum function to their "rows"rbind(cbind(x, Rtot = row.sums), Ctot =c(col.sums, sum(col.sums)))
x1 x2 Rtot
a 3 4 7
b 3 3 6
c 3 2 5
d 3 1 4
e 3 2 5
f 3 3 6
g 3 4 7
h 3 5 8
Ctot 24 24 48
Some of the commonly used apply functions:
lapply: Apply a Function over a List or Vector
sapply: a user-friendly version and wrapper of lapply by default returning a vector, matrix
vapply: similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
1.9 Tidyverse
The tidyverse is a collection of open source packages for the R programming language introduced by Hadley Wickham and his team that “share an underlying design philosophy, grammar, and data structures” of tidy data. Characteristic features of tidyverse packages include extensive use of non-standard evaluation and encouraging piping.
## Load all tidyverse packageslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Or load specific packages in the tidy familylibrary(dplyr) # Data manipulationlibrary(ggplot2) # Data visualizationlibrary(readr) # Data importlibrary(tibble) # Tidy data frameslibrary(tidyr) # Data tidying# ...
1.10 Pipe
Pipe operator |> (native after R version 4.0) or %>$ (from magrittr package) is a powerful tool in R that allows you to chain together multiple operations in a clear and concise way. It takes the output of one function and passes it as the first argument to the next function.
For example, we can write
set.seed(777)x <-rnorm(5)## Without using pipeprint(round(mean(x), 2))
[1] 0.37
## Using pipex |>mean() |># applying the mean functionround(2) |>#round to 2nd decimal placeprint()
[1] 0.37
We can see that, without using the pipe, if we are applying multiple functions to the same object, we may have hard time to track. This can make the code less readable and harder to maintain. On the other hand, using pipe, we can clearly see the sequence of operations being applied to the data, making it easier to understand and modify.
1.10.1 Some rules
|> should always have a space before it and should typically be the last thing on a line. This simplifies adding new steps, reorganizing existing ones, and modifying elements within each step.
Note that all of the packages in the tidyverse family support the pipe operator (except ggplot2!), so you can use it with any of them.
1.11 Questions in class
1.11.1 Lecture 1, August 25, 2025
Q1. If I know Python already, why learn R?
Reply: My general take are 1). R is more specialized for statistical analysis and data visualization, while Python is a more general-purpose programming language. 2). R has a rich ecosystem of packages and libraries specifically designed for statistical computing, making it a popular choice among statisticians and data scientists. 3). R’s syntax and data structures are often more intuitive for statistical tasks, which can lead to faster development and easier collaboration with other statisticians. 4). Also, the tidyverse ecosystem including ggplot and others are a big plus when dealing with big dataframes. 5). They are not meant to replace each other, but work as a complement.
Q2. Why my installation of R sometimes failed on a Windows machine?
Reply: There are many reasons. One of the most common reasons is that you may need to manually add the path to the environment variable.
1.11.2 Lecture 2, August 27, 2025
Q1. What’s the difference of using apply v.s. looping in R?
Reply: The apply functions are often faster and more efficient than looping, especially for large datasets, because they have done some vectorization under the hood. Also, it has much higher readibility and better conciseness. However, depends on the task, you may want to do the benchmarking to see the performance difference.
Q2. How to use pipe with two or more variables?
Reply: There are several ways to do this.
Within the tidyverse family: One way is to use the dplyr package, which provides a set of functions that work well with the pipe operator. For example, you can use the mutate() function to create a new variable based on two existing variables. For example, you can do
library(dplyr)library(magrittr) # for %$%library(purrr) # for pmap / exec if neededmy_df <-tibble(x =1:5, y =6:10)f <-function(a, b) a +2*bmy_df %>%mutate(z =f(x, y))
# A tibble: 5 × 3
x y z
<int> <int> <dbl>
1 1 6 13
2 2 7 16
3 3 8 19
4 4 9 22
5 5 10 25
Using base R, you may do something like the following through the magrittr package’s exposition pipe %$%:
library(magrittr)# method 1my_df %$%f(x, y)
[1] 13 16 19 22 25
# or use . as a placeholder# method 2my_df %>% { f(.$x, .$y) }