Programming with R

Key Points

Analyzing Patient Data	Use `variable <- value` to assign a value to a variable in order to record it in memory. Objects are created on demand whenever a value is assigned to them. The function `dim` gives the dimensions of a data frame. Use `object[x, y]` to select a single element from a data frame. Use `from:to` to specify a sequence that includes the indices from `from` to `to`. All the indexing and subsetting that works on data frames also works on vectors. Use `#` to add comments to programs. Use `mean`, `max`, `min` and `sd` to calculate simple statistics. Use `apply` to calculate statistics across the rows or columns of a data frame. Use `plot` to create simple visualizations.
Creating Functions	Define a function using `name <- function(...args...) {...body...}`. Call a function using `name(...values...)`. R looks for variables in the current stack frame before looking for them at the top level. Use `help(thing)` to view help for something. Put comments at the beginning of functions to provide help for that function. Annotate your code! Specify default values for arguments when defining a function using `name = value` in the argument list. Arguments can be passed by matching based on name, by position, or by omitting them (in which case the default value is used).
Analyzing Multiple Data Sets	Use `for (variable in collection)` to process the elements of a collection one at a time. The body of a `for` loop is surrounded by curly braces (`{}`). Use `length(thing)` to determine the length of something that contains other values. Use `list.files(path = "path", pattern = "pattern", full.names = TRUE)` to create a list of files whose names match a pattern.
Making Choices	Save a plot in a pdf file using `pdf("name.pdf")` and stop writing to the pdf file with `dev.off()`. Use `if (condition)` to start a conditional statement, `else if (condition)` to provide additional tests, and `else` to provide a default. The bodies of conditional statements must be surrounded by curly braces `{ }`. Use `==` to test for equality. `X && Y` is only true if both X and Y are true. `X || Y` is true if either X or Y, or both, are true.
Command-Line Programs	Use `commandArgs(trailingOnly = TRUE)` to obtain a vector of the command-line arguments that a program was run with. Avoid silent failures. Use `file("stdin")` to connect to a program’s standard input. Use `cat(vec, sep = "\n")` to write the elements of `vec` to standard output, one per line.
Best Practices for Writing R Code	Start each program with a description of what it does. Then load all required packages. Consider what working directory you are in when sourcing a script. Use comments to mark off sections of code. Put function definitions at the top of your file, or in a separate file if there are many. Name and style code consistently. Break code into small, discrete pieces. Factor out common operations rather than repeating them. Keep all of the source files for a project in one directory and use relative paths to access them. Keep track of the memory used by your program. Always start with a clean environment instead of saving the workspace. Keep track of session information in your project folder. Have someone else review your code. Use version control.
Dynamic Reports with knitr	Use knitr to generate reports that combine text, code, and results. Use Markdown to format text. Put code in blocks delimited by triple back quotes followed by `{r}`.
Making Packages in R	A package is the basic unit of reusability in R. Every package must have a DESCRIPTION file and an R directory containing code. These must be created by the package author. A NAMESPACE file is needed as well, and a man directory containing documentation, but both can be autogenerated.
Introduction to RStudio	Using RStudio can make programming in R much more productive.
Addressing Data	Data in data frames can be addressed by index (subsetting), by logical vector, or by name (columns only). Use the `$` operator to address a column by name.
Reading and Writing CSV Files	Import data from a .csv file using the `read.csv(...)` function. Understand some of the key arguments available for importing the data properly, including `header`, `stringsAsFactors`, `as.is`, and `strip.white`. Write data to a new .csv file using the `write.csv(...)` function. Understand some of the key arguments available for exporting the data properly, such as `row.names`, `col.names`, and `na`.
Understanding Factors	Factors are used to represent categorical data. Factors can be ordered or unordered. Some R functions have special methods for handling factors.
Data Types and Structures	R’s basic data types are character, numeric, integer, complex, and logical. R’s basic data structures include the vector, list, matrix, data frame, and factors. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. lists, data frames). Objects may have attributes, such as name, dimension, and class.
The Call Stack	R keeps track of active function calls using a call stack composed of stack frames. Only global variables and variables in the current stack frame can be accessed directly.
Loops in R	Where possible, use vectorized operations instead of `for` loops to make code faster and more concise. Use functions such as `apply` instead of `for` loops to operate on the values in a data structure.
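
As a minimal sketch of the "Loops in R" points above, the snippet below computes the same result with a `for` loop and with a vectorized expression, then uses `apply` on a small matrix. The names `values`, `doubled_loop`, `doubled_vec`, and `m` are illustrative and do not come from the lessons.

```r
# Hypothetical data, chosen only for illustration
values <- c(1, 2, 3, 4)

# Loop version: fill a preallocated result one element at a time
doubled_loop <- numeric(length(values))
for (i in seq_along(values)) {
  doubled_loop[i] <- values[i] * 2
}

# Vectorized version: the multiplication applies to every element at once
doubled_vec <- values * 2

# apply() instead of a loop over rows: MARGIN = 1 means "by row"
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, MARGIN = 1, sum)
```

Both versions give the same vector; the vectorized form is shorter and usually faster.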
Introduction to R

Basic Operation

# this is a comment in R
Use x <- 3 to assign a value, 3, to a variable, x
R counts from 1, unlike many other programming languages (e.g., Python)
length(thing) returns the number of elements contained in the variable thing
c(value1, value2, value3) creates a vector
container[i] selects the i’th element from the variable container

List objects in the current environment: ls()

Remove an object from the current environment: rm(x)

Remove all objects from the current environment: rm(list = ls())
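
The basic operations above can be tried together in a short session. This is only a sketch: the variable names `weights` and `scratch` and their values are made up for illustration.

```r
# Assign a single value and build a vector (hypothetical values)
x <- 3
weights <- c(60, 72, 57)

n_weights <- length(weights)  # number of elements in the vector
first <- weights[1]           # R counts from 1, so this is the first element
last_two <- weights[2:3]      # from:to selects a range of indices

# Create a throwaway object, confirm it is listed, then remove it
scratch <- "temporary"
was_listed <- "scratch" %in% ls()
rm(scratch)
still_listed <- "scratch" %in% ls()
```

Running `rm(list = ls())` would clear every object in the same way, so it is left out of the sketch.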

Control Flow

Create a conditional using if, else if, and else

if(x > 0){
	print("value is positive")
} else if (x < 0){
	print("value is negative")
} else{
	print("value is neither positive nor negative")
}

Create a for loop to process elements in a collection one at a time

for (i in 1:5) {
	print(i)
}

This will print:

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Use == to test for equality
- 3 == 3 will return TRUE
- 'apple' == 'orange' will return FALSE
X & Y is TRUE if both X and Y are true
X | Y is TRUE if either X or Y, or both, are true
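
The comparison and logical operators can be sketched as follows. The single-character forms `&` and `|` work element by element on vectors, while `&&` and `||` (mentioned in the key points) expect single values and are the usual choice inside `if`. The variables `x`, `y`, and `msg` are illustrative only.

```r
# Element-wise tests on a vector (hypothetical values)
x <- c(1, 5, 10)
in_range <- x > 2 & x < 8   # TRUE only where both tests hold
outside  <- x < 2 | x > 8   # TRUE where at least one test holds

# Single-value tests, as used in an if condition
y <- 5
msg <- "y is out of range"
if (y > 2 && y < 8) {
  msg <- "y is between 2 and 8"
}

# Equality test on single values
same <- "apple" == "orange"
```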

Functions

Defining a function:

# Return TRUE if the value is greater than zero, FALSE otherwise
is_positive <- function(integer_value){
	if(integer_value > 0){
	   TRUE
	} else{
	   FALSE
	}
}

In R, the value of the last evaluated expression in a function is returned automatically

Specifying a default value for a function argument

increment_me <- function(value_to_increment, value_to_increment_by = 1){
	value_to_increment + value_to_increment_by
}

increment_me(4) will return 5

increment_me(4, 6) will return 10
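
Arguments can also be matched by name, in which case their order no longer matters. The sketch below repeats the definition of `increment_me` so that it runs on its own; the result variable names are illustrative.

```r
increment_me <- function(value_to_increment, value_to_increment_by = 1) {
  value_to_increment + value_to_increment_by
}

# Matching by position: the first value is incremented by the second
by_position <- increment_me(4, 6)

# Matching by name: the order of the arguments can be swapped
by_name <- increment_me(value_to_increment_by = 6, value_to_increment = 4)

# Omitting the second argument falls back to its default value of 1
with_default <- increment_me(4)
```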

Call a function by using function_name(function_arguments)
apply family of functions: apply(), sapply(), lapply(), and mapply()

apply(dat, MARGIN = 2, mean) will return the average (mean) of each column in dat
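
A small sketch of the apply family, assuming a made-up 2 x 3 matrix as a stand-in for `dat` (the real `dat` in the lessons is read from a file). The anonymous functions are illustrative.

```r
# Hypothetical stand-in for dat: 2 rows, 3 columns, filled column by column
dat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

col_means <- apply(dat, MARGIN = 2, mean)  # one mean per column
row_means <- apply(dat, MARGIN = 1, mean)  # one mean per row

# sapply simplifies the results to a vector; lapply always returns a list
squares_vec  <- sapply(1:3, function(n) n^2)
squares_list <- lapply(1:3, function(n) n^2)

# mapply walks over several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)
```

`MARGIN = 1` means rows and `MARGIN = 2` means columns, which is the part of `apply` that is easiest to mix up.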

Packages

Install package by using install.packages("package-name")
Update all installed packages by using update.packages(); to update a single package, reinstall it with install.packages("package-name")
Load packages by using library("package-name")
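
One common pattern, shown here only as a sketch, is to install a package when it is missing and then load it. The example uses `stats`, which ships with R, purely so that it runs without a network connection; substitute the package you actually need. The variable `pkg` is illustrative.

```r
# Name of the package to load (stats is used only because it is always present)
pkg <- "stats"

# Install the package only if it is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)
```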

Glossary

argument: A value given to a function or program when it runs. The term is often used interchangeably (and inconsistently) with parameter.
call stack: A data structure inside a running program that keeps track of active function calls. Each call’s variables are stored in a stack frame; a new stack frame is put on top of the stack for each call, and discarded when the call is finished.
comma-separated values (CSV): A common textual representation for tables in which the values in each row are separated by commas.
comment: A remark in a program that is intended to help human readers understand what is going on, but is ignored by the computer. Comments in Python, R, and the Unix shell start with a # character and run to the end of the line; comments in SQL start with --, and other languages have other conventions.
conditional statement: A statement in a program that might or might not be executed depending on whether a test is true or false.
dimensions (of an array): An array’s extent, represented as a vector. For example, an array with 5 rows and 3 columns has dimensions (5,3).
documentation: Human-language text written to explain what software does, how it works, or how to use it.
encapsulation: The practice of hiding something’s implementation details so that the rest of a program can worry about what it does rather than how it does it.
for loop: A loop that is executed once for each value in some kind of set, list, or range. See also: while loop.
function body: The statements that are executed inside a function.
function call: A use of a function in another piece of software.
function composition: The immediate application of one function to the result of another, such as f(g(x)).
index: A subscript that specifies the location of a single value in a collection, such as a single pixel in an image.
loop variable: The variable that keeps track of the progress of the loop.
notional machine: An abstraction of a computer used to think about what it can and will do.
parameter: A variable named in the function’s declaration that is used to hold a value passed into the call. The term is often used interchangeably (and inconsistently) with argument.
pipe: A connection from the output of one program to the input of another. When two or more programs are connected in this way, they are called a “pipeline”.
return statement: A statement that causes a function to stop executing and return a value to its caller immediately.
silent failure: Failing without producing any warning messages. Silent failures are hard to detect and debug.
slice: A regular subsequence of a larger sequence, such as the first five elements or every second element.
stack frame: A data structure that provides storage for a function’s local variables. Each time a function is called, a new stack frame is created and put on the top of the call stack. When the function returns, the stack frame is discarded.
standard input (stdin): A process’s default input stream. In interactive command-line applications, it is typically connected to the keyboard; in a pipe, it receives data from the standard output of the preceding process.
standard output (stdout): A process’s default output stream. In interactive command-line applications, data sent to standard output is displayed on the screen; in a pipe, it is passed to the standard input of the next process.
string: Short for “character string”, a sequence of zero or more characters.
while loop: A loop that keeps executing as long as some condition is true. See also: for loop.
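
The cheat sheet above shows a `for` loop but not a `while` loop. As a minimal sketch of the glossary definition, the loop below keeps adding successive integers until the running total reaches 10; the names `count` and `total` are illustrative.

```r
count <- 0
total <- 0

# Keep looping as long as the condition is true
while (total < 10) {
  count <- count + 1
  total <- total + count
}
```

The loop ends after four passes, once `total` is 1 + 2 + 3 + 4 = 10 and the condition `total < 10` is no longer true.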