# R Lab
# R is an object-oriented statistical package! For you, that means traditional
# point-and-click forms of analysis are gone. In R, instead, we ``assign'' things
# to ``objects'' and then manipulate those objects to perform analysis. The way
# this looks is often much more like actual computer code than anything else. Don't
# worry if it doesn't come to you right away. It involves a shift in the way that
# you think about how to do analysis. But I promise that it gets better.
# The R GUI is made up of two windows: the R Console and the R Editor. The R Console
# is where all of the analysis happens. The R Editor is where you type the commands
# that you want R to run into your R script. Save your R script often, so that you
# can come back at any time and get to where you were. You will save your R Console
# far less often. We run commands from the Editor in the Console with CTRL + R or
# right-click, ``Run line or selection.'' (On Mac: Command + Enter.)
# Comments in R are written with the # sign: R ignores everything on a line from
# this symbol onward.
# If you ever see a command you don't know, type ? followed by its name in the R
# Console (for example, ?mean).
# Since R is object-oriented, the most important thing we're going to do is
# assign things to objects. This involves the assignment arrow, <-, which can
# also be replaced by an equal sign, =. The thing on the left of the assignment
# arrow is the object. The thing on the right is what we're assigning! It can be
# almost anything: a single number, a piece of text, or a ``vector'' (a group of
# numbers, just like a column of numbers in a dataset).
# Assigning values to objects (we can name these objects anything we want)
a = 2
b <- 3
b - a
c <- b - a
d = 1:2
d
c + d
c*d
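# A quick sketch of why c + d returns two numbers: R applies arithmetic to every
# element of a vector, reusing the single number for each one. The object names
# below (single_demo, pair_demo) are made up for this illustration.
single_demo <- 1
pair_demo <- 1:2
single_demo + pair_demo   # adds 1 to each element: 2 3
single_demo * pair_demo   # multiplies each element by 1: 1 2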
f = "I'm so confused"
# To see what is in objects, you can either ``print'' or just type the object name.
print(f)
f
print("It will be ok")
# Since we're interested in data analysis, that is, finding numeric patterns in
# groups of data, we'll be working with vectors most of the time. The basic function
# here is ``c()'', which is short for concatenate.
# This is an example of creating a new object called ``vector'' with the numbers
# 1 to 5.
vector <- c(1,2,3,4,5)
# We can also use c() to join together two objects. Recall our object ``a'' from earlier
# (it's the number 2). What do you think the following line will do? Were you right?
vector2 <- c(a, vector)
vector2
# If we wanted a vector with more than five numbers (say 50), we might not want to type
# all 50 numbers. The colon : means ``from ... to''. So 1:5 means 1, 2, 3, 4, 5.
# Let's see this in action!
quicker <- c(1:10)
# Another useful command is seq, which is short for sequence! It creates a vector of numbers
# starting with one number and ending in another, in steps of whatever size ``by'' says.
quicker2 = seq(1,10, by=0.5)
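# seq() can also be told how many numbers to produce instead of how big each step
# should be. A small sketch (the object name evenly_spaced is made up for this example):
evenly_spaced <- seq(0, 1, length.out = 5)
evenly_spaced   # 0.00 0.25 0.50 0.75 1.00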
# What if we forgot what ``seq'' did, as a command?
?seq
# This semester is going to be focused on learning a variety of R commands that enable
# you to do data analysis. For each command, we'll learn what the statistics mean in
# class, then how to perform them in R.
# Calling out elements of vectors. When you see square brackets, you are calling
# out that ``element'' of that object. So we're calling out the 3rd element of
# the object ``vector''
vector[3]
# R is actually a lot smarter than you might think. You can ask to see a set of elements
# using the : method from earlier
quicker[3:8]
# Or maybe you want to see only the elements that fit a criterion?
# How about all the elements of ``quicker'' that are greater than 5
quicker[quicker>5]
# Or the elements of vector2 that are exactly equal to 1?
vector2[vector2 == 1]
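# What is happening inside the square brackets? The comparison itself returns a
# vector of TRUE/FALSE values, and R keeps only the elements marked TRUE. A sketch
# with a made-up object (scores_demo), not part of the lab data:
scores_demo <- c(4, 9, 2, 7)
scores_demo > 5                # FALSE TRUE FALSE TRUE
scores_demo[scores_demo > 5]   # 9 7
which(scores_demo > 5)         # positions of the matches: 2 4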
# So far we've asked R to do boring things. It can also do fun things!
# Generate a random normal number!
rnorm(1)
# Generate a 0 or a 1 (heads or tails!)
rbinom(1, 1, 0.5)
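# Random numbers change every time you run these lines. If you want the same
# ``random'' draws each time (so that someone else can reproduce your work), set a
# seed first. A sketch; the seed value 123 and the object names are arbitrary choices.
set.seed(123)
flips_first <- rbinom(10, 1, 0.5)   # ten coin flips
set.seed(123)
flips_again <- rbinom(10, 1, 0.5)   # the same ten flips, because the seed matches
identical(flips_first, flips_again) # TRUE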
# Calculate the odds of winning the Powerball!
(5/69)*(4/68)*(3/67)*(2/66)*(1/65)*(1/26)
# Published odds: 1 in 292,201,338
1/292201338
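# One way to check the published number is choose(), which counts the ways to pick
# 5 white balls out of 69; multiplying by the 26 possible red balls gives the total
# number of tickets. A sketch (ways_demo is a made-up name):
ways_demo <- choose(69, 5) * 26
ways_demo       # 292201338
1 / ways_demo   # same probability as the long multiplication above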
# Calculate your class grade without having to do it a bunch of times!
# (Object assignment is fun!)
Participation <- 90
Midterm <- 95
Research.Design <- 95
Annotated.Bibliography <- 85
Presentation <- 100
final <- Participation*0.20 + Midterm*0.30 + Research.Design*0.20 + Annotated.Bibliography*0.20 + Presentation*0.10
final
print(final)
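# The same grade can be computed from two vectors, which makes it easy to check
# that the weights add up to 1. A sketch using the same scores and weights as above
# (grade_scores and grade_weights are made-up names):
grade_scores <- c(90, 95, 95, 85, 100)
grade_weights <- c(0.20, 0.30, 0.20, 0.20, 0.10)
sum(grade_weights)                          # 1
weighted.mean(grade_scores, grade_weights)  # 92.5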
# Like we said, though, you're going to be using R for analysis. So we're especially
# interested in those things that give us descriptions of our data. The following
# are some useful descriptive statistics functions. In R, things with parentheses
# wrapped around them are FUNCTIONS. You've already learned a few functions: c(),
# seq(), and print() are all functions. Functions are the lifeblood of data analysis
# in R. We apply functions to objects in order to perform analysis.
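# Two notes to keep in mind:
# (1) Comment your code! Anything after a # is ignored by R, so use it to leave notes.
# (2) R is case-sensitive: ``Quicker'' and ``quicker'' are different objects.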
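# A short sketch of what case-sensitivity means in practice. The name Turnout_demo
# is made up, and the second check assumes you have not created a lower-case
# turnout_demo yourself.
Turnout_demo <- 55
exists("Turnout_demo")   # TRUE
exists("turnout_demo")   # FALSE: the lower-case t makes it a different name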
quicker <- c(1:10)
# We can do a variety of univariate (single-variable) statistical analyses in R.
summary(quicker)
length(quicker)
sum(quicker)
mean(quicker)
var(quicker)
max(quicker)
min(quicker)
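# Two more descriptive functions that work the same way: sd() for the standard
# deviation and median() for the middle value. A sketch on a made-up object
# (spread_demo) holding the numbers 1 to 10:
spread_demo <- 1:10
sd(spread_demo)          # the standard deviation
sqrt(var(spread_demo))   # the same number: sd is the square root of the variance
median(spread_demo)      # 5.5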
# What happens with missing data? It can often break simple functions. The problem
# is that real data have missing values all of the time. Notice this.
quickerNA <- c(NA, quicker)
mean(quickerNA)
# But notice also the solution!
mean(quickerNA, na.rm = TRUE)
mean(na.omit(quickerNA))
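# Before dropping missing values, it helps to know how many there are. is.na()
# marks each missing element, and sum() counts the TRUEs. A sketch with made-up
# numbers (gaps_demo):
gaps_demo <- c(NA, 3, 5, NA, 8)
is.na(gaps_demo)                # TRUE FALSE FALSE TRUE FALSE
sum(is.na(gaps_demo))           # 2 missing values
mean(gaps_demo, na.rm = TRUE)   # mean of 3, 5, and 8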
# Vectors are single columns of data. Matrices are columns stacked next to each other.
# This is important because most of our datasets are laid out this way (rows and columns).
matrix1 <- matrix(c(1,2,3,4,5,6))
mat <- matrix(c(1,2,3,4,5,6), nrow=2, ncol=3, byrow=TRUE)
mat
# You'll notice that you're starting to see things to the right of a comma here.
# Those are optional arguments to functions. If we ever forgot what they are, how
# would we look them up?
?matrix
mat1 <- matrix(1:9, nrow=3, byrow=TRUE)
mat1
quicker <- c(quicker, quicker)
# We can name rows and columns with functions, too. Best rule of thumb: if you don't
# remember which function does what you need, Google it!
rownames(mat1) <- c("One","Two","Three")
colnames(mat1) <- c("A","B","C")
mat1
# Doing it all in one step. Note: there are many ways to get to the same thing
# in R. Take the way that makes you feel the most comfortable. There aren't extra
# points for doing it in smaller steps.
mat2 <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3, byrow = TRUE,
               dimnames = list(c("row1", "row2"),
                               c("col1", "col2", "col3")))
mat2
# Remember when we called out elements from vectors? We can do that with matrices, too!
# Always remember that R thinks of things in [rows,columns]. So when we ask for
# [1,1], we're asking for the element in the first row, first column. [2,2] is second
# row, second column. If we don't supply a number, like [,1], it thinks of that
# as all of the rows in the first column! Let's look.
mat1[1,1]
mat1[1,]
mat1[,1]
# Why would this break?
mat1[26,]
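# One way to avoid asking for a row that isn't there is to check the size of the
# matrix first with dim(). The sketch below also uses tryCatch() to capture the
# error message instead of stopping the script. grid_demo and oops_demo are
# made-up names for this illustration.
grid_demo <- matrix(1:9, nrow = 3, byrow = TRUE)
dim(grid_demo)    # 3 rows, 3 columns, so there is no row 26
oops_demo <- tryCatch(grid_demo[26, ], error = function(e) conditionMessage(e))
oops_demo         # the error message R would have printed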
# Through the semester, you're almost entirely going to be working with datasets.
# The easiest way to get your data in is when it is saved as a .csv file. This is a
# little-known option in Excel, but it standardizes how programs ``think'' about data.
# Let's create a little dataset, save it, and then read it back into R!
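# Fill in the values we collect in class between the parentheses, one per person
height <- c() # height in inches
male <- c() # gender dummy (1 = male, 0 = female)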
cbind(height, male)
# ``Frame'' the data together into a dataset (a data frame)
dataset <- data.frame(male, height)
# Demonstrate reading variables in datasets
dataset$height
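# Save the dataset as a .csv file (row.names = FALSE keeps R from adding an extra
# column of row numbers to the file)
write.csv(dataset, "/Users/scj0014/Desktop/heightdata.csv", row.names = FALSE)
# READING IN DATA. On Windows, the backslashes in a file path have to be reversed into
# forward slashes. Since it is a .csv file, we use the command read.csv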
heightdata <- read.csv("/Users/scj0014/Desktop/heightdata.csv")
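# If you want to try the save-and-read round trip without touching your Desktop,
# here is a sketch that writes to a temporary file instead. The four heights are
# hypothetical values invented for this illustration (not class data), and all of
# the demo_ names are made up.
demo_data <- data.frame(male = c(1, 0, 1, 0), height = c(70, 64, 72, 66))
demo_path <- tempfile(fileext = ".csv")             # a throwaway file location
write.csv(demo_data, demo_path, row.names = FALSE)  # save without row numbers
demo_back <- read.csv(demo_path)                    # read the file back in
names(demo_back)                                    # "male" "height"
demo_back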
# Now all of our data are in a single dataset object, ``heightdata''
# We can look at the whole dataset
heightdata
print(heightdata)
# Summary statistics on an entire dataset? Good place to start!
summary(heightdata)
# A useful command to view the first couple of rows (to get a sense)
head(heightdata)
# What if we want to look at the first row of heightdata? Any thoughts?
heightdata[1,]
# This dataset has variables in it. How many? How many cases? What are their names?
names(heightdata)
dim(heightdata)
# How does R think about variables inside of datasets (or more generally, objects
# ``indexed'' within other objects)?
height
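# Can't just look at it: the variable lives inside the dataset. (If ``height'' printed
# just now, that is only because we made a separate object with that name earlier.)
# The dollar sign $ is what R uses to get to objects within objects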
heightdata$height
# We're going to create objects for each of the variables we're interested in!
tallness <- heightdata$height
man <- heightdata$male
# Each of the things we did before can be done on datasets, too!
# This is where we'll start actually ``analyzing'' data
# How tall are the guys alone?
tallness[man == 1]
tallness[man == 0]
tallness[8]
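# Once we can pull out each group, we can summarize each group. A sketch with
# hypothetical heights (inches_demo and male_demo are invented for this example,
# not the class data):
inches_demo <- c(70, 64, 72, 66)
male_demo <- c(1, 0, 1, 0)
mean(inches_demo[male_demo == 1])      # 71
mean(inches_demo[male_demo == 0])      # 65
tapply(inches_demo, male_demo, mean)   # both group means in one step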
# We can create new variables any time we want
heightinfeet <- tallness/12
# For those who do have programming experience, traditional things like ifelse,
# for, while, etc. still apply.
oversixfeet <- ifelse(heightinfeet >= 6, 1, 0)
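# For comparison, here is roughly what ifelse() is doing, written out as a for loop
# with if/else. This is only a sketch on made-up heights in feet (feet_demo); the
# one-line ifelse() version is usually the easier choice.
feet_demo <- c(5.5, 6.2, 6.0)
tall_demo <- numeric(length(feet_demo))   # empty container for the results
for (i in seq_along(feet_demo)) {
  if (feet_demo[i] >= 6) {
    tall_demo[i] <- 1
  } else {
    tall_demo[i] <- 0
  }
}
tall_demo                      # 0 1 1
ifelse(feet_demo >= 6, 1, 0)   # 0 1 1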
# Or suppose that we wanted to recode measures we knew were wrong (like a negative value of
# height) to be missing instead. What's missing data in R?
tallness[tallness < 0] <- NA
# Or recode other values
tallness1 <- tallness
tallness1[tallness1 < 60] <- NA
tallness1
# One of the most important things you can do in R is analyze data. Some of the
# best (though not most exact) analysis is simple visualization of univariate
# and bivariate (that is, one and two variable) relationships.
hist(tallness)
# density() stops on missing values, so drop any NAs we created above
plot(density(tallness, na.rm = TRUE))
plot(man, tallness)
table(man, tallness)
# What an ugly plot! Preview changing defaults ...
plot(man, tallness, xlab = "Sex (Male = 1)", ylab = "Height (In Inches)",
     pch = 2, col = "blue", lwd = 3)
# We might also be interested in some basic bivariate (two-variable) methods.
# Make some fake data.
# First make some fake men (1) and women (0)
male <- c(rep(1, 1379), rep(0, 1810))
# Then give them fake approval (0/1) of the president
obama <- c(rep(0, 682), rep(1, 697),
           rep(0, 752), rep(1, 1058))
# Set the seed once, before any random draws, so that both income and feeling
# are reproducible
set.seed(1)
income <- c(round(rnorm(682, 50000, 10000)),
            round(rnorm(697, 75000, 15000)),
            round(rnorm(752, 50000, 10000)),
            round(rnorm(1058, 75000, 15000)))
# Now give them fake feeling thermometers (continuous)
feeling <- c(round(rnorm(682, 35, 5)),
             round(rnorm(697, 55, 5)),
             round(rnorm(752, 35, 5)),
             round(rnorm(1058, 55, 5)))
# Also make data with no relationships for comparison
male.fake <- c(rep(1, 1000), rep(0, 1000))
obama.fake <- c(rep(0, 498), rep(1, 502),
                rep(0, 501), rep(1, 499))
set.seed(1)
feeling.curves <- c(round(rnorm(1000, 84, 3)),
                    round(rnorm(1000, 84, 4)))
# Fake income with no relationship to feeling (used in the last plot below)
income.fake <- round(rnorm(length(male), 75000, 5000))
# Two discrete variables? (With small numbers of categories?) Appropriate method is
# categorical analysis.
table(obama, male)
chisq.test(obama, male, correct = FALSE)
table(male.fake, obama.fake)
chisq.test(obama.fake, male.fake, correct = FALSE)
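# The printed output is not the only way to read a chi-squared test. If we assign
# the test to an object, we can pull out pieces with $, just like variables in a
# dataset. A sketch on a tiny made-up example (vote_demo, sex_demo, and chi_demo
# are invented names and data):
vote_demo <- c(rep(0, 30), rep(1, 20), rep(0, 20), rep(1, 30))
sex_demo <- c(rep(1, 50), rep(0, 50))
chi_demo <- chisq.test(vote_demo, sex_demo, correct = FALSE)
chi_demo$statistic   # the chi-squared statistic (4 for this table)
chi_demo$p.value     # the p-value
chi_demo$expected    # the counts we would expect with no relationship (25 per cell)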
# One discrete variable and one continuous variable? Appropriate method is t test
par(mfrow = c(2, 1))
hist(subset(feeling, male == 1), main = "Male Feelings")
hist(subset(feeling, male == 0), main = "Female Feelings")
t.test(subset(feeling, male == 1), subset(feeling, male == 0))
par(mfrow = c(2, 1))
hist(subset(feeling.curves, male.fake == 1), main = "Male Feelings")
hist(subset(feeling.curves, male.fake == 0), main = "Female Feelings")
t.test(subset(feeling.curves, male.fake == 1),
       subset(feeling.curves, male.fake == 0))
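# As with the chi-squared test, a t test can be saved as an object so that its
# pieces can be pulled out with $. A sketch on made-up groups (the seed, the group
# means of 55 and 35, and all of the _demo names are arbitrary choices):
set.seed(2)
group_a_demo <- rnorm(50, 55, 5)
group_b_demo <- rnorm(50, 35, 5)
t_demo <- t.test(group_a_demo, group_b_demo)
t_demo$estimate   # the two group means
t_demo$p.value    # the p-value for the difference in means
t_demo$conf.int   # confidence interval for the difference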
# Two continuous variables? Appropriate method is Pearson's r
dev.off() # close the plot window to reset the 2 x 1 layout
plot(income, feeling)
r <- cor(income, feeling)
r
plot(income.fake, feeling)
r <- cor(income.fake, feeling)
r
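# cor() only reports the correlation itself. If we also want a significance test,
# cor.test() gives both. A sketch on a handful of made-up numbers (x_demo, y_demo,
# and r_test_demo are invented for this illustration):
x_demo <- c(1, 2, 3, 4, 5)
y_demo <- c(2, 4, 5, 4, 5)
cor(x_demo, y_demo)
r_test_demo <- cor.test(x_demo, y_demo)
r_test_demo$estimate   # Pearson's r, the same number cor() returns
r_test_demo$p.value    # the p-value for the test that r is zero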