Organise your data

Use R to specify factors, recode variables and begin by-group analyses.

Files

This file contains data on pain scores after laparoscopic vs. open hernia repair. Age, gender and whether the hernia is primary or recurrent are also included. The ultimate aim is to work out which of these factors are associated with more pain after the operation.

lap_hernia

Script

##########################
# Organise your data     #
# Ewen Harrison          #
# April 2013             #
# www.datasurg.net       #
##########################

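# stringsAsFactors=TRUE reads text columns such as gender as factors.
# This was the default before R 4.0.0 and the rest of this script assumes it.
data<-read.table("lap_hernia.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)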

# This is how to check your data, recode variables and 
# begin to analyse group differences 

str(data)

# First look and ensure that all your grouped (categorical) data 
# are factors. Here, not all of them are.
# Check that the continuous data are integers or numeric. 

# The data is in a dataframe we have called data. 
# To access variables within that dataframe, use the "$" sign.

data$recurrent
summary(data$recurrent)

# Recurrent is a variable describing whether a hernia is 
# being repaired for the first time or is recurrent. 
# It is a factor, yes/no, and should be specified as such. 

# Change a variable to a factor
data$recurrent<-factor(data$recurrent)

# Check
summary(data$recurrent)

# Do the same for others.
data$laparoscopic<-factor(data$laparoscopic)
summary(data$laparoscopic)

# Check full dataset again and note what has changed
str(data)
summary(data)

data$gender

# This variable has several different representations of the same thing.
# It needs to be recoded.

# Do this by selecting the matching entries and assigning the new value with "<-" 

data$gender[data$gender=="female"]<-"f"
data$gender[data$gender=="fem "]<-"f"
data$gender[data$gender=="m ale"]<-"m"
data$gender[data$gender=="male"]<-"m"

# This is important. R uses "NA" for missing data.
# All missing data should be specified NA.
# This often happens automatically, but hasn't happened in this case.

data$gender[data$gender==""]<-NA

summary(data$gender)

# Note that all counts are now under the correct levels - 
# "m" and "f"
# Get rid of unused levels by re-defining as a factor:
data$gender<-factor(data$gender)

# This may all seem like a drag, but when you have had to import
# your data 7 times (as usually happens) because of errors
# that someone else made, just being able to ctrl-R this whole page
# to get back to where you were is amazing, rather than click-click
# which you have to do in SPSS etc. 
#---------------------------------------------------------------
# Summarise data by subgroup

# There are lots of ways of doing this. Here are a couple. 

# By
help(by)

# Use "by" followed by the dependent variable you want to summarise
# then what you want to summarise by
# then what you want the summary to be.

by(data$pain.score, data$gender, mean)
by(data$pain.score, data$gender, sd)
by(data$pain.score, data$gender, median)
#etc.

# Make a group comparison by graph. Boxplots are great:
# they show the distribution very well. 

boxplot(data$pain.score~data$gender)

# Split
# This is often taught but I don't use it that much. 
# This splits the dataframe into a list containing two dataframes,
# one for each group

data2<-split(data, data$gender)
str(data2)
summary(data2$f)

# Plyr
# This seems intimidating and is. 
# It will be very useful in the future, especially with large datasets
# Try this. 

# install.packages("plyr") #remove "#" first time to install
library(plyr)
help(package=plyr)

# Plyr takes arrays, dataframes or lists and can output any of these. 
# Here the "dd" in ddply means take a dataframe and give me one back. 

ddply(data, .(gender), summarise, mean=mean(pain.score), sd=sd(pain.score))

# Please post questions or anything that is not clear.
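
The gender recoding in the script is done one value at a time. If there are many different spellings, it may be quicker to clean them in one pass. The following is only a sketch on a made-up vector, not the hernia data. The names toy_gender and toy_clean are invented for illustration, and it assumes every entry is either blank or some spelling of male or female.

```r
# Made-up values imitating the messy gender column (not the real data)
toy_gender <- c("female", "fem ", "m ale", "male", "", "f", "m")

# Remove all spaces, then keep only the first letter
toy_clean <- substr(gsub(" ", "", toy_gender), 1, 1)

# Blank entries become missing
toy_clean[toy_clean == ""] <- NA

# Define as a factor, so only the levels actually present are kept
toy_clean <- factor(toy_clean)

# Count each level, showing missing values as well
table(toy_clean, useNA = "ifany")
```

Always check the result with table() or summary() afterwards, because a shortcut like this will happily turn an unexpected entry into its first letter.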

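The script summarises by subgroup with by, split and plyr. Base R has other ways of doing the same job, and tapply and aggregate are two that are worth knowing. This is a sketch on a made-up dataframe called toy, which borrows the column names from the script. The values are invented and are not from the hernia data.

```r
# Made-up dataframe with the same column names as the script (values invented)
toy <- data.frame(
  pain.score = c(2, 4, 6, 1, 3, 5),
  gender     = factor(c("f", "f", "f", "m", "m", "m"))
)

# tapply: apply a function to pain.score within each level of gender
toy_means <- tapply(toy$pain.score, toy$gender, mean)
toy_means

# aggregate: the same idea using a formula, returning a dataframe
toy_sd <- aggregate(pain.score ~ gender, data = toy, FUN = sd)
toy_sd
```

tapply gives back a named vector, which is handy for a quick look. aggregate gives back a dataframe, which is easier to carry into further analysis.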
 
