Use R to specify factors, recode variables and begin by-group analyses.
Video
Files
This file contains data on pain scores after laparoscopic vs. open hernia repair. Age, gender and primary/recurrent hernia are also included. The ultimate aim here is to work out which of these factors are associated with more pain after this operation.
Script
##########################
# Organise your data     #
# Ewen Harrison          #
# April 2013             #
# www.datasurg.net       #
##########################

# stringsAsFactors=TRUE reads text columns (e.g. gender) as factors.
# This was the default before R 4.0.0 and is what the steps below expect.
data<-read.table("lap_hernia.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)

# This is how to check your data, recode variables and
# begin to analyse group differences

str(data)

# First look and ensure that all your grouped data - categorical -
# are factors - they are not here.
# Check that the continuous data are integers or numeric.

# The data is in a dataframe we have called data.
# To access variables within that dataframe, use the "$" sign.
data$recurrent
summary(data$recurrent)

# Recurrent is a variable describing whether a hernia is
# being repaired for the first time or is recurrent.
# It is a factor, yes/no, and should be specified as such.

# Change a variable to a factor
data$recurrent<-factor(data$recurrent)

# Check
summary(data$recurrent)

# Do the same for others.
data$laparoscopic<-factor(data$laparoscopic)
summary(data$laparoscopic)

# Check full dataset again and note what has changed
str(data)
summary(data)

data$gender
# This variable has a number of different representations of the same thing
# It needs to be recoded
# Do this by using "<-"
data$gender[data$gender=="female"]<-"f"
data$gender[data$gender=="fem "]<-"f"
data$gender[data$gender=="m ale"]<-"m"
data$gender[data$gender=="male"]<-"m"

# This is important. R uses "NA" for missing data.
# All missing data should be specified NA.
# This often happens automatically, but hasn't happened in this case.
data$gender[data$gender==""]<-NA
summary(data$gender)

# Note that all counts are now under the correct levels -
# "m" and "f"
# Get rid of unused levels by re-defining as a factor:
data$gender<-factor(data$gender)

# This may all seem like a drag, but when you have had to import
# your data 7 times (as usually happens) because of errors
# that someone else made, just being able to ctrl-R this whole page
# to get back to where you were is amazing, rather than click-click
# which you have to do in SPSS etc.

#---------------------------------------------------------------
# Summarise data by subgroup
# There are lots of ways of doing this, here's a couple.

# By
help(by)
# Use "by" followed by the dependent variable you want to summarise
# then what you want to summarise by
# then what you want the summary to be.
by(data$pain.score, data$gender, mean)
by(data$pain.score, data$gender, sd)
by(data$pain.score, data$gender, median) #etc.

# Make a group comparison by graph, boxplots are great
# They show the distribution very well.
boxplot(data$pain.score~data$gender)

# Split
# This is often taught but I don't use it that much.
# This splits the dataframe into a list containing two dataframes
# defined by the group
data2<-split(data, data$gender)
str(data2)
summary(data2$f)

# Plyr
# This seems intimidating and is.
# It will be very useful in the future, especially with large datasets
# Try this.
# install.packages("plyr") #remove "#" first time to install
library(plyr)
help(package=plyr)

# Plyr takes data in any form and outputs in any form.
# Here the "dd" means take a dataframe and give me one back.
ddply(data, .(gender), summarise, mean=mean(pain.score), sd=sd(pain.score))

# Please post questions or anything that is not clear.
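The recoding and by-group steps in the script can be tried without the CSV file. What follows is a minimal sketch on a small made-up data frame: the name `toy` and all its values are invented for illustration and are not taken from lap_hernia.csv. It assumes the gender column is held as text while recoding, which avoids having to think about factor levels, and it uses base R's `tapply` and `aggregate` as possible alternatives to `by` and `ddply`.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  gender     = c("female", "fem ", "m ale", "male", "", "f", "m"),
  pain.score = c(4, 6, 3, 5, 2, 8, 1),
  stringsAsFactors = FALSE   # keep gender as character while recoding
)

# Recode the different spellings to a single code
toy$gender[toy$gender %in% c("female", "fem ")] <- "f"
toy$gender[toy$gender %in% c("male", "m ale")]  <- "m"

# Empty strings are missing data and should be NA
toy$gender[toy$gender == ""] <- NA

# Now define as a factor; only the values in use become levels
toy$gender <- factor(toy$gender)
summary(toy$gender)

# Mean pain score by gender (the row with NA gender is left out)
group.means <- tapply(toy$pain.score, toy$gender, mean)
group.means

# Mean and sd by group, returned as a data frame
group.summary <- aggregate(pain.score ~ gender, data = toy,
                           FUN = function(x) c(mean = mean(x), sd = sd(x)))
group.summary
```

Recoding on text and converting to a factor at the end is only one option. If the column is already a factor, as in the script above, the same assignments work as long as "f" and "m" are existing levels.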