1

I am brand new to R and have been learning a lot from looking through other questions here on this fine website!

but now I am dealing with a data management issue that I can't figure out from other examples, so I'm hoping that you can help.

I have a set of survey responses that I've read in from a csv file and wrangled into a vector formatted as in the following example:

test <- c(
  "[1234],Bob Smith,",
  "Q-0,Male",
  "Q-1,18-25",
  "Q-2,Computer Science",
  ",",
  "[5678],Julie Lewis",
  "Q-0,Female",
  "Q-1,18-25",
  ",",
  ","
)

Note that "," appears on its own line because I used fill=TRUE in read.csv to deal with the fact that not all of the lines were the same length. Also note that not all questions have been answered by all respondents.

I need to turn this into a data frame of the following structure:

     ID      name         gender   age    major
1    [1234]  Bob Smith    Male     18-25  Computer Science
2    [5678]  Julie Lewis  Female   18-25  NA
   ...

It seems that I can't read the vector into a matrix or data frame by rows because of the fact that not all questions have been answered by all respondents. Any advice on how to deal with this?

Andrie
  • 176,377
  • 47
  • 447
  • 496
elfs
  • 165
  • 1
  • 1
  • 4
  • 3
    if you edit your question and add some example data that is easy to cut and paste into an R terminal, I bet you get several good answers in a few minutes. As it stands, it's hard to really understand the structure of your data. You can give paste the results of `dput(head(yourDataStructure))` into the question and then we'll have exactly the same data you're working with. Other good tips [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). One thing I don't see in your example data are the column headers, are we to assume you'll add those later? – Chase Jun 24 '12 at 02:13
  • 2
    @Chase's suggestions are good, but I would add a request for the structure of the raw CSV you're trying to read in. – Joshua Ulrich Jun 24 '12 at 02:57
  • thanks to all for the thorough answers and suggestions for how to most usefully ask questions! I now have a lot to work with, really appreciate it. – elfs Jun 24 '12 at 07:21

2 Answers2

2

You will probably save yourself a lot of trouble to read the csv file in the correct format in the first place. read.csv is a powerful function that should be able to cope with your data , and this munging shouldn't be necessary.

However, here goes:

x <- matrix(test, byrow=TRUE, ncol=5)
x <- x <- sub("Q-\\w+,", "", x)
x[x==","] <- NA
x <- cbind(matrix(unlist(strsplit(x[, 1], ",")), byrow=TRUE, ncol=2), x[, -1])
x <- as.data.frame(x, stringsAsFactors=FALSE)
names(x) <- c("ID", "Name", "Gender", "Age", "Major", "V1")

This results in:

x

      ID        Name Gender   Age            Major   V1
1 [1234]   Bob Smith   Male 18-25 Computer Science <NA>
2 [5678] Julie Lewis Female 18-25             <NA> <NA>
Andrie
  • 176,377
  • 47
  • 447
  • 496
  • Won't this fall down if there are different numbers of rows per person? That was the main reason why my attempt was far more kludgy. I was assuming that manually adding rows and cleaning the spreadsheet was trying to be avoided. – thelatemail Jun 24 '12 at 06:55
  • @thelatemail Perhaps, but I've assumed that each person has the same number of rows. That's what you will get from a survey dump to a csv file. – Andrie Jun 24 '12 at 07:14
  • I assume nothing after spending all yesterday manually repairing a database of survey data that had inconsistent formatting. :-P – thelatemail Jun 24 '12 at 07:19
  • thanks, I'll probably end up reading in the file again but seeing an elegant example is super useful – elfs Jun 24 '12 at 07:22
  • yep @thelatemail, there are a different number of rows for some users, I'm going to take some of your suggestions on how to deal with that. gonna take me a little while, though, definitely clawing my way up the learning curve here – elfs Jun 24 '12 at 07:24
  • @thelatemail Ouch. You have my sympathy. – Andrie Jun 24 '12 at 08:39
0

This is a bit clunky, but it works.

Here's the data:

test <- c(
"[1234],Bob Smith,",
"Q-0,Male",
"Q-1,18-25",
"Q-2,Computer Science",
",",
"[5678],Julie Lewis",
"Q-0,Female",
"Q-1,18-25",
",",
"[1234],Bob Smith,",
"Q-1,18-25",
"Q-2,Computer Science",
","
)

Here's the manipulation code:

#remove rows with just a comma
test <- test[test!=","]
#find id cases and remove the commas between the id and the name
#and add an id label
idcases <- grep("\\[.*\\]",test)
test[idcases] <- paste("id,",gsub(",","",test[idcases]),sep="")
#find id values positions and end position
idvals <- c(idcases,length(test)+1)
#generate a sequence identifier for each respondent
setid <- rep(1:(length(idvals)-1),diff(idvals))
#put the set id against each value
result1 <- paste(setid,test,sep=",")
#split the strings up and make them a data.frame
result2 <- data.frame(do.call(rbind,strsplit(result1,",")))
#get the final dataset with a reshape
final <- reshape(result2,idvar="X1",timevar="X2",direction="wide")[,-1]
#clean up the names etc
names(final) <- c("name","gender","age","major")
final$id <-  gsub("(\\[.*\\])(.*)","\\1",final$name)
final$name <- gsub("(\\[.*\\])(.*)","\\2",final$name)

Which gives:

> final
         name gender   age            major     id
1   Bob Smith   Male 18-25 Computer Science [1234]
5 Julie Lewis Female 18-25             <NA> [5678]
8   Bob Smith   <NA> 18-25 Computer Science [1234]
thelatemail
  • 91,185
  • 12
  • 128
  • 188