R: converting list with embedded identifiers into data frame

Question

I am brand new to R and have been learning a lot from looking through other questions here on this fine website!

but now I am dealing with a data management issue that I can't figure out from other examples, so I'm hoping that you can help.

I have a set of survey responses that I've read in from a csv file and wrangled into a vector formatted as in the following example:

test <- c(
  "[1234],Bob Smith,",
  "Q-0,Male",
  "Q-1,18-25",
  "Q-2,Computer Science",
  ",",
  "[5678],Julie Lewis",
  "Q-0,Female",
  "Q-1,18-25",
  ",",
  ","
)

Note that "," appears on its own line because I used fill=TRUE in read.csv to deal with the fact that not all of the lines were the same length. Also note that not all questions have been answered by all respondents.

I need to turn this into a data frame of the following structure:

     ID      name         gender   age    major
1    [1234]  Bob Smith    Male     18-25  Computer Science
2    [5678]  Julie Lewis  Female   18-25  NA
   ...

It seems that I can't read the vector into a matrix or data frame by rows because of the fact that not all questions have been answered by all respondents. Any advice on how to deal with this?

if you edit your question and add some example data that is easy to cut and paste into an R terminal, I bet you get several good answers in a few minutes. As it stands, it's hard to really understand the structure of your data. You can give paste the results of `dput(head(yourDataStructure))` into the question and then we'll have exactly the same data you're working with. Other good tips [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). One thing I don't see in your example data are the column headers, are we to assume you'll add those later? — Chase, Jun 24 '12 at 02:13
@Chase's suggestions are good, but I would add a request for the structure of the raw CSV you're trying to read in. — Joshua Ulrich, Jun 24 '12 at 02:57
thanks to all for the thorough answers and suggestions for how to most usefully ask questions! I now have a lot to work with, really appreciate it. — elfs, Jun 24 '12 at 07:21

Andrie · Answer 1 · 2012-06-24T06:32:54.987

2

You will probably save yourself a lot of trouble to read the csv file in the correct format in the first place. read.csv is a powerful function that should be able to cope with your data , and this munging shouldn't be necessary.

However, here goes:

x <- matrix(test, byrow=TRUE, ncol=5)
x <- x <- sub("Q-\\w+,", "", x)
x[x==","] <- NA
x <- cbind(matrix(unlist(strsplit(x[, 1], ",")), byrow=TRUE, ncol=2), x[, -1])
x <- as.data.frame(x, stringsAsFactors=FALSE)
names(x) <- c("ID", "Name", "Gender", "Age", "Major", "V1")

This results in:

x

      ID        Name Gender   Age            Major   V1
1 [1234]   Bob Smith   Male 18-25 Computer Science <NA>
2 [5678] Julie Lewis Female 18-25             <NA> <NA>

edited Jun 24 '12 at 06:32

answered Jun 24 '12 at 05:46

Andrie

176,377
47
447
496

Won't this fall down if there are different numbers of rows per person? That was the main reason why my attempt was far more kludgy. I was assuming that manually adding rows and cleaning the spreadsheet was trying to be avoided. – thelatemail Jun 24 '12 at 06:55
@thelatemail Perhaps, but I've assumed that each person has the same number of rows. That's what you will get from a survey dump to a csv file. – Andrie Jun 24 '12 at 07:14
I assume nothing after spending all yesterday manually repairing a database of survey data that had inconsistent formatting. :-P – thelatemail Jun 24 '12 at 07:19
thanks, I'll probably end up reading in the file again but seeing an elegant example is super useful – elfs Jun 24 '12 at 07:22
yep @thelatemail, there are a different number of rows for some users, I'm going to take some of your suggestions on how to deal with that. gonna take me a little while, though, definitely clawing my way up the learning curve here – elfs Jun 24 '12 at 07:24
@thelatemail Ouch. You have my sympathy. – Andrie Jun 24 '12 at 08:39

score 0 · Accepted Answer · answered Jun 24 '12 at 04:35

This is a bit clunky, but it works.

Here's the data:

test <- c(
"[1234],Bob Smith,",
"Q-0,Male",
"Q-1,18-25",
"Q-2,Computer Science",
",",
"[5678],Julie Lewis",
"Q-0,Female",
"Q-1,18-25",
",",
"[1234],Bob Smith,",
"Q-1,18-25",
"Q-2,Computer Science",
","
)

Here's the manipulation code:

#remove rows with just a comma
test <- test[test!=","]
#find id cases and remove the commas between the id and the name
#and add an id label
idcases <- grep("\\[.*\\]",test)
test[idcases] <- paste("id,",gsub(",","",test[idcases]),sep="")
#find id values positions and end position
idvals <- c(idcases,length(test)+1)
#generate a sequence identifier for each respondent
setid <- rep(1:(length(idvals)-1),diff(idvals))
#put the set id against each value
result1 <- paste(setid,test,sep=",")
#split the strings up and make them a data.frame
result2 <- data.frame(do.call(rbind,strsplit(result1,",")))
#get the final dataset with a reshape
final <- reshape(result2,idvar="X1",timevar="X2",direction="wide")[,-1]
#clean up the names etc
names(final) <- c("name","gender","age","major")
final$id <-  gsub("(\\[.*\\])(.*)","\\1",final$name)
final$name <- gsub("(\\[.*\\])(.*)","\\2",final$name)

Which gives:

> final
         name gender   age            major     id
1   Bob Smith   Male 18-25 Computer Science [1234]
5 Julie Lewis Female 18-25             <NA> [5678]
8   Bob Smith   <NA> 18-25 Computer Science [1234]

nice strategy for dealing with missing values for some respondents — elfs, Jun 24 '12 at 07:39

R: converting list with embedded identifiers into data frame

2 Answers2