I have an R data frame that looks like:
ID         YR    SC
ABX:22798  1976  A's Frnd; Cat; Cat & Mse
ABX:23798  1983  A's Frnd; Cat; Zebra Fish
ABX:22498  2010  Zebra Fish
ABX:22728  2010  Bear; Dog; Zebra Fish
ABX:22228  2011  Bear
Example data:
df <- structure(list(ID = c("ABX:22798", "ABX:23798", "ABX:22498", "ABX:22728", "ABX:22228"), YR = c(1976, 1983, 2010, 2010, 2011), SC = c("A's Frnd; Cat; Cat & Mse", "A's Frnd; Cat; Zebra Fish", "Zebra Fish", "Bear; Dog; Zebra Fish", "Bear")), .Names = c("ID", "YR", "SC"), row.names = c(NA, 5L), class = "data.frame")
I would like to transform it by splitting the text string in the SC column on "; ". Then I'd like to use the resulting lists of strings to populate new columns with binary data. The final data frame would look like this:
ID         YR    A's Frnd  Bear  Cat  Cat & Mse  Dog  Zebra Fish
ABX:22798  1976  1         0     1    1          0    0
ABX:23798  1983  1         0     1    0          0    1
ABX:22498  2010  0         0     0    0          0    1
ABX:22728  2010  0         1     0    0          1    1
ABX:22228  2011  0         1     0    0          0    0
I'll be analyzing a number of different datasets individually. In any given dataset, there are between about 100 and 230 unique SC entries, and the number of rows per dataset ranges from about 500 to several thousand. Each row has between 1 and about 6 SCs.
I have made a couple of attempts at this, and most are quite ugly. The approach below looked promising (it's similar to a Python pandas implementation that works well). It would be great to learn a good way to do this in R!
My starter code:
# Get list of unique SCs
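# SC is the third column; df[,2] would select YR instead
SCs <- df$SC
# strsplit is vectorized: one character vector of SCs per row
SCslist <- strsplit(SCs, split = "; ", fixed = TRUE)
SCunique <- unique(unlist(SCslist, use.names = FALSE))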
# Sort alphabetically,
# note that apostrophes could be a problem
SCunique <- sort(SCunique)
# create a dataframe of 0s to add to the original df
df0 <- as.data.frame(matrix(0, ncol=length(SCunique), nrow=nrow(df)))
colnames(df0) <- SCunique
...(and then...?)
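A minimal sketch of one way the remaining step might go, assuming each row's split SCs are matched against the sorted unique list with %in%. Rather than filling df0 in place, it builds the 0/1 matrix directly. The names dummies and result are made up for illustration, and the sketch is written against the example data only; whether it is idiomatic or fast enough for the larger datasets is an open question.

```r
# Example data from above, rebuilt so the sketch is self-contained
df <- data.frame(
  ID = c("ABX:22798", "ABX:23798", "ABX:22498", "ABX:22728", "ABX:22228"),
  YR = c(1976, 1983, 2010, 2010, 2011),
  SC = c("A's Frnd; Cat; Cat & Mse", "A's Frnd; Cat; Zebra Fish",
         "Zebra Fish", "Bear; Dog; Zebra Fish", "Bear"),
  stringsAsFactors = FALSE
)

# One character vector of SCs per row, then the sorted set of unique SCs
SCslist  <- strsplit(df$SC, split = "; ", fixed = TRUE)
SCunique <- sort(unique(unlist(SCslist, use.names = FALSE)))

# For each row, flag which unique SCs are present (TRUE/FALSE -> 1/0),
# then stack the per-row vectors into a matrix with one row per df row
dummies <- do.call(rbind, lapply(SCslist, function(x) as.integer(SCunique %in% x)))
colnames(dummies) <- SCunique

# cbind on a data frame does not mangle column names,
# so "A's Frnd" and "Cat & Mse" are kept as they are
result <- cbind(df[, c("ID", "YR")], dummies)
print(result)
```

Because the column names contain spaces and apostrophes, they would need to be accessed with result[["Cat & Mse"]] or backticks rather than plain $ syntax.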
I've found similar questions/answers, including:
Dummy variables from a string variable
Split strings into columns in R where each string has a potentially different number of column entries
Edit: Found one more answer set of interest: Improve text processing speed using R and data.table
Thanks in advance for your answers.