1

I have an R data frame that looks like:

ID          YR      SC
ABX:22798   1976    A's Frnd; Cat; Cat & Mse
ABX:23798   1983    A's Frnd; Cat; Zebra Fish
ABX:22498   2010    Zebra Fish
ABX:22728   2010    Bear; Dog; Zebra Fish
ABX:22228   2011    Bear

example data:

df <- structure(list(ID = c("ABX:22798", "ABX:23798", "ABX:22498", "ABX:22728", "ABX:22228"), YR = c(1976, 1983, 2010, 2010, 2011), SC = c("A's Frnd; Cat; Cat & Mse", "A's Frnd; Cat; Zebra Fish", "Zebra Fish", "Bear; Dog; Zebra Fish", "Bear")), .Names = c("ID", "YR", "SC"), row.names = c(NA, 5L), class = "data.frame")

That I would like to transform by splitting the text string in the SC column by "; ". Then, I'd like to use the resulting lists of strings to populate new columns with binary data. The final data frame would look like this:

ID          YR   A's Frnd   Bear    Cat  Cat & Mse  Dog Zebra Fish
ABX:22798   1976    1       0       1    1          0   0
ABX:23798   1983    1       0       1    0          0   1
ABX:22498   2010    0       0       0    0          0   1
ABX:22728   2010    0       1       0    0          1   1
ABX:22228   2011    0       1       0    0          0   0

I'll be analyzing a number of different datasets individually. In any given data set, there are between about 100 and 230 unique SCs entries, and the number of rows per set ranges from about 500 to several thousand. The number of SCs per row ranges from 1 to about 6 or so.

I have had a couple of starts with this, most are quite ugly. I thought the approach below looked promising (it's similar to a python pandas implementation that works well). It would be great to learn a good way to do this in R!

My starter code:

# Get list of unique SCs
SCs <- df[,2]
SCslist <- lapply(SCs, strsplit, split="; ")
SCunique <- unique(unlist(SCslist, use.names = FALSE))
# Sort alphabetically, 
#     note that apostrophes could be a problem  
SCunique <- sort(SCunique)
# create a dataframe of 0s to add to the original df
df0 <- as.data.frame(matrix(0, ncol=length(SCunique), nrow=nrow(df)))
colnames(df0) <- SCunique

...(and then...?)

I've found similar questions/answers, including:

Dummy variables from a string variable
Split strings into columns in R where each string has a potentially different number of column entries

Edit: Found one more answer set of interest: Improve text processing speed using R and data.table

Thanks in advance for your answers.

Community
  • 1
  • 1

2 Answers2

2

I think this should do what you're looking for:

library(reshape2)
a <- strsplit(as.character(df$SC), "; ", fixed = TRUE)
dfL <- merge(df, melt(a), by.x = "row.names", by.y = "L1", all = TRUE)
dcast(dfL, ID + YR ~ value, value.var="value", fun.aggregate=length, fill = 0)
#          ID   YR A's Frnd Cat Cat & Mse Zebra Fish Bear Dog
# 1 ABX:22228 2011        0   0         0          0    1   0
# 2 ABX:22498 2010        0   0         0          1    0   0
# 3 ABX:22728 2010        0   0         0          1    1   1
# 4 ABX:22798 1976        1   1         1          0    0   0
# 5 ABX:23798 1983        1   1         0          1    0   0

Factor the "value" column first if the column order is important to you.


I also have a package called "splitstackshape" that would work nicely with this if your data didn't have quotes in it. That's a bug that I'm looking into. Once it is resolved, it should be possible to do a command like:

library(reshape2)
library(splitstackshape)
dcast(concat.split.multiple(df, "SC", ";", "long"), 
      ID + YR ~ SC, value.var="SC", fun.aggregate=length, fill=0)

to get what you need.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Hi, thanks for your answer... I did look at your splitstackshape earlier - it seemed so close to what I need! One other small issue I had with using splitstackshape was my 2 character split ("; ")... – user3109755 Dec 17 '13 at 09:41
  • @user3109755, you don't actually need to use `"; "` with "splitstackshape" if the purpose is just to strip the whitespace out. The functions I've written take care of that automatically. But, as I said, it presently won't work with the apostrophe, hence my suggested workaround with "reshape2". – A5C1D2H2I1M1N2O1R2T1 Dec 17 '13 at 09:43
  • I modified the first line slightly, to: a <- strsplit(as.character(df_sc3[,"SC"]), "; ", fixed = TRUE) - the only difference is the [,"SC"]. I keep running into an issue when I try to run the second line. I get the error "Error: C stack usage is too close to the limit". I've tried cleaning up my workspace, shutting down other programs, and increasing my 'Cstack_info()["size"]', but to no avail. Any ideas? – user3109755 Dec 17 '13 at 09:52
  • @user3109755, not sure. Did you verify the output of "a"? Did you try running `melt(a)` separately to see that it works correctly? I combined several steps in one in my answer, but maybe you should just look into each individual part and see if you can identify where the error is introduced. – A5C1D2H2I1M1N2O1R2T1 Dec 17 '13 at 12:57
  • Thanks for the lead. Your solution worked perfectly once melt was removed from the dfL <- line: # library(reshape2) # a <- strsplit(as.character(df[,"SC"]), "; ", fixed = TRUE) # b <- melt(a) # dfL <- merge(df, b, by.x="row.names", by.y = "L1", all = TRUE) # dfX <- dcast(dfL, UT + YR ~ value, value.var="value", fun.aggregate=length, fill = 0) # this also worked with including melt in the first line: # b <- melt(strsplit(as.character(df[,"SC"]), "; ", fixed = TRUE)) – user3109755 Dec 17 '13 at 16:33
0

Continuing on your logic, you can do something like:

##your code
# Get list of unique SCs
SCs <- df[,3]  #NOTE: here, you have df[,2], but I guess should have been df[,3]
SCslist <- lapply(SCs, strsplit, split="; ")
SCunique <- unique(unlist(SCslist, use.names = FALSE))
# Sort alphabetically, 
#     note that apostrophes could be a problem  
SCunique <- sort(SCunique)
##

df0 <- sapply(SCunique, function(x) as.integer(grepl(x, SCs)))
df0
#     A's Frnd Bear Cat Cat & Mse Dog Zebra Fish
#[1,]        1    0   1         1   0          0
#[2,]        1    0   1         0   0          1
#[3,]        0    0   0         0   0          1
#[4,]        0    1   0         0   1          1
#[5,]        0    1   0         0   0          0
alexis_laz
  • 12,884
  • 4
  • 27
  • 37