
I need a fast and concise way to split delimited strings in a data frame into a set of count columns. Let's say I have this data frame:

data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") )

(Please note that the delimiters differ between columns.)

The number of string columns is usually not known in advance (although I can try to discover the whole set of cases if I have no alternative).

I need two data frames like these:

tok1.occurrences:
    +----+---+---+---+---+---+
    | id | a | b | c | d | e | 
    +----+---+---+---+---+---+
    |  1 | 1 | 1 | 1 | 0 | 0 |
    |  2 | 2 | 0 | 0 | 1 | 0 |
    |  3 | 0 | 1 | 0 | 1 | 1 |
    +----+---+---+---+---+---+

tok2.occurrences:
    +----+-------+-------+---------+-------+-------+
    | id | alpha | bravo | charlie | delta | tango | 
    +----+-------+-------+---------+-------+-------+
    |  1 |   1   |   1   |    0    |   0   |   0   |
    |  2 |   1   |   0   |    1    |   0   |   0   |
    |  3 |   0   |   0   |    0    |   1   |   2   |
    +----+-------+-------+---------+-------+-------+

I tried using this syntax:

tok1.f = factor(data$tok1)
dummies <- model.matrix(~tok1.f)

This gave only a partial solution: it creates the dummy variables correctly, but (obviously) it does not split the strings on the delimiter, so each whole string becomes a single level.

I know I can use the 'tm' package to build a document-term matrix, but that seems like overkill for such a simple tokenization. Is there a more straightforward way?

Gabriele B
  • And [**here**](http://stackoverflow.com/q/16267552/1478381) too (which I would argue is the proper question to use as a close reason). – Simon O'Hanlon Sep 24 '14 at 08:44
  • Actually I have voted to reopen this question. Though they are *very* similar, they are not *exact* duplicates. However I would suggest you illustrate your question with what you have tried - it'll garner goodwill if nothing else. At the moment, you do not have a coding error/problem, you have a task that you want someone else to solve for you. – Simon O'Hanlon Sep 24 '14 at 08:50
  • I have no coding errors because I don't know which code to write for the task. However, I actually *did* run some tests using the tm package. Basically, I used the package to build a document-term matrix against a dictionary of terms from the various alpha, bravo, charlie, a, b... – Gabriele B Sep 24 '14 at 08:53
  • Added a first (unsuccessful) try – Gabriele B Sep 24 '14 at 09:35

3 Answers


The easiest thing that I can think of is to use my cSplit function in conjunction with dcast.data.table, like this:

library(splitstackshape)
dcast.data.table(cSplit(data, "tok1", ", ", "long"), 
                 id ~ tok1, value.var = "tok1", 
                 fun.aggregate = length)
#    id a b c d e
# 1:  1 1 1 1 0 0
# 2:  2 2 0 0 1 0
# 3:  3 0 1 0 1 1

dcast.data.table(cSplit(data, "tok2", "|", "long"), 
                 id ~ tok2, value.var = "tok2", 
                 fun.aggregate = length)
#    id alpha bravo charlie delta tango
# 1:  1     1     1       0     0     0
# 2:  2     1     0       1     0     0
# 3:  3     0     0       0     1     2

Edit: Updated with library(splitstackshape) since cSplit is now part of that package.
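
Both columns can also be handled in one pass. What follows is only a sketch, assuming the delimiter for each column is known and supplied in a named vector (`seps` and `occurrences` are illustrative names, not part of either package), and calling the data.table generic `dcast()` rather than `dcast.data.table` directly:

```r
library(splitstackshape)
library(data.table)

data <- data.frame(id = c(1, 2, 3),
                   tok1 = c("a, b, c", "a, a, d", "b, d, e"),
                   tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"))

# Assumption: one delimiter per string column, keyed by column name
seps <- c(tok1 = ", ", tok2 = "|")

occurrences <- lapply(names(seps), function(col) {
  # Reshape one column to long form: one row per (id, token) pair
  long <- cSplit(data[, c("id", col)], col, seps[[col]], "long")
  # Count tokens per id; tokens absent from a row are filled with 0
  dcast(long, as.formula(paste("id ~", col)),
        value.var = col, fun.aggregate = length)
})
names(occurrences) <- paste0(names(seps), ".occurrences")

occurrences$tok1.occurrences
occurrences$tok2.occurrences
```

Each list element should match the corresponding table shown above.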

A5C1D2H2I1M1N2O1R2T1

If you don't mind using data.table (temporarily), this might work for you:

library(data.table)

data <- data.frame(id=c(1,2,3), 
                   tok1=c("a, b, c", "a, a, d", "b, d, e"), 
                   tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta"))

splitCols <- function(col_name, data) {

  # strsplit needs strings

  data[, col_name] <- as.character(data[, col_name])

  # make a list of single row data frames from the tabulation
  # of each of items from the split column

  tokens <- lapply(strsplit(data[, col_name], "[^[:alnum:]]+"), function(x) {
    tab <- table(x)
    setNames(rbind.data.frame(as.numeric(tab)), names(tab))
  })

  # use data.table's rbindlist, filling in missing values

  rbl <- rbindlist(tokens, fill=TRUE)

  # 0 out the NA's

  rbl[is.na(rbl)] <- 0

  # add the "id" column

  cbind(id=data$id, rbl)

}

lapply(names(data)[-1], splitCols, data)

## [[1]]
##    id a b c d e
## 1:  1 1 1 1 0 0
## 2:  2 2 0 0 1 0
## 3:  3 0 1 0 1 1
## 
## [[2]]
##    id alpha bravo charlie delta tango
## 1:  1     1     1       0     0     0
## 2:  2     1     0       1     0     0
## 3:  3     0     0       0     1     2

You end up with a list of data frames that you can then process as you see fit.
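
If data.table is not an option, the same split-and-tabulate idea can be sketched in base R alone. This is a minimal sketch, not a tuned implementation: `count_tokens` is a hypothetical helper name, and it reuses the assumption made above that any run of non-alphanumeric characters is a delimiter.

```r
data <- data.frame(id = c(1, 2, 3),
                   tok1 = c("a, b, c", "a, a, d", "b, d, e"),
                   tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"),
                   stringsAsFactors = FALSE)

# count_tokens is a hypothetical helper, not part of any package
count_tokens <- function(col_name, data, pattern = "[^[:alnum:]]+") {
  # Split each string on runs of non-alphanumeric characters
  parts <- strsplit(as.character(data[[col_name]]), pattern)
  # Shared, sorted token set so that every row gets the same columns
  lev <- sort(unique(unlist(parts)))
  # Count each row's tokens against the shared set; absent tokens count as 0
  counts <- do.call(rbind, lapply(parts, function(x) {
    tabulate(match(x, lev), nbins = length(lev))
  }))
  colnames(counts) <- lev
  data.frame(id = data$id, counts, check.names = FALSE)
}

# Treat every column except "id" as a string column to split
str_cols <- setdiff(names(data), "id")
result <- setNames(lapply(str_cols, count_tokens, data = data),
                   paste0(str_cols, ".occurrences"))

result$tok1.occurrences
#   id a b c d e
# 1  1 1 1 1 0 0
# 2  2 2 0 0 1 0
# 3  3 0 1 0 1 1
```

Naming the list elements makes it possible to pull out `result$tok2.occurrences` directly, however many string columns there turn out to be.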

hrbrmstr

You could use stringr package as follows:

require(stringr)

test_data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") )

#Convert to character class and normalize the delimiter to ","
test_data$tok1<-as.character(test_data$tok1)
test_data$tok1<-gsub(" ","",test_data$tok1)
test_data$tok2=gsub("\\|",",",as.character(test_data$tok2))

#Unique list of elements for each column
tok1.uniq=sort(unique(unlist(strsplit(as.character(test_data$tok1),","))))
tok2.uniq=sort(unique(unlist(strsplit(as.character(test_data$tok2),","))))

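#Token count for each column

#In each row, count the occurrences of each unique token using str_count from the stringr package
#(str_count matches the token as a pattern anywhere in the string, so a token that is a substring of another token would be overcounted)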

Column one:

tok1.occurances=do.call(cbind,lapply(tok1.uniq,function(x) {

DF=data.frame(do.call(rbind,lapply(test_data$tok1,function(y,z=x) str_count(y,z))))
colnames(DF) = x
return(DF)

}
))

#Add the id column from the original data (row names would only match the ids by coincidence)
tok1.occurances=data.frame(id=test_data$id,tok1.occurances,stringsAsFactors=FALSE)


# > tok1.occurances
# id a b c d e
#  1 1 1 1 0 0
#  2 2 0 0 1 0
#  3 0 1 0 1 1

Column two:

tok2.occurances=do.call(cbind,lapply(tok2.uniq,function(x) {

DF=data.frame(do.call(rbind,lapply(test_data$tok2,function(y,z=x) str_count(y,z))))
colnames(DF) = x
return(DF)

}
))

tok2.occurances=data.frame(id=test_data$id,tok2.occurances,stringsAsFactors=FALSE)


# > tok2.occurances
# id alpha bravo charlie delta tango
#  1     1     1       0     0     0
#  2     1     0       1     0     0
#  3     0     0       0     1     2
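
One caveat: `str_count` treats the token as a regular expression and counts matches anywhere in the string, so a short token is also counted inside longer ones. A possible workaround, sketched here on made-up strings and assuming tokens consist only of word characters (letters, digits, underscore), is to wrap the token in word boundaries:

```r
library(stringr)

# Made-up strings in which the token "a" also appears inside longer tokens
x <- c("alpha,ab,a", "a,a,b")

# Plain pattern: counts every "a", including those inside "alpha" and "ab"
str_count(x, "a")
# [1] 4 2

# Word-boundary pattern: counts "a" only when it stands alone as a token
str_count(x, "\\ba\\b")
# [1] 1 2
```

In the code above this would mean passing `paste0("\\b", z, "\\b")` instead of `z` as the pattern. Tokens containing regex metacharacters would need escaping first.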
Silence Dogood