Split a column of strings into variable number of columns using data.table in R

Question

I want to analyze many years of of Quicken home finance records. I exported a file to qif and used bank2csv program to render a csv. Within Quicken one can use a category (eg automobile, tax), subcategories (eg automobile:service, automobile:fuel) and tags (eg self, spouse, son). bank2csv renders the categories:subcategories/tag as a concatenated string. I want to instead put the category in a category column, subcategory in a subcategory column and put whatever tags in the tag column. I saw a similar question but alas that worked bystrsplit then unlist and then indexing each element so that it could be written to the correct place by assignment. That will not work here since sometimes there is no tag and sometimes there is no subcategory. It is quite easy to split the string into a list and save that list in a column but how on earth does one assign the first element of the list to one column and the second element (if it exists) to a second column. Surely there is an elegant easy way.

simplified sample

library(data.table)
library(stringi)
dt <- data.table(category.tag=c("toys/David", "toys/David", "toys/James", "toys", "toys", "toys/James"), transaction=1:6)

How do I create a third and fourth column: category, tag. Some of tag would be NA

I can do the following but it does not get me very far. I need a way to specify the first or the second element of the resultant list (as opposed to the whole list)

dt[, category:= strsplit(x = category.tag, split = "/") ]

Arun · Accepted Answer · 2015-04-01T15:40:50.617

Just pushed two functions transpose() and tstrsplit() in data.table v1.9.5.

With this we can do:

require(data.table)
dt[, c("category", "tag") := tstrsplit(category.tag, "/", fixed=TRUE)]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

tstrsplit is a wrapper for transpose(strsplit(as.character(x), ...)). And you can also pass fill=. to fill missing values with any other value than NA.

transpose() can also be used on lists, data frames and data tables.

Rich Scriven · Answer 2 · 2014-11-24T23:33:39.963

4

You could use cSplit

library(splitstackshape)
dt[, c("category", "tag") := cSplit(dt[,.(category.tag)], "category.tag", "/")]
dt
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

edited Nov 24 '14 at 23:33

answered Nov 24 '14 at 23:10

Rich Scriven

97,041
11
181
245

Is `cSplit` an abbreviation of `concat.split`? – Farrel Nov 24 '14 at 23:20
I believe it's similar, with some added functionality – Rich Scriven Nov 24 '14 at 23:27
I love the one line easy elegance. – Farrel Nov 24 '14 at 23:30
@Farrel, it's an alias, essentially. Several people were complaining that `concat.split` was too awkward to type, so when I was working on improving the function's efficiency, I changed the name to `cSplit`. – A5C1D2H2I1M1N2O1R2T1 Nov 25 '14 at 01:12
@AnandaMahto the reason that I asked, is that when I searched for it in Rdocumenctation.org under function, nothing came up. – Farrel Nov 25 '14 at 02:42
1

@Farrel, The package version for "splitstackshape" at Rdocumentation is 1.2, but the present version is 1.4.2. `concat.split` is essentially Gabor's `read.table` approach, while `cSplit` is a *much* faster implementation using `strsplit`. By version 1.6, I plan to have moved over to `stri_split`, which would be even faster. Here's a (functional) [toy implementation](https://gist.github.com/mrdwab/3865d516fcc575d7099c) of where `cSplit` should be soon. – A5C1D2H2I1M1N2O1R2T1 Nov 25 '14 at 03:20
A little complication. In my simple example I only showed two possible fields with one separate being the hash. In actual fact it is categories:subcategories/tag. Can I specify any of two separators so `sep = ":|/"` or what about cat is the first, subcategory is whatever comes after the : and tags are whatever comes over the slash /? – Farrel Nov 25 '14 at 04:38
I decided to run two lines. The first line separates the category:subcategory from the tag which is after the /. Then I run a similar line that spits the category from the subcategory that uses the colon as the separator. It worked well. – Farrel Nov 25 '14 at 05:16

G. Grothendieck · Answer 3 · 2014-11-25T13:10:54.173

2

1) Try read.table

read <- function(x) read.table(text = x, sep  = "/", fill = TRUE, na.strings = "")
dt[, c("category", "tag") := read(category.tag)]

No extra packages are needed.

2) An alternative is to use separate in the tidyr package:

library(tidyr)
separate(dt, category.tag, c("category", "tag"), extra = "drop")

The above is with tidyr version 0.1.0.9000 from github. To install it ensure that the devtools R package is installed and issue the command: devtools::install_github("hadley/tidyr") .

Update: Incorporated Richard Scrivens' comment and minor improvements. Added tidyr solution.

edited Nov 25 '14 at 13:10

answered Nov 24 '14 at 23:14

G. Grothendieck

254,981
17
203
341

Absolutely wonderful. Now how should I decide who to award the "accepted" to. You or @RichardScriven. I like being introduced to new packages that may make my life easier with other problems but then again I assume there is an overhead when I use new packages instead of the functions that are already loaded. – Farrel Nov 24 '14 at 23:29
getting `Error in strsplit(value, sep, ...) : unused argument (extra = "drop")` – Farrel Nov 25 '14 at 04:43
I am using version 0.1 – Farrel Nov 25 '14 at 05:09
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/65554/discussion-between-farrel-and-g-grothendieck). – Farrel Nov 25 '14 at 05:20

score 1 · Answer 4 · answered Nov 25 '14 at 01:22

Since you already have "stringi" loaded, you can also look at stri_split_fixed and the simplify argument:

setnames(cbind(dt, stri_split_fixed(dt$category.tag, "/", simplify = TRUE)), 
         c("V1", "V2"), c("category", "tag"))[]
#    category.tag transaction category   tag
# 1:   toys/David           1     toys David
# 2:   toys/David           2     toys David
# 3:   toys/James           3     toys James
# 4:         toys           4     toys    NA
# 5:         toys           5     toys    NA
# 6:   toys/James           6     toys James

Though I must admit I'm partial to cSplit :-)

Split a column of strings into variable number of columns using data.table in R

4 Answers4