23

I looking for a way to efficiently apply a function to each row of data.table. Let's consider the following data table:

library(data.table)
library(stringr)

x <- data.table(a = c(1:3, 1), b = c('12 13', '14 15', '16 17', '18 19'))
> x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

Let's say I want to split each element of column b by space (thus yielding two rows for each row in the original data) and join the resulting data tables. For the example above, I need the following result:

   a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19

The following would work if column a has only unique values:

x[, list(str_split(b, ' ')[[1]]), by = a]

The following almost works (unless there are some identical rows in the original data table), but is ugly when x has many columns and copies column b to the result, which I would like to avoid.

>     x[, list(str_split(b, ' ')[[1]]), by = list(a,b)]
   a     b V1
1: 1 12 13 12
2: 1 12 13 13
3: 2 14 15 14
4: 2 14 15 15
5: 3 16 17 16
6: 3 16 17 17
7: 1 18 19 18
8: 1 18 19 19

What would be the most efficient and idiomatic way to solve this problem?

Victor K.
  • 4,054
  • 3
  • 25
  • 38

7 Answers7

14

How about :

x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

x[,list(a=rep(a,each=2), V1=unlist(strsplit(b," ")))]
   a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19

Generalized solution given comment :

x[,{s=strsplit(b," ");list(a=rep(a,sapply(s,length)), V1=unlist(s))}]
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • Thanks Matthew - this works in my particular example (exactly two components in each b, separated by space) but wouldn't work in a more general case, where each b can have from 1 to 10 components. Which shows that it's hard to precisely specify your question some times :). – Victor K. Mar 28 '13 at 15:56
  • Matt, that's a perfect solution that saved a lot of time and executes quite efficiently. It shows that your DT really has to replace DF in r-base. I'll cite this in my big-data analytics class. One question, how can we make it even more efficient by running it on multi-cores parallely? I've checked htop and one core runs. – Serhat Cevikel Sep 28 '17 at 10:40
  • Hi @SerhatCevikel. Great, thanks. DT is starting to use OpenMP already. Try fwrite for example and you'll see all your cores used. Try setkey() and you'll see it use all cores, partly. Try fread in dev and it uses all cores. – Matt Dowle Sep 28 '17 at 18:48
5
x[, .(a,strsplit(b,' ')), by=1:nrow(x)]

by=nrow(x) is a simple way to force 1 row per by-group

Aaron McDaid
  • 26,501
  • 9
  • 66
  • 88
3
x[, .(a,strsplit(b,' ')), by = .I]

looks more estetic

SymbolixAU
  • 25,502
  • 4
  • 67
  • 139
AlexArgus
  • 1,429
  • 2
  • 9
  • 3
2

One option would be to add a row number

x[, r := 1:nrow(x)]

and then group by r:

x[, list(a, str_split(b, ' ')[[1]]), by = r]

I'm wondering if there are better solutions?

Victor K.
  • 4,054
  • 3
  • 25
  • 38
  • 3
    perhaps more idiomatic, you could include a call to `rownames` in `by` (or better, `keyby`): `x[, list(str_split(b, ' ')[[1]]), keyby=list(a, rownames(x))]`. – Matthew Plourde Mar 28 '13 at 08:28
  • Yes, I like it. I will gladly accept it if you post it as an answer. I'm not sure I need `keyby` though (since I only want to go through the data table once) and you don't need `a` in the key - just rownames(x) should be enough for my purpose. – Victor K. Mar 28 '13 at 15:58
2

The most effective and idiomatic approach is to have a vectorized function.

In this case, some kind of regex will do what you want

 x[, V1 := gsub(" [[:alnum:]]*", "", b)]

   a     b V1
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18

If you want to return the each split component, and you know there are two in each one, you can use Map to coerce the result of strsplit into the correct form

x[, c('b1','b2')  := do.call(Map, c(f = c, strsplit(b, ' ')))]



x
   a     b b1 b2
1: 1 12 13 12 13
2: 2 14 15 14 15
3: 3 16 17 16 17
4: 1 18 19 18 19
mnel
  • 113,303
  • 27
  • 265
  • 254
  • I probably haven't explained what I want properly. What I need is the result at the bottom of my question, but without column `b`. In my particular example, each row in the original data table should yield two rows in the result, as each value in `b` splits into two substrings. – Victor K. Mar 28 '13 at 03:31
  • @mnel though arguably a more sensible format, this doesn't accomplish OP's desired result. – Matthew Plourde Mar 28 '13 at 08:31
1

The dplyr/tidyr approach also works with data tables.

library(dplyr)
library(tidyr)
x %>% 
  separate(b, into = c("b1", "b2")) %>% 
  gather(b, "V1", b1:b2) %>%
  arrange(V1) %>%
  select(a, V1)

Or, using the standard evaluation forms:

x %>% 
  separate_("b", into = c("b1", "b2")) %>% 
  gather_("b", "V1", c("b1", "b2")) %>%
  arrange_(~ V1) %>%
  select_(~ a, ~ V1)

The case of different numbers of values in the b column is only slightly more complicated.

library(stringr)

x2 <- data.table(
  a = c(1:3, 1), 
  b = c('12 13', '14', '15 16 17', '18 19')
)

n <- max(str_count(x2$b, " ")) + 1
b_cols <- paste0("b", seq_len(n))
x2 %>% 
  separate_("b", into = b_cols, extra = "drop") %>% 
  gather_("b", "V1", b_cols) %>%
  arrange_(~ V1) %>%
  select_(~ a, ~ V1)
Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
0

Looking at input and desired output, this should work -

x <- data.frame(a=c(1,2,3,1),b=c("12 13","14 15","16 17","18 19"))
data.frame(a=rep(x$a,each=2), new_b=unlist(strsplit(as.character(x$b)," ")))
Nishanth
  • 6,932
  • 5
  • 26
  • 38