Applying a function to each row of a data.table

Question

I looking for a way to efficiently apply a function to each row of data.table. Let's consider the following data table:

library(data.table)
library(stringr)

x <- data.table(a = c(1:3, 1), b = c('12 13', '14 15', '16 17', '18 19'))
> x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

Let's say I want to split each element of column b by space (thus yielding two rows for each row in the original data) and join the resulting data tables. For the example above, I need the following result:

The following would work if column a has only unique values:

x[, list(str_split(b, ' ')[[1]]), by = a]

The following almost works (unless there are some identical rows in the original data table), but is ugly when x has many columns and copies column b to the result, which I would like to avoid.

>     x[, list(str_split(b, ' ')[[1]]), by = list(a,b)]
   a     b V1
1: 1 12 13 12
2: 1 12 13 13
3: 2 14 15 14
4: 2 14 15 15
5: 3 16 17 16
6: 3 16 17 17
7: 1 18 19 18
8: 1 18 19 19

What would be the most efficient and idiomatic way to solve this problem?

Matt Dowle · Accepted Answer · 2013-03-28T16:03:33.680

14

How about :

x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

x[,list(a=rep(a,each=2), V1=unlist(strsplit(b," ")))]
   a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19

Generalized solution given comment :

x[,{s=strsplit(b," ");list(a=rep(a,sapply(s,length)), V1=unlist(s))}]

edited Mar 28 '13 at 16:03

answered Mar 28 '13 at 13:51

Matt Dowle

58,872
22
166
224

Thanks Matthew - this works in my particular example (exactly two components in each b, separated by space) but wouldn't work in a more general case, where each b can have from 1 to 10 components. Which shows that it's hard to precisely specify your question some times :). – Victor K. Mar 28 '13 at 15:56
Matt, that's a perfect solution that saved a lot of time and executes quite efficiently. It shows that your DT really has to replace DF in r-base. I'll cite this in my big-data analytics class. One question, how can we make it even more efficient by running it on multi-cores parallely? I've checked htop and one core runs. – Serhat Cevikel Sep 28 '17 at 10:40
Hi @SerhatCevikel. Great, thanks. DT is starting to use OpenMP already. Try fwrite for example and you'll see all your cores used. Try setkey() and you'll see it use all cores, partly. Try fread in dev and it uses all cores. – Matt Dowle Sep 28 '17 at 18:48

score 5 · Answer 2 · answered Sep 11 '16 at 19:59

5

x[, .(a,strsplit(b,' ')), by=1:nrow(x)]

by=nrow(x) is a simple way to force 1 row per by-group

answered Sep 11 '16 at 19:59

Aaron McDaid

26,501
9
66
88

score 3 · Answer 3 · edited Apr 03 '17 at 23:35

3

x[, .(a,strsplit(b,' ')), by = .I]

looks more estetic

edited Apr 03 '17 at 23:35

SymbolixAU

25,502
4
67
139

answered Oct 30 '16 at 03:29

AlexArgus

1,429
2
9
3

Victor K. · Answer 4 · 2013-03-28T03:37:05.450

2

One option would be to add a row number

x[, r := 1:nrow(x)]

and then group by r:

x[, list(a, str_split(b, ' ')[[1]]), by = r]

I'm wondering if there are better solutions?

edited Mar 28 '13 at 03:37

answered Mar 28 '13 at 03:28

Victor K.

4,054
3
25
38

3

perhaps more idiomatic, you could include a call to `rownames` in `by` (or better, `keyby`): `x[, list(str_split(b, ' ')[[1]]), keyby=list(a, rownames(x))]`. – Matthew Plourde Mar 28 '13 at 08:28
Yes, I like it. I will gladly accept it if you post it as an answer. I'm not sure I need `keyby` though (since I only want to go through the data table once) and you don't need `a` in the key - just rownames(x) should be enough for my purpose. – Victor K. Mar 28 '13 at 15:58

mnel · Answer 5 · 2013-03-28T03:42:33.590

2

The most effective and idiomatic approach is to have a vectorized function.

In this case, some kind of regex will do what you want

 x[, V1 := gsub(" [[:alnum:]]*", "", b)]

   a     b V1
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18

If you want to return the each split component, and you know there are two in each one, you can use Map to coerce the result of strsplit into the correct form

x[, c('b1','b2')  := do.call(Map, c(f = c, strsplit(b, ' ')))]



x
   a     b b1 b2
1: 1 12 13 12 13
2: 2 14 15 14 15
3: 3 16 17 16 17
4: 1 18 19 18 19

edited Mar 28 '13 at 03:42

answered Mar 28 '13 at 03:28

mnel

113,303
27
265
254

I probably haven't explained what I want properly. What I need is the result at the bottom of my question, but without column `b`. In my particular example, each row in the original data table should yield two rows in the result, as each value in `b` splits into two substrings. – Victor K. Mar 28 '13 at 03:31
@mnel though arguably a more sensible format, this doesn't accomplish OP's desired result. – Matthew Plourde Mar 28 '13 at 08:31

score 1 · Answer 6 · answered May 06 '15 at 10:13

The dplyr/tidyr approach also works with data tables.

library(dplyr)
library(tidyr)
x %>% 
  separate(b, into = c("b1", "b2")) %>% 
  gather(b, "V1", b1:b2) %>%
  arrange(V1) %>%
  select(a, V1)

Or, using the standard evaluation forms:

x %>% 
  separate_("b", into = c("b1", "b2")) %>% 
  gather_("b", "V1", c("b1", "b2")) %>%
  arrange_(~ V1) %>%
  select_(~ a, ~ V1)

The case of different numbers of values in the b column is only slightly more complicated.

library(stringr)

x2 <- data.table(
  a = c(1:3, 1), 
  b = c('12 13', '14', '15 16 17', '18 19')
)

n <- max(str_count(x2$b, " ")) + 1
b_cols <- paste0("b", seq_len(n))
x2 %>% 
  separate_("b", into = b_cols, extra = "drop") %>% 
  gather_("b", "V1", b_cols) %>%
  arrange_(~ V1) %>%
  select_(~ a, ~ V1)

Nishanth · Answer 7 · 2013-03-29T08:55:32.020

0

Looking at input and desired output, this should work -

x <- data.frame(a=c(1,2,3,1),b=c("12 13","14 15","16 17","18 19"))
data.frame(a=rep(x$a,each=2), new_b=unlist(strsplit(as.character(x$b)," ")))

edited Mar 29 '13 at 08:55

answered Mar 28 '13 at 04:07

Nishanth

6,932
5
26
38

That doesn't return the output requested (notice ordering of `a`). – Matt Dowle Mar 28 '13 at 13:47
Oh ok. Thats easy to fix :) – Nishanth Mar 29 '13 at 08:54
and to generalize the result replace each=2 by - each = length(unlist(strsplit(as.character(x$b)))) – Nishanth Mar 29 '13 at 08:57

Applying a function to each row of a data.table

7 Answers7

Linked

Related