3

My goal is to get the same number of rows for each split (based on the Initials column). I am basically trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt failed completely. Does anybody have suggestions?

df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)

maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
  toadd<-maxrows-dim(x)[1]
  replicate(toadd,x<-rbind(x,rep(NA,1)))# colnames -1 because col 1 should be the same Initial
})
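(Editorial aside: the `replicate` call above cannot work, because the assignment to `x` happens inside `replicate`'s own scope and is discarded. A minimal base-R sketch of the same padding idea, for reference only, might look like this:)

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4))
maxrows <- max(table(df$Initials))
padded <- lapply(split(df, df$Initials), function(x) {
  toadd <- maxrows - nrow(x)           # rows this group is short by
  if (toadd > 0)
    x <- rbind(x, data.frame(Initials = x$Initials[1],
                             data = rep(NA, toadd)))
  x
})
do.call(rbind, padded)
# b gets one NA row; a is already at maxrows
```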

Goal:

a 2
a 3
b 4
b NA
Rilcon42
    `attach` is generally bad practice. Maybe try `with`. – Frank Oct 09 '15 at 19:03
  • Can you expand on that some more? Why is attach bad practice? – Rilcon42 Oct 09 '15 at 20:03
  • Google led me to these Q&As on the topic: http://stackoverflow.com/questions/10067680/why-is-it-not-advisable-to-use-attach-in-r-and-what-should-i-use-instead and http://stackoverflow.com/q/1310247/1191259 – Frank Oct 09 '15 at 20:15

3 Answers

5

Using data.table...

my_rows <- seq.int(max(tabulate(df$Initials)))

library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]

#    Initials data
# 1:        a    2
# 2:        a    3
# 3:        b    4
# 4:        b   NA

.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame, which requires an additional comma: DF[row_numbers, ].
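The comma difference can be seen in a small sketch (using the question's data):

```r
library(data.table)
DT <- data.table(Initials = c("a", "a", "b"), data = c(2, 3, 4))
DF <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4))

DT[1:2]    # data.table: rows 1 and 2, no comma needed
DF[1:2, ]  # data.frame: the trailing comma is required for rows
DF[1:2]    # without it, a data.frame selects *columns* 1 and 2 instead
```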

The analogue in dplyr is

my_rows <- seq.int(max(tabulate(df$Initials)))

library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)

#   Initials  data
#     (fctr) (dbl)
# 1        a     2
# 2        a     3
# 3        b     4
# 4        b    NA

Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.

Frank
    That's a good one. I was thinking a similar approach, you beat me :-) – akrun Oct 09 '15 at 19:18
    She's a beaut, Clark! – Rich Scriven Oct 09 '15 at 19:27
  • I just tried your dplyr code on my dataset (sample here: http://pastebin.com/LEzxvK9L) but it gave me some really odd results. I got what looks like one row of NA for each person. Any idea why? – Rilcon42 Oct 09 '15 at 20:37
  • @Rilcon42 You need to recreate my `my_rows` now that you're using your full data. It should be 1..48, not 1..2. Anyway, like I said, I think the dplyr code probably won't work in six months; the authors of that package don't seem to be fans of usage like this. – Frank Oct 09 '15 at 20:41
    That worked beautifully. Why do you think the dplyr package authors won't like this? (I know very little about how the package is designed) – Rilcon42 Oct 09 '15 at 20:46
  • The outside slice is very weird: you are slicing outside the range of the data, and as it's grouped data the Initials column gets filled in with "b". – jeremycg Oct 09 '15 at 20:47
  • @Rilcon42 I think the goal is to give results that are intuitive to folks coming from SQL or elsewhere, so `slice` should just be a nice complement to filter, `slice(my_rows)` <=> `filter(row_number() %in% my_rows)` even though this goes against normal R subsetting (which gives `NA` outside of range). I could be wrong; I'm sure they'll respond to the report I made and clarify whatever the case is. – Frank Oct 09 '15 at 20:55
  • @jeremycg If you look at `getAnywhere(\`slice_.data.table\`)`, then maybe it makes sense. Looks like it translates to `.SD[my_rows]`, which may or may not be intuitive, depending on what direction you're coming from. – Frank Oct 09 '15 at 20:58
  • It's a nice trick to get some empty cells inside a group if they leave it in; if we start using it, it'll have to stay. – jeremycg Oct 09 '15 at 21:05
    @MichaelChirico I think hadley just closed it because it fits better under dtplyr, where it's still open https://github.com/hadley/dtplyr/issues/10 ? – Frank Jan 18 '18 at 10:03
4

Here's a dplyr/tidyr method. We group_by Initials, add row numbers, ungroup, complete the row-number/Initials combinations, then remove our row-number column:

library(dplyr)
library(tidyr)

df %>% group_by(Initials) %>%
       mutate(row = row_number()) %>%
       ungroup() %>%
       complete(Initials, row) %>%
       select(-row)

Source: local data frame [4 x 2]

  Initials  data
    (fctr) (dbl)
1        a     2
2        a     3
3        b     4
4        b    NA
jeremycg
  • That worked perfectly. Could you explain how you are getting the max number of rows to add to each person? It looks like you are getting them from the row numbers variable somehow, but I'm not sure. – Rilcon42 Oct 09 '15 at 20:42
  • It creates a `row` number for every `Initials` group. Then `complete` makes every combination of `row` and `Initials`; this is the part you want. See `?complete` – jeremycg Oct 09 '15 at 20:52
3

Interesting problem. Try:

to.add <- max(table(df$Initials)) - table(df$Initials)
newdf <- rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
newdf
#  Initials data
#1        a    2
#2        a    3
#3        b    4
#4        b <NA>

We calculate how many extra rows each initial needs, combine those extra initials with NA values, then rbind them to the data frame.

max(table(df$Initials)) finds the count of the most frequent initial, in this case 2 (for "a"). Subtracting each initial's count, table(df$Initials), from that maximum gives a vector of the necessary additions. There's an added bonus to this method: because we use table, we automatically get a named vector.

We use the names of that vector to know 1) which initials to repeat, and 2) how many times they should be repeated.
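With the sample data, the intermediate objects look like this (a small sketch repeating the same computation):

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4))
tab <- table(df$Initials)
tab                          # a b
                             # 2 1
to.add <- max(tab) - tab     # a b
                             # 0 1  <- named vector: pad "a" zero times, "b" once
rep(names(to.add), to.add)   # "b"  <- the Initials values to append
```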

To preserve the class of the data, you can add newdf$data <- as.numeric(newdf$data).

Pierre L