
I have the following R code. Essentially, I am asking R to arrange the dataset based on postcode and paon, then group them by id, and finally keep only the last row within each group. However, R requires more than 3 hours to do this.

I am not sure what I am doing wrong with my code since there is no for loop here.

epc2 is a data frame with 324,368 rows.

```r
epc3 <- epc2 %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  do(tail(., 1))
```

Thank you for any and all of your help.

Henrik
D M
  • `do` is slow. Try other alternatives: [Select first and last row from grouped data](https://stackoverflow.com/questions/31528981/select-first-and-last-row-from-grouped-data), [How to select the first and last row within a grouping variable in a data frame?](https://stackoverflow.com/questions/8203818/how-to-select-the-first-and-last-row-within-a-grouping-variable-in-a-data-frame) – Henrik Feb 15 '19 at 06:59
  • [last by group for all columns data.table](https://stackoverflow.com/questions/14143220/last-by-group-for-all-columns-data-table) – Henrik Feb 15 '19 at 07:04
  • The data.table approach is likely going to be the fastest, but it should already be faster if you replace your last line by `summarize_all(last)` – meriops Feb 15 '19 at 08:16
  • Thank you for your replies. `summarize_all(last)` did the trick for me – D M Feb 15 '19 at 08:25
  • A similar case is described in [Select the first row by group](https://stackoverflow.com/questions/13279582/select-the-first-row-by-group/50955051#50955051). I recommend dplyr::group_by, dplyr::filter combined with dplyr::row_number to solve issues like this – Kresten Feb 15 '19 at 10:13
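
The two dplyr alternatives suggested in the comments above can be sketched roughly as follows. The data frame here is a made-up miniature standing in for `epc2`: only the column names `id`, `postcode`, and `paon` come from the question, and all values are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for epc2; values are invented for illustration.
epc2 <- data.frame(
  id       = c(1, 1, 2, 2, 2),
  postcode = c("B1", "A1", "C3", "C1", "C2"),
  paon     = c("10", "12", "5", "7", "9"),
  stringsAsFactors = FALSE
)

# Sort once up front so that "last row" is well defined within each id.
sorted <- epc2 %>% arrange(postcode, paon)

# Option 1 (meriops's comment): last value of every column per group.
res_summarise <- sorted %>%
  group_by(id) %>%
  summarise_all(last)

# Option 2 (Kresten's comment): keep the row whose position equals
# the group size, i.e. the last row of each group.
res_filter <- sorted %>%
  group_by(id) %>%
  filter(row_number() == n()) %>%
  ungroup()
```

Both variants avoid calling `tail()` once per group through `do()`, which is the part Henrik's comment identifies as slow.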

1 Answer


How about:

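```r
library(dplyr)

mtcars %>%
  arrange(cyl) %>%    # sort first so "last" is well defined
  group_by(cyl) %>%
  slice(n())          # keep only the last row of each group
```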
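
Applied to the columns named in the question, the same pattern might look like the sketch below. The data frame is a hypothetical miniature of `epc2` with invented values; only the column names and the sort order are taken from the question.

```r
library(dplyr)

# Hypothetical miniature of epc2; the real data has 324,368 rows.
epc2 <- data.frame(
  id       = c("a", "a", "b"),
  postcode = c("N1", "E2", "W3"),
  paon     = c("4", "8", "1"),
  stringsAsFactors = FALSE
)

epc3 <- epc2 %>%
  arrange(postcode, paon) %>%  # same ordering as in the question
  group_by(id) %>%
  slice(n()) %>%               # last row of each id group
  ungroup()
```

Here `slice(n())` replaces `do(tail(., 1))`: it selects rows by position within each group instead of evaluating an arbitrary expression per group.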
davsjob