-1

I am doing a statistic analysis in a big data frame (more than 48.000.000 rows) in r. Here is an exemple of the data:

structure(list(herd = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), cows = c(1, 2, 
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14, 15, 16), `date` = c("11/03/2013", 
"12/03/2013", "13/03/2013", "14/03/2013", "15/03/2013", "16/03/2013", 
"13/05/2012", "14/05/2012", "15/05/2012", "16/05/2012", "17/05/2012", 
"18/05/2012", "10/07/2016", "11/07/2016", "12/07/2016", "13/07/2016", 
"11/03/2013", "12/03/2013", "13/03/2013", "14/03/2013", "15/03/2013", 
"16/03/2013", "13/05/2012", "14/05/2012", "15/05/2012", "16/05/2012", 
"17/05/2012", "18/05/2012", "10/07/2016", "11/07/2016", "12/07/2016", 
"13/07/2016", "11/03/2013", "12/03/2013", "13/03/2013", "14/03/2013", 
"15/03/2013", "16/03/2013", "13/05/2012", "14/05/2012", "15/05/2012", 
"16/05/2012", "17/05/2012", "18/05/2012", "10/07/2016", "11/07/2016", 
"12/07/2016", "13/07/2016"), glicose = c(240666, 23457789, 45688688, 
679, 76564, 6574553, 78654, 546432, 76455643, 6876, 7645432, 
876875, 98654, 453437, 98676, 9887554, 76543, 9775643, 986545, 
240666, 23457789, 45688688, 679, 76564, 6574553, 78654, 546432, 
76455643, 6876, 7645432, 876875, 98654, 453437, 98676, 9887554, 
76543, 9775643, 986545, 240666, 23457789, 45688688, 679, 76564, 
6574553, 78654, 546432, 76455643, 6876)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -48L))

I need to identify how many cows are in the following category of glicose by herd and by date:

<=100000
100000 and <=150000
150000 and <=200000
200000 and <=250000
250000 and <=400000
>400000 

I tried to use the functions filter() and select() but could not categorize the variable like that.

I tried either to make a vector for each category but it did not work:

ht <- df %>% group_by(herd, date) %>%
         filter(glicose < 100000)

Actually I do not have a clue of how I could do this. Please help!

I expect to get the number of cows in each category of each herd based on each date in a table like this:

enter image description here

  • 3
    Use `cut` to add a column with the groupings you want, 100000 to 150000, 150000 to 200000, etc. [See this FAQ for examples](https://stackoverflow.com/q/5570293/903061). You can then `group_by` herd, date, and your new column and get a count with `... %>% count()` – Gregor Thomas Aug 12 '19 at 01:01
  • It is diferente camille, because I have two conditions to apply. – Ana Paula Franzoni Aug 12 '19 at 02:13
  • What two conditions? Create a binned variable with `cut`, then group by date and that bin – camille Aug 12 '19 at 02:28
  • The question you suggested just want to count in a column. I want to count based in the herd and date (one condition) and the different interval (other condition). Sorry, but I don`t know how to create a binned variable with cut, then group by date and that bin, could you please show me? – Ana Paula Franzoni Aug 12 '19 at 02:43

1 Answers1

1

Calling your data df,

df %>%
  mutate(glicose_group = cut(glicose, breaks = c(0, seq(1e5, 2.5e5, by = 0.5e5), 4e5, Inf)),
         date = as.Date(date, format = "%d/%m/%Y")) %>%
  group_by(herd, date, glicose_group) %>%
   count
# # A tibble: 48 x 4
# # Groups:   herd, date, glicose_group [48]
#     herd date       glicose_group       n
#    <dbl> <date>     <fct>           <int>
#  1     1 2012-05-13 (0,1e+05]           1
#  2     1 2012-05-14 (4e+05,Inf]         1
#  3     1 2012-05-15 (4e+05,Inf]         1
#  4     1 2012-05-16 (0,1e+05]           1
#  5     1 2012-05-17 (4e+05,Inf]         1
#  6     1 2012-05-18 (4e+05,Inf]         1
#  7     1 2013-03-11 (2e+05,2.5e+05]     1
#  8     1 2013-03-12 (4e+05,Inf]         1
#  9     1 2013-03-13 (4e+05,Inf]         1
# 10     1 2013-03-14 (0,1e+05]           1
# # ... with 38 more rows

I also threw in a conversion to Date class, which is probably a good idea.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • thank you very much! It worked, but I got this warning: ``` 1: Factor `glicose_group` contains implicit NA, consider using `forcats::fct_explicit_na` 2: Factor `glicose_group` contains implicit NA, consider using `forcats::fct_explicit_na` ``` Another question, is there a way to create a data.frame with this results in a table like this: herd glicose_group 2012-05-13 2012-05-14 1 <=100000 1 2 100000 and <=150000 3 4 ... – Ana Paula Franzoni Aug 12 '19 at 02:01
  • Sorry, I could not keep the format of the table I meant in the previous coment. I will put it in the question. If this is not allowed I am sorry. – Ana Paula Franzoni Aug 12 '19 at 02:10
  • Another doubt, how can I keep the numbers like this 100000, 1500000, 200000 ... and not like this 1e+05 in the result? Thanks – Ana Paula Franzoni Aug 12 '19 at 02:45
  • You can use the `labels` argument of `cut` to specify the labels however you want. See the help page `?cut` for details. To turn the dates into columns, look at the [FAQ on transforming data from long to wide](https://stackoverflow.com/q/5890584/903061). One option would be to add to the end of my code `... %>% reshape2::dcast(herd + glicose_group ~ date, value.var = "n")` – Gregor Thomas Aug 12 '19 at 13:18
  • Thank you very much Gregor! You helped me a lot! – Ana Paula Franzoni Aug 12 '19 at 15:51