Is there a way in R to write a function that splits a data frame into subsets (for example, by date) and stores each subset as its own data frame? For example, I have 30 days' worth of data, and I want to break it down by day and output each day as a separate data frame. I can't figure out how to do this in a function. Any clues?

Example: Dataframe: df_of_month

Output desired via a loop function of sorts:

df_of_month_day1
df_of_month_day2
df_of_month_day3
df_of_month_day4
df_of_month_day5
df_of_month_day6

etc. I've been trying multiple ways and none of them is working.

smci
noxwei
  • Strongly recommend that you don't do this. Instead look into `group_by` in `dplyr` or at least store these dataframes as elements of a list. – Calum You Aug 02 '18 at 21:20
  • Maybe you're looking for something like `split()`? The result of splitting a data.frame would be a list of data.frames. – aosmith Aug 02 '18 at 21:23
  • You can easily do this with `lapply/sapply(...)` and the `assign()` function to construct the variable names; see existing questions on `assign()`. But it's worse than simply creating an array `df_of_month_day[1:6]`. Unless you can articulate a compelling reason *why* you need to do this, this question should be closed as unclear. – smci Aug 02 '18 at 22:14
  • Thanks!! I'm new at this and this thread has been super helpful. I'm coming from a C++ background and don't really know how to approach this field yet. – noxwei Aug 03 '18 at 19:08

1 Answer

To give you an answer to your question, you would achieve this with lapply. For instance, consider the following:

Create some sample data:

df <- data.frame(Day = rep(seq.Date(from = as.Date('2010-01-01'), to = as.Date('2010-01-30'), by =1), 5))
df$somevar <- rnorm(nrow(df))
head(df)
         Day      somevar
1 2010-01-01 -0.946059466
2 2010-01-02  0.005897001
3 2010-01-03 -0.297566286
4 2010-01-04 -0.637562495
5 2010-01-05 -0.549800912
6 2010-01-06  0.287709994

Now, observe that unique can give you a vector with all unique dates:

unique(df$Day)
 [1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
[11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15" "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
[21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25" "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"

You can pass this vector to lapply and use each date to subset the data frame:

lapply(unique(df$Day), function(x) df[df[,"Day"]==x,])
[[1]]
           Day    somevar
1   2010-01-01 -0.9460595
31  2010-01-01 -0.3434005
61  2010-01-01 -1.5463641
91  2010-01-01 -0.5192375
121 2010-01-01 -1.1780619

[[2]]
           Day      somevar
2   2010-01-02  0.005897001
32  2010-01-02 -1.346336688
62  2010-01-02 -0.321702391
92  2010-01-02 -0.384277955
122 2010-01-02  0.058906305

... (output omitted)

where the output of lapply is a list with the corresponding dataframes.
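
As suggested in the comments, base R's split() does the same job in a single call. The following is only a minimal sketch: the seed is arbitrary and by_day is a name made up for this illustration.

```r
# Sample data shaped like the example above: 30 days, 5 rows per day.
set.seed(1)  # arbitrary seed, only so that the sketch is reproducible
df <- data.frame(Day = rep(seq.Date(from = as.Date("2010-01-01"),
                                    to = as.Date("2010-01-30"), by = 1), 5))
df$somevar <- rnorm(nrow(df))

# split() returns a named list with one data frame per unique day.
by_day <- split(df, df$Day)

length(by_day)          # number of list elements, one per day
names(by_day)[1:3]      # the names are the dates as character strings
by_day[["2010-01-02"]]  # one day's rows, looked up by name
```

Because the list is named by date, a single day can be looked up directly instead of being stored in a separate variable.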

You would normally assign this to a name to capture all dataframes in a list, as in mylist <- lapply(...). However, if you want to have them in your global environment, you can first give each dataframe a name, for instance using setNames as in mylist <- setNames(mylist, paste0("df", format(unique(df$Day), format = "%Y%m%d"))), and then use list2env(mylist, envir = .GlobalEnv) to push each list element into the global environment. Note that setNames returns a new list, so its result has to be assigned back, and that list2env copies into a freshly created environment unless envir is given.
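
Put together, the naming step and list2env could look roughly like this. It is a sketch under assumptions: day_env is a hypothetical name, and a separate environment is used here instead of the global one only to keep the example from cluttering the workspace.

```r
# Sample data shaped like the example above.
set.seed(1)  # arbitrary seed, only so that the sketch is reproducible
df <- data.frame(Day = rep(seq.Date(from = as.Date("2010-01-01"),
                                    to = as.Date("2010-01-30"), by = 1), 5))
df$somevar <- rnorm(nrow(df))

days <- unique(df$Day)
mylist <- lapply(days, function(x) df[df$Day == x, ])

# setNames() returns a new list, so the result is assigned back to mylist.
mylist <- setNames(mylist, paste0("df", format(days, format = "%Y%m%d")))

# Copy every list element into an environment as its own variable.
# day_env is a fresh environment (an assumption made for this sketch);
# pass envir = .GlobalEnv instead to create the variables in the workspace.
day_env <- new.env()
invisible(list2env(mylist, envir = day_env))

ls(day_env)[1:3]          # names of the first few variables that were created
head(day_env$df20100101)  # the data frame for the first day
```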

However, as mentioned in the comments, this is probably not a good idea. If you want to do something to each date, consider the group-by solution with dplyr. For instance, imagine you want to get the mean by date:

library(dplyr)
df %>% group_by(Day) %>% summarize(mean_var = mean(somevar))
# A tibble: 30 x 2
   Day        mean_var
   <date>        <dbl>
 1 2010-01-01  -0.907 
 2 2010-01-02  -0.398 
 3 2010-01-03   0.213 
 4 2010-01-04  -0.142 
 5 2010-01-05  -0.377 
 6 2010-01-06   0.404 
 7 2010-01-07  -0.634 
 8 2010-01-08   1.00  
 9 2010-01-09   0.378 
10 2010-01-10  -0.0863
# ... with 20 more rows

where each row holds the mean of one group. This pattern is called split-apply-combine and is worth googling. It will come up again and again.
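
To make the three steps visible, here is a rough base-R sketch that does by hand what the dplyr pipeline does in one go. The names pieces, means and result are made up for this illustration.

```r
# Sample data shaped like the example above.
set.seed(1)  # arbitrary seed, only so that the sketch is reproducible
df <- data.frame(Day = rep(seq.Date(from = as.Date("2010-01-01"),
                                    to = as.Date("2010-01-30"), by = 1), 5))
df$somevar <- rnorm(nrow(df))

# Split: one numeric vector of somevar values per day.
pieces <- split(df$somevar, df$Day)

# Apply: compute the mean of each piece.
means <- vapply(pieces, mean, numeric(1))

# Combine: collect the per-day results in a single data frame.
result <- data.frame(Day = as.Date(names(means)), mean_var = unname(means))
head(result)
```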

Just for reference, in base R, you could achieve this using e.g. by, as in

by(df$somevar, df$Day, FUN = mean)

though either dplyr or data.table is probably more user-friendly.
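
If you stay in base R and would rather get a plain data frame back than a "by" object, aggregate() is one option, and tapply() returns the same numbers as a named vector. A small sketch, with daily_means as a made-up name:

```r
# Sample data shaped like the example above.
set.seed(1)  # arbitrary seed, only so that the sketch is reproducible
df <- data.frame(Day = rep(seq.Date(from = as.Date("2010-01-01"),
                                    to = as.Date("2010-01-30"), by = 1), 5))
df$somevar <- rnorm(nrow(df))

# aggregate() with a formula: mean of somevar for each value of Day,
# returned as a data frame with one row per day.
daily_means <- aggregate(somevar ~ Day, data = df, FUN = mean)
head(daily_means)

# tapply() computes the same group means, named by date.
tapply(df$somevar, df$Day, mean)[1:3]
```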

coffeinjunky
  • Thank you so much caffeinjunky!! I'm looking up split-apply-combine right now, and it's useful! I am still trying to get a grasp of how to approach this workflow; for example, terms like split-apply-combine would never have meant anything to me until now. But I found a good resource for it!! http://stat545.com/block024_group-nest-split-map.html – noxwei Aug 03 '18 at 20:51
  • Yes, that is a good source. I have actually written many answers pushing for a dataframe-based group_by approach. See e.g. https://stackoverflow.com/questions/36315163/dplyr-count-number-of-one-specific-value-of-variable/36315247#36315247, https://stackoverflow.com/questions/45828581/using-conditions-in-group-by-summarize-loop/45829235#45829235, https://stackoverflow.com/questions/37395059/running-several-linear-regressions-from-a-single-dataframe-in-r/37401209#37401209 For me this is one of the most useful things of `dplyr`. – coffeinjunky Aug 03 '18 at 20:57