I have a set of survey design objects, one per year/quarter, saved in RDS format on my disk. The data look like this:

Year  Quarter  Age
2010     1     27
2010     1     32 
2010     1     34
...

I'm using `svymean(formula=~Age, na.rm = T, design = data20101)` to estimate the mean of the Age variable for each year/quarter file. I would like to do this more efficiently by running the function in a loop and saving the results in a single data frame.

The output I'm looking for is a data frame like this:

Year  Quarter  Mean_Age
2010     1       31.1
2010     1       32.4 
2010     1       30.9
2010     1       34.5
2010     2       36.3
2010     2       31.2
2010     2       30.8
2010     2       35.6
...

Regards,

ph_9933
  • Instead of having individual yet identically-structured frames on which you do the same processing, it is usually better to keep them as a [list of frames](https://stackoverflow.com/a/24376207/3358227). With a list of frames, one could possibly do `newlist <- lapply(list_of_frames, function(F) svymean(formula=~Age, na.rm=TRUE, design=F))`. – r2evans Nov 12 '21 at 19:56
  • In this case, the data is too large to use a list of frames (about 1 GB each) – ph_9933 Nov 12 '21 at 20:01
  • I don't see the problem yet: if you can store the *objects* in R, then you can store a `list` of said objects in R (though I understand that shifting from one approach to the other may result in duplicates in memory, something you likely cannot easily afford atm). The preferred (even canonical) approach of using lists-of-frames applies to reading in the data in the first place as well; there are many related questions where the answer includes some variation of `list_of_frames <- lapply(filenames, read.csv)`, so it starts in the right format. – r2evans Nov 12 '21 at 20:09

1 Answer

I don't have enough rep to comment. r2evans is making good suggestions about how you might read in your large data. You will need to list the data in some way in order to iterate over it. The method below iterates over the file names, assuming your data files are all in one directory by themselves. It also holds no more than one dataset in memory at a time, which is ideal if the only thing you want is the grouped mean ages (and not ideal if you are running other analyses as well). I'm not sure which part of your question was most pressing, but below is a general model of how to approach the problem.

library(dplyr)

# Empty frame that accumulates one summary row per Year/Quarter
output <- data.frame(Year = numeric(),
                     Quarter = numeric(),
                     Mean_Age = numeric())
filepath <- "./filpath_to_data/"
files_list <- list.files(filepath)
for (i in seq_along(files_list)) {
  # Read one file at a time, summarise it, and append the result to `output`
  output <- read.csv(paste0(filepath, files_list[i])) %>%
    group_by(Year, Quarter) %>%
    summarise(Mean_Age = mean(Age), .groups = "drop") %>%
    bind_rows(output, .)
}
output
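
The loop above reads CSV files and takes an unweighted `mean(Age)`. Since the question mentions survey design objects stored as RDS files and estimated with `svymean`, a survey-weighted variant might look like the sketch below. It is only a sketch under assumptions: that each `.rds` file holds one design object from the `survey` package, that each design covers a single year/quarter, and that its `variables` data frame contains `Year`, `Quarter` and `Age`. The helper names `mean_age_one` and `collect_mean_age` are made up for illustration.

```r
library(survey)

# Hypothetical helper: estimate the weighted mean age for one stored design.
mean_age_one <- function(path) {
  design <- readRDS(path)  # only this one design is loaded at a time
  est <- svymean(~Age, design = design, na.rm = TRUE)
  data.frame(
    # Assumption: one year/quarter per file, so the first row identifies it
    Year     = design$variables$Year[1],
    Quarter  = design$variables$Quarter[1],
    Mean_Age = as.numeric(coef(est)),
    SE       = as.numeric(SE(est))
  )
}

# Hypothetical helper: apply mean_age_one() to every .rds file in a directory
# and stack the one-row results into a single data frame.
collect_mean_age <- function(rds_dir) {
  rds_files <- list.files(rds_dir, pattern = "\\.rds$",
                          full.names = TRUE, ignore.case = TRUE)
  do.call(rbind, lapply(rds_files, mean_age_one))
}
```

It could then be called as `results <- collect_mean_age("./filpath_to_data/")`. Because each design goes out of scope when `mean_age_one()` returns, only one large object needs to be in memory at any time.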
Chad S
  • Why are you using a super-assign? And why aren't you putting everything in one pipe? (I'm just curious...) – Martin Gal Nov 12 '21 at 20:19
  • Hm, I haven't used R in a while so I may have recalled incorrectly. Will `output` save the added rows from each iteration with a normal assign? I can update – Chad S Nov 12 '21 at 20:21
  • If you don't put it into an extra environment (like inside a function), a normal assign should be sufficient. – Martin Gal Nov 12 '21 at 20:23
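
The point about normal assignment versus super-assignment can be illustrated with a small, self-contained sketch. The variable `total` and the functions `add_local` and `add_global` are invented for this example and are not part of the answer's code.

```r
# At top level, a plain `<-` inside a for loop updates the existing variable,
# because a for loop does not create a new environment.
total <- 0
for (x in 1:3) {
  total <- total + x
}
total  # 6

# Inside a function, `<-` creates a local variable; the global is untouched.
add_local <- function() {
  total <- total + 100
  invisible(NULL)
}
add_local()
total  # still 6

# `<<-` assigns in the enclosing environment, so the global value changes.
add_global <- function() {
  total <<- total + 100
  invisible(NULL)
}
add_global()
total  # 106
```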