I need some help creating a dataset in R where each observation contains a latitude, longitude, and date. Right now, I have roughly 2,000 files gridded by lat/long, and each file contains observations for one date. Ultimately, I need to combine all of these files into one file where each observation contains a date variable pulled from the name of its source file.

So for instance, a file is named "MERRA2_400.tavg1_2d_flx_Nx.20120217.SUB.nc". I want all observations from that file to contain a date variable for 02/17/2012.

That "nc" extension describes a netCDF file, which can be read into R as follows:

library(RNetCDF)
setwd("~/Desktop/Thesis Data")
p1a<-"MERRA2_300.tavg1_2d_flx_Nx.20050101.SUB.nc"
pid<-open.nc(p1a)
dat<-read.nc(pid)

I know the ldply command can be useful for extracting and designating a new variable from the file name. But I need to create a loop that combines all the files in the 'Thesis Data' folder above (set as my wd) and gives them date variables in the process.

I have been attempting this using two separate loops. The first loop reads the files in one by one, creates a date variable from the file name, and then saves each one into a new folder. The second loop concatenates all files in that new folder. I have had little luck with this strategy.

[Screenshot: View(dat) output for the file read in above]

As you can hopefully see in this picture, which describes the data file uploaded above, each file contains a time variable, but that time variable has one observation, which is 690, in each file. So I could replace that variable with the date within the file name, or I could create a new variable - either works.

Any help would be much appreciated!

bricevk

1 Answer

I do not have any experience working with .nc files, but what I think you need to do, in broad strokes, is this:

filenames <- list.files(path = ".", pattern = "\\.nc$") # Character vector of all .nc file names in the working directory

Creating empty dataframe with column names:

final_data <- data.frame(matrix(ncol = ..., nrow = 0)) # enter number of columns you will have in the final dataset
colnames(final_data) <- c("...", "...", "...", ...) # create column names

For each filename, read in the file, create a date column, and append the result to final_data:

for (i in filenames) {
  pid<-open.nc(i)
  dat<-read.nc(pid) 

  date <- ... # use regex to get your date from i and convert it into date

  dat$date <- date

  final_data <- rbind(final_data, dat)
}
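
For the `date <- ...` placeholder above, here is a minimal sketch of one way to pull the date out of a file name. It assumes every file name contains exactly one eight-digit YYYYMMDD block set off by dots, as in the example from the question; `fname`, `date_str`, and `file_date` are just illustrative names.

```r
# Example file name taken from the question
fname <- "MERRA2_400.tavg1_2d_flx_Nx.20120217.SUB.nc"

# Keep only the eight digits that sit between two dots
# (assumption: each file name has exactly one such block)
date_str <- sub(".*\\.([0-9]{8})\\..*", "\\1", fname)

# Convert the YYYYMMDD string to a Date object
file_date <- as.Date(date_str, format = "%Y%m%d")

# Display it in the month/day/year form used in the question
format(file_date, "%m/%d/%Y")
```

Both `sub()` and `as.Date()` are vectorized, so the same two lines should also work on the whole `filenames` vector at once, provided all names follow this pattern.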
denisafonin
  • Try with "date <- i" at first. It will create a column with the file name. If you like the output, then spend some time figuring out the regex. I am quite terrible at it – denisafonin Apr 06 '20 at 14:27
  • Hey, thanks so much for the help! So one small tweak I think is necessary is changing "read.nc(i)" to "read.nc(pid)", but that may be beside the point. --- I am a little confused by the last line of the loop. I am not sure if the datasets are properly being combined into one dataset. Is there a way I can check that? If I do "list(dat$date)" I only get one of the file names. I am not sure if the files are concatenating such that the date variable is always the observation's original file name. Other than that, the loop worked well. I see what you're doing and it's very helpful – bricevk Apr 06 '20 at 15:53
  • Yep, read.nc(pid). And to add them all into one data frame, I first create an empty data frame with column names (note that the column names have to be the same as in the individual files, and have to include the one you create in the loop, 'date'). Then I amended the last line to bind rows to this empty data frame at each iteration. See my updated version – denisafonin Apr 06 '20 at 16:12
  • That makes a lot of sense, thank you! One step closer, but I've run into another small snag. I got the error message: `Error in rbind(deparse.level, ...) : invalid list argument: all variables should have the same length` --- This problem probably has to do with my data, but some guidance would still be helpful if you have any. I think the issue is that each data file has a different number of observations of each variable, so it has a hard time combining each as a row. Refer to the screenshot in the original post. 22 values for longitude, 19 for latitude, thus 19*22 for temp & precip – bricevk Apr 06 '20 at 17:57
  • Yes, that's an issue. I think you should fill in empty cells with NAs to make sure each column has the same number of rows. Maybe try rowr::cbind.fill(final_data, dat, fill = NA) instead of rbind(...) - https://stackoverflow.com/questions/44180030/how-to-append-a-column-with-different-row-count-into-a-data-frame-in-r – denisafonin Apr 07 '20 at 10:36
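
Following up on the rbind error discussed in the comments: an alternative to padding with NAs might be to reshape each file into one row per lon/lat cell before binding. The sketch below is not tested against the real MERRA-2 files. It uses a made-up stand-in for the list that read.nc() returns, and the element names (`lon`, `lat`, `time`, `temp`), the coordinate values, and the lon x lat layout of the gridded variable are all assumptions; `grid_to_long` is a hypothetical helper.

```r
# Stand-in for the list returned by read.nc(); names, values, and sizes
# are assumptions based on the 22 x 19 grid described in the comments
dat <- list(
  lon  = seq(-100, -87, length.out = 22),
  lat  = seq(30, 39, length.out = 19),
  time = 690,
  temp = matrix(rnorm(22 * 19), nrow = 22, ncol = 19)  # assumed lon x lat
)

# Hypothetical helper: flatten one file's grid into one row per cell
grid_to_long <- function(dat, file_date, vars) {
  # expand.grid() varies its first argument fastest, which matches the
  # column-major order as.vector() uses for a lon x lat matrix
  out <- expand.grid(lon = dat$lon, lat = dat$lat)
  for (v in vars) {
    out[[v]] <- as.vector(dat[[v]])
  }
  out$date <- file_date  # the single date is recycled to every row
  out
}

long <- grid_to_long(dat, as.Date("2012-02-17"), vars = "temp")
```

Each file then yields a data frame with the same columns (here 22 * 19 = 418 rows), so the per-file results could be collected in a list and stacked with do.call(rbind, ...) without the "same length" error. If the real variables come back with a different dimension order, the expand.grid() arguments would need to be swapped to match.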