
I have a huge database from a telescope at the institute where I'm currently working. The telescope saves one file per day, recording values for each of the 8 channels it measures every 10 seconds. Every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for a single day.

Also, the database has measurement mistakes, missing data, repeated values, etc.

File extensions are .sn1 for days saved in a single file, and .sn1, .sn2, .sn3, ... for days saved in multiple files. All the files have the same number of rows and variables. Besides that, there are 2 database formats: one has a sort of header that uses the first 5 lines of the file, and the other doesn't.

Every month has its own folder containing its days, and these folders are saved in the year they belong to, so for 10 years I'm talking about more than 3000 files. To be honest, I had never worked with .sn1 files before.

I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (way more than what I've worked with before, which is also the reason why I can't provide a simple example). I would like to write a program that merges all of the files into 1 huge database, so I can get a better sample from it.

I have an Excel extension that would list all the file locations in a specific folder, can I use a list like this to put all the files together?

  • How big are the files (in MB or GB)? Do you have enough memory to load all of them? – Gregor Thomas Jan 16 '18 at 17:27
  • Do you specifically need to use R? It seems to me that your question is "how do I merge a number of SN format databases", and that you don't/shouldn't care how it is done. I am not familiar with R-land; is an SN database native to that? If not, can you give some information about what sort of software writes that format? Is it telescope-specific? – halfer Jan 16 '18 at 17:48
  • This is a very broad and general question. In order to increase the likelihood of getting a response, can you break down this question into a series of smaller steps? If you have code to merge a handful of files together, what difficulty are you having extending this to a larger group? R keeps all of its data in memory and may not be able to handle a multi-GB dataset well. – Dave2e Jan 16 '18 at 18:08
  • @gregor a full day would be 10KB, and technically I should have enough memory – b0rr3g0 Jan 16 '18 at 18:30
  • @halfer I normally use R since it's the language I learned in school, I'm not familiar with other languages, but I wouldn't mind changing it if it makes my problem easier, about the file extension I have no clue where it is used, and I personally had never seen it before, all the cells in a row are separated by spaces – b0rr3g0 Jan 16 '18 at 18:30
  • OK. Where does the ".sn" format come from? Can you provide references for it on the web? Is that the native format for the telescope? Do we need to know what brand of telescope you are using? – halfer Jan 16 '18 at 18:33
  • I don't know how to help you more. If you're looking for the answer "yes, this sounds possible", than here it is. If I were you I'd try to follow one of the solutions at the FAQ on [How to merge a list of data frames](https://stackoverflow.com/q/8091303/903061) for the work of merging, and [How to make a list of data frames](https://stackoverflow.com/a/24376207/903061) for preparing for it. – Gregor Thomas Jan 16 '18 at 18:40
  • Yes, it's possible to use the file list from Excel, but unless those directories are very confusing it will probably be easier to just use `list.files()` and pull the names directly into R (as shown in the How to make a list of data frames answer). – Gregor Thomas Jan 16 '18 at 18:46
  • For the merging, write whatever merge code you have for two files into a function, `merge_sn`, and then `result = Reduce(merge_sn, list_of_sn_data)`. Start by testing on a few days, and then it's still probably safest to work in batches - maybe do each month (in a for loop) and save the results as you go, then combine all the months. – Gregor Thomas Jan 16 '18 at 18:48

1 Answer


The suggestions were too long for a comment, so I'm posting them as an answer here.

It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.

Multiple Locations: If you have a list of all the locations, you can search in those locations to get just the files you need. You mentioned an Excel file (let's call it paths.csv; it has only one column with the directory locations):

library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")

# Focusing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = "\\.sn1$")

# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")

Reading all the files: I created a space-delimited dummy file, gave it the extension ".sn1", and was able to read it properly with data.table::fread(). If you're able to open the files in Notepad or something similar, it should work for you too. I'd need more information on how the files with different headers can be distinguished from one another: do they follow a naming convention, or do they have different extensions (which appears to be the case)? Focusing on the files with 5 rows of headers/other info for now.

read_func <- function(fname){
  dat <- fread(fname, sep = " ", skip = 5)
  dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
  return(dat)            # Return the data explicitly so lapply collects it
}

# Get all files into a list
data_list <- lapply(file_names, read_func)

# Merge list to get one big dataset 
dat <- rbindlist(data_list, use.names = TRUE, fill = TRUE)
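If the two formats end up mixed in the same folders, another option is to auto-detect the header instead of hard-coding skip = 5. This is only a sketch: it assumes data rows start with a number while header lines do not (adjust the check to your real files), and `read_func_auto` is a made-up name.

```r
library(data.table)

# Sketch: auto-detect the 5-line header. Assumes data rows start with a
# number while header lines do not - adjust the check to your real files.
read_func_auto <- function(fname){
  first_token <- strsplit(readLines(fname, n = 1), " ")[[1]][1]
  has_header  <- is.na(suppressWarnings(as.numeric(first_token)))
  dat <- fread(fname, sep = " ", skip = if (has_header) 5 else 0)
  dat$file_name <- fname
  return(dat)
}
```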

Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
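As a sketch of that idea (with `read_one_dir` as a made-up helper name, and `read_func` being the reader defined above):

```r
library(data.table)

# Sketch: wrap the steps above so each directory yields one data.table.
# read_func is the reader defined earlier; read_one_dir is a made-up name.
read_one_dir <- function(dir_path){
  fnames <- dir(path = dir_path, pattern = "\\.sn1$", full.names = TRUE)
  rbindlist(lapply(fnames, read_func), use.names = TRUE, fill = TRUE)
}

# One element per directory, each a data.table of that directory's files:
# dir_data <- lapply(all_directories$paths, read_one_dir)
```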

To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:

ptrns <- paste(sapply(1:5, function(z) paste("\\.sn", z, sep = "")), collapse = "|")
# "\\.sn1|\\.sn2|\\.sn3|\\.sn4|\\.sn5"
dir(all_directories$paths[1], pattern = ptrns)

Here's a simplified version that should work for all file extensions in all directories right away. It might take some time if the files are very large, so you may want to consider doing this in chunks instead.

# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names.
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = FALSE)$V1)

# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
  ptrns <- paste(sapply(1:5, function(q) paste("\\.sn", q, sep = "")), collapse = "|")
  inter <- dir(z, pattern = ptrns)
  return(paste(z, inter, sep = "/"))
}))

# Get all data in a single data.table using read_func previously defined 
dat <- rbindlist(lapply(file_names, read_func), use.names = TRUE, fill = TRUE)
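If memory does become a problem with thousands of files, a batched variant can process one directory at a time and save each intermediate result to disk, so a failure partway through doesn't lose earlier work. This is a sketch: `process_in_chunks` and the chunk_*.rds file names are made up, and `read_func` is the reader defined above.

```r
library(data.table)

# Sketch: process one directory per iteration and save the result as RDS.
# process_in_chunks and the chunk file names are arbitrary choices.
process_in_chunks <- function(paths_vec, out_dir = "."){
  for (i in seq_along(paths_vec)) {
    fnames <- dir(paths_vec[i], pattern = "\\.sn[1-5]$", full.names = TRUE)
    chunk <- rbindlist(lapply(fnames, read_func), use.names = TRUE, fill = TRUE)
    saveRDS(chunk, file = file.path(out_dir, sprintf("chunk_%03d.rds", i)))
  }
  # Combine the saved chunks at the end
  chunk_files <- list.files(out_dir, pattern = "^chunk_.*\\.rds$", full.names = TRUE)
  rbindlist(lapply(chunk_files, readRDS), use.names = TRUE, fill = TRUE)
}
```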
Gautam