
I have been building a data import for climatology files (*.dly), and while I am happy that it works, it is completely unfeasible to run on the whole dataset I have.

*.dly files are highly structured fixed-width flat files, such as those here: https://data.noaa.gov/dataset/global-historical-climatology-network-daily-ghcn-daily-version-3 (NCEI FTP Server -> GSN -> any file there).

# Prepare column widths for the fixed-width .dly format
ColWidths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))

# Remove any leftover data frame from a previous run
if (exists("Data_Raw") && is.data.frame(Data_Raw)) rm(Data_Raw)

dailies <- list.files(pattern = "\\.dly$")
count <- 1

for (file in dailies) {

  # Import the .dly file into a data frame
  if (!exists("Data_Raw")) {   # first file
    Data_Raw <- read.fwf(file, widths = ColWidths, header = FALSE)
  } else {                     # subsequent files
    temp_Data <- read.fwf(file, widths = ColWidths, header = FALSE)
    Data_Raw  <- rbind(Data_Raw, temp_Data)
    count <- count + 1
    rm(temp_Data)
  }
}

However, my tests show that this code imports about 670 files per hour, and the whole database is around 100k files, so a full import would take roughly 150 hours. That is far too long.

The other problem is that those 670 files already produced 633k rows, and after reshaping them into a more useful layout (.dly files keep years/months in rows and days in columns; I want months and days in rows) I already have 19.6M rows (633k × 31). The whole database would therefore be roughly 3B rows, which is scary.
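Not part of my original code: a minimal sketch of that wide-to-long reshape with data.table::melt, under the assumptions that Data_Raw still carries read.fwf's default V1..V128 column names (the first four being station/year/month/element, followed by 31 groups of value/mflag/qflag/sflag) and that -9999 marks a missing or non-existent day, as in GHCN-Daily.

library(data.table)

DT <- as.data.table(Data_Raw)
setnames(DT, 1:4, c("station", "year", "month", "element"))

value_cols <- seq(5, ncol(DT), by = 4)       # first column of each day's group
Data_Long <- melt(
  DT,
  id.vars       = c("station", "year", "month", "element"),
  measure.vars  = list(value_cols, value_cols + 1, value_cols + 2, value_cols + 3),
  value.name    = c("value", "mflag", "qflag", "sflag"),
  variable.name = "day"
)
Data_Long[, day := as.integer(day)]          # factor 1..31 -> integer day of month
Data_Long <- Data_Long[value != -9999]       # drop missing / non-existent days

This reshape is what multiplies the row count by up to 31, so filtering out the -9999 entries during (rather than after) the reshape keeps the result considerably smaller.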

My ideas were to:

  • Instead of importing everything into a data frame, use R to stream each .dly file into a SQL table, e.g. using the RSQLite package (a sketch follows this list).

  • Invest a bit in a cloud computing solution such as AWS to run the import once (although I don't know much about it, as I have never used it).

  • Accept that R won't be able to process this much data and instead learn some Python libraries to load it into a SQL table.

  • Accept that it's even too much for SQL to handle (if I need to analyze it efficiently), and learn some basic Hadoop and start from there.
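A minimal sketch of the first idea (RSQLite), not taken from my existing code — the database file name and table name below are hypothetical. Each file is appended straight to an on-disk table, so memory use stays bounded by one file at a time instead of the whole dataset:

library(readr)
library(DBI)
library(RSQLite)

ColWidths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))
con <- dbConnect(SQLite(), "ghcn_daily.sqlite")                       # hypothetical database file

for (file in list.files(pattern = "\\.dly$")) {
  chunk <- read_fwf(file, fwf_widths(ColWidths))                      # read one station file
  dbWriteTable(con, "dly_raw", as.data.frame(chunk), append = TRUE)   # hypothetical table name
}

dbDisconnect(con)

The same loop should also work against PostgreSQL (pgAdmin III is already in my toolset) by swapping the RSQLite driver for a PostgreSQL DBI driver.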

This is a home pet project, so I am using only open-source software, such as:

  • RStudio
  • PgAdminIII
  • & my computer has only an i3 at 3.2 GHz and 4 GB RAM, running OS X El Capitan

I would be happy if you could give me some hints on how to handle this. If the proper solution requires learning new things, that's OK, as I don't have any deadline for it.

Thanks for all answers, Matt

EDIT: 3rd December 2016

After reading the post suggested in the comment, I rebuilt my code and tried three import solutions: data.table, read_fwf, and LaF. read_fwf was the fastest. I also used rbindlist instead of rbind, as it looks to be faster. The result is about 20 times faster: it imported 12.6M rows in one hour (compared to 633k previously). However, that is still not enough to import the whole database, which is around 760M rows. Therefore I will try to combine all the .dly files into one big file first and then use R to import it, as I think that moving from one file to another takes more time than actually reading them (an alternative sketch that binds only once is below the new code).

new code:

require(readr)
require(data.table)

# Prepare column widths for the fixed-width .dly format
ColWidths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))

# Remove any leftover data frame from a previous run
if (exists("Data_Raw") && is.data.frame(Data_Raw)) rm(Data_Raw)

dailies <- list.files(pattern = "\\.dly$")
count <- 1

system.time(

  for (file in dailies) {

    # Import the .dly file into a data frame
    if (!exists("Data_Raw")) {   # first file
      Data_Raw <- read_fwf(file, fwf_widths(ColWidths))
    } else {                     # subsequent files
      temp_Data <- read_fwf(file, fwf_widths(ColWidths))
      Data_Raw  <- rbindlist(list(Data_Raw, temp_Data))

      count <- count + 1
      rm(temp_Data)
    }
  }

) # close system.time
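
Not something I have run yet: a sketch of the alternative mentioned above that binds only once. Each rbindlist call inside the loop re-copies everything accumulated so far, so reading all files into a list and binding a single time at the end avoids that repeated copying (file names and widths as above):

require(readr)
require(data.table)

ColWidths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))
dailies   <- list.files(pattern = "\\.dly$")

system.time({
  pieces   <- lapply(dailies, function(f) read_fwf(f, fwf_widths(ColWidths)))   # read each file
  Data_Raw <- rbindlist(pieces)                                                  # bind once at the end
})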

Matt

  • See the solution to [this question](https://stackoverflow.com/questions/24715894/faster-way-to-read-fixed-width-files-in-r) on SO. It might help with the data import, unless it's the `rbind` function eating your time. [Executive summary: use the LaF or iotools package] – ekstroem Nov 29 '16 at 20:26
  • Thanks, I will read that thread carefully. I will also test what takes more time: importing or binding. – Mateusz Konopelski Nov 29 '16 at 20:31
