2

I am trying to merge >30,000 individual .dat files of climate data across the UK (with sequential names of met*****.dat) into one single data file that I can then extract annual values from. The data is formatted such that the first row in each file contains the lat long for that cell and then underneath that are the climate variables measured:

54.78   -7.06
 iGrid     x      y      year     month   SRad        Tmin       Tmax          vp        wind        rain        sunshine
19028      56     139    1971       1        2.22        1.25        6.43        7.29       12.33      133.47        1.72
19028      56     139    1971       2        3.94        1.26        6.74        7.36       11.11      106.64        1.85
19028      56     139    1971       3        7.30        0.95        7.56        7.05       12.77       57.26        2.57
19028      56     139    1971       4       12.09        2.29       10.07        7.71       11.02      112.81        4.24
19028      56     139    1971       5       15.51        4.85       13.07        9.20       13.94      103.08        5.02
19028      56     139    1971       6       16.18        6.36       14.09        9.98       12.44      115.03        4.76
19028      56     139    1971       7       15.59        9.25       17.56       12.83       10.45      107.18        4.73
19028      56     139    1971       8       11.92        9.20       15.98       12.70       11.39      136.40        3.35
19028      56     139    1971       9        9.69        7.85       16.31       12.72       12.24       41.76        4.11
19028      56     139    1971      10        5.43        6.96       12.88       10.56       16.55       99.72        2.90
19028      56     139    1971      11        2.55        2.78        8.36        8.39       17.98      195.24        1.56
19028      56     139    1971      12        1.51        3.29        8.02        8.68       16.74       54.63        0.89
19028      56     139    1972       1        1.96       -0.30        4.79        6.78       16.55      168.39        1.07
19028      56     139    1972       2        4.23        0.29        5.78        6.97       15.97      114.47        2.33
19028      56     139    1972       3        7.80        0.53        7.54        7.71       14.02      165.09        3.14
19028      56     139    1972       4       12.51        2.80       10.32        8.26       13.39      116.30        4.63
19028      56     139    1972       5       14.49        4.46       11.50        8.98       15.54      187.31        4.17
19028      56     139    1972       6       16.95        5.36       12.90        9.50       12.45      123.17        5.39
19028      56     139    1972       7       15.35        8.61       16.81       12.63        7.41      105.49        4.53
19028      56     139    1972       8       11.40        8.04       15.17       11.66       11.85       86.44        2.88
19028      56     139    1972       9        9.60        5.54       14.03       10.19        9.96       24.70        4.01
19028      56     139    1972      10        5.29        6.04       12.06       10.27       12.92       79.35        2.70
19028      56     139    1972      11        2.49        1.74        7.09        8.02       14.79      169.47        1.43
19028      56     139    1972      12        1.55        1.80        7.45        8.00       17.66      153.31        1.01
19028      56     139    1973       1        1.73        2.16        6.32        7.98       15.25      202.89        0.52
19028      56     139    1973       2        4.08        0.27        5.47        7.17       14.71      126.04        2.07
19028      56     139    1973       3        8.05        0.90        8.30        7.24       13.64       49.84        3.43
19028      56     139    1973       4       12.54        1.98        9.05        7.27       13.02       86.45        4.66
19028      56     139    1973       5       14.55        4.71       12.37        9.41       13.18      102.20        4.22
19028      56     139    1973       6       16.05        7.42       16.17       11.17       10.93       64.45        4.65
19028      56     139    1973       7       13.72        9.66       16.39       12.82        8.16      111.27        3.18
19028      56     139    1973       8       12.23        9.62       17.07       12.93       10.37       97.53        3.62
19028      56     139    1973       9        9.11        7.95       15.02       11.65       12.63      116.15        3.49
19028      56     139    1973      10        5.23        4.51       10.78        9.45       10.27       71.68        2.61
19028      56     139    1973      11        2.37        1.84        7.63        7.91       14.48      162.53        1.17
19028      56     139    1973      12        1.56        1.19        6.79        7.72       14.65      174.89        1.03

Reading previous answers (This and this) has helped me arrive at this code:

library(data.table)

filenames <- list.files(path="/mydirectory",
                    pattern = ".dat", full.names=T)

temp <- lapply(filenames, fread, sep=",")
data <- rbindlist(temp)

write.table(data,"metcombi.dat", sep=",")

Which spits out a huge .dat file (the input files occupy ~2.5GB) but at the end of each file it repeats the column headings and erases the lat long values.

1) How do I stop it from repeating the column headers for each file I'm merging? 2) How can I extract the lat long values from the first row in each file and move them to new columns?

Thank you!

Community
  • 1
  • 1
B.Wel
  • 86
  • 10
  • Please show the actual file instead of this strange structure. Use the `skip` parameter to read the data without the coordinates and then use the `nrow` data to only read the coordinates. Then combine both into one data.table with a `lat` and `long` column and `rbindlist`. – Roland Aug 03 '16 at 11:18
  • I guess the list items are data.tables. Have you tried `data.table::merge`? – Martin Schmelzer Aug 03 '16 at 11:40
  • Hi Roland, I will update my answer as soon as I can with the actual file but other than using dput I don't understand how I can do that. – B.Wel Aug 03 '16 at 11:43
  • You could copy and paste the content of one .dat file...or upload it via dropbox... – Martin Schmelzer Aug 03 '16 at 12:57
  • Added some sample data. Each file has the same lat long row followed by the headers beneath and data covering 1971-2011. – B.Wel Aug 03 '16 at 13:42

2 Answers2

2

Consider expanding your lapply() to do a quick read of first line, read in .dat file into dataframe, and then split lat/long first lines to columns:

filenames <- list.files(path="/mydirectory",
                        pattern=".dat", full.names=TRUE)

dfList <- lapply(filenames, function(f){
    # FIRST LINES
    con <- file(description=f, open="r")
    latlong <- readLines(con, warn=FALSE)[1]
    close(con)

    # DATA FRAME
    df <- read.table(f, skip=1, header=TRUE)
    df$lat <- strsplit(latlong, "\\s+")[[1]][1]
    df$lon <- strsplit(latlong, "\\s+")[[1]][2]

    return(df)
})

data <- rbindlist(dfList)

write.csv(data, "metcombi.dat")
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Thanks for the help @Parfait. Ran the code and got an error message which I'm attempting to debug: 'Error in read.table(f, skip = 1, header = TRUE) : no lines available in input 4 stop("no lines available in input") 3 read.table(f, skip = 1, header = TRUE) 2 FUN(X[[i]], ...) 1 lapply(filenames, function(f) { con <- file(description = f, open = "r") latlong <- readLines(con, warn = FALSE)[1] close(con) ...' – B.Wel Aug 04 '16 at 07:54
  • In the help for `file` it mentions that "A maximum of 128 connections can be allocated (not necessarily open) at any one time." As I have 35136 input files could this be the issue? – B.Wel Aug 04 '16 at 08:02
  • Many thanks for your help. The code works (the errors were due to a corrupted input file), I just need to get the lat long into the correct columns as currently the lat is going into the long column and no lat data is transcribed across. – B.Wel Aug 04 '16 at 09:53
  • Great! Yes adjust code as needed to accommodate inputs even with `if` logic. Glad I could help. – Parfait Aug 04 '16 at 12:39
  • Figured out the issue. In the `df$lat <- strsplit(latlong, "\\s+")[[1]][1]` and `df$lon <- strsplit(latlong, "\\s+")[[1]][2]` lines the [1] needs to be [2] and [2] needs to be [3] as I think the space prior to the lat long numbers as [1]. Thanks again! – B.Wel Aug 04 '16 at 12:57
0

From what I understand of the problem above, I would suggest parsing the files in two phases:

First, grab the first line from each file, e.g.:

headers <- data.frame()

for (filename in filenames) {
    fp = open(filename)
    headers <- rbind(headers, readLines(fp, n=1))
    close(fp)
}

Next, use your same code above to loop over the files and grab the main data from each of them, but skip the header line using something like header=TRUE or skip=1:

...
temp <- lapply(filenames, fread, sep=",", header=TRUE)
data <- rbindlist(temp)

The only thing left to do is to add the lat,long information back in from the first loop. I'm not sure what the format is, but this may be as simple as a cbind() call on the two data.frames.

Keith Hughitt
  • 4,860
  • 5
  • 49
  • 54