
I am trying to import data from a text file that is not shaped like a data frame and contains multiple precipitation reports. The reports all have the same layout; a sample of one is the following:

  I D E A M  -  INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES
                                                                                                          INFORMATION SYSTEM
                                  PRECIPITATION TOTAL VALUES (mms)                              NATIONAL ENVIRONMENTAL 

DATE OF PROCESS :  2015/09/15                    YEAR  1980                              STATION ID : 11010010  VUELTA LA

LAT    0527 N               TIPO EST    PM                   STATE      CHOCO                   INSTALLATION DATE   1943-ENE
LON   7632 W               ENT     01  IDEAM            CITY  LLORO                   FECHA-SUSPENSION
ELE   100 m.s.n.m         REGIONAL    01  ANTIOQUIA        CORRIENTE  ANDAGUEDA


      DAY       JAN *  FEB *  MAR *  APR *  MAY  *  JUN *  JUL *  AGO *  SEP *  OCT *  NOV *  DEC *


       01                 30.0       .0       .0      3.0     80.0       .0      3.0       .0     35.0     88.0      1.0
       02                   .0      1.0       .0      1.0    100.0       .0       .0      6.0      1.0     65.0     69.0
       03                 35.0    100.0       .0     10.0       .0       .0       .0     70.0     40.0     42.0     16.0
       04                   .0       .0     80.0      3.0    140.0      8.0       .0    135.0     20.0     48.0     15.0
       05                   .0       .0       .0      8.0      3.0     20.0      4.0     19.0     80.0       .0     20.0
       06                   .0       .0    100.0    138.0       .0      6.0       .0      4.0     20.0       .0     10.0
       07                 31.0     10.0       .0     30.0     15.0     50.0      6.0       .0      4.0       .0       .0
       08                   .0     44.0       .0     10.0     40.0       .0       .0       .0      7.0       .0      4.0
       09                 35.0      3.0     23.0       .0     20.0    140.0       .0      6.0       .0     32.0     16.0
       10                   .0     75.0       .0       .0     60.0       .0       .0     23.0      3.0      1.0      5.0
       11                   .0     17.0       .0     15.0     80.0       .0       .0     80.0       .0       .0      3.0
       12                   .0     75.0       .0      8.0       .0     63.0     10.0       .0       .0     17.0     10.0
       13                   .0     20.0       .0     60.0       .0       .0       .0    110.0     50.0      3.0     25.0
       14                 55.0       .0     26.0     12.0       .0      3.0    140.0      4.0     74.0       .0     38.0
       15                   .0       .0      3.0      7.0     10.0       .0      6.0       .0     35.0     12.0     27.0
       16                   .0      4.0     89.0     20.0      3.0       .0       .0     10.0       .0       .0       .0
       17                 45.0       .0      9.0       .0     30.0       .0      2.0       .0     60.0    103.0       .0
       18                 30.0       .0       .0       .0     21.0       .0     20.0     15.0       .0       .0       .0
       19                   .0    130.0       .0     10.0     12.0      8.0       .0      3.0     20.0     49.0     40.0
       20                 45.0       .0     25.0    190.0       .0     38.0      8.0       .0      8.0      3.0      1.0
       21                  1.0       .0     45.0     50.0       .0     35.0       .0      2.0     13.0      1.0      4.0
       22                   .0       .0     20.0       .0       .0       .0       .0     16.0     10.0     12.0     50.0
       23                 40.0       .0     40.0     16.0       .0     30.0       .0     13.0      2.0    106.0     10.0
       24                   .0       .0     45.0     60.0       .0      3.0       .0     25.0       .0     16.0       .0
       25                   .0       .0       .0       .0     18.0     10.0       .0      3.0       .0     50.0     20.0
       26                 10.0       .0       .0       .0      9.0      6.0     20.0     20.0      6.0     15.0      3.0
       27                   .0    135.0     60.0     40.0     80.0     15.0       .0     18.0     10.0     77.0       .0
       28                 10.0       .0      9.0     15.0       .0       .0       .0      6.0     72.0    102.0       .0
       29                 23.0      6.0       .0       .0       .0       .0       .0     23.0       .0     34.0       .0
       30                            .0     10.0       .0     20.0      3.0       .0     64.0     14.0    111.0       .0
       31                            .0              31.0              10.0       .0                .0                .0


                                  ***  ANNUAL VALUES  ***

                                 TOTAL                  6954.0
                                 No DE RAIN DAYS         210
                                 MAX 24 Hrs        190.0

The text file contains one report after another, all with the same header "I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES". I have already read the text file with the readLines() function, and I was hoping to create a data frame with the information from each report, something like this:

DATE        STATION_ID  LAT    LON    ELE CITY STATE PRECIPITATION
01/JAN/1980 11010010    0527 N 7632 W 100 LLORO CHOCO 0

I have been trying to split each report and then parse each line, but unfortunately it is a slow process. I understand this site prefers well-delimited questions, but I am kind of stuck.
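To make the splitting step concrete, here is a minimal sketch (the file name is a placeholder; it assumes every report starts with the "I D E A M ... INSTITUTO DE HIDROLOGIA" header line, as in the sample above):

raw <- readLines("precipitation_reports.txt")   # placeholder file name

# Tag every line with the index of the report it belongs to:
# the counter goes up by one each time a header line is seen.
report_id <- cumsum(grepl("INSTITUTO DE HIDROLOGIA", raw, fixed = TRUE))

# One character vector per report, ready to be parsed field by field.
reports <- split(raw, report_id)
length(reports)   # number of reports found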

Thanks in advance.

topcat
  • `readLines` is a great place to start. Read each line in as a string, then process from there. – Gregor Thomas Oct 01 '15 at 21:20
  • Thank you very much, Gregor. I will start with readLines and try to parse and scan for values. Any other advice would be greatly appreciated. – topcat Oct 01 '15 at 21:37

1 Answer


Here's one way to do it.

  1. Use readLines() to read in the raw text; each report page is 56 lines.
  2. Pull out the header information by knowing the line numbers and character positions of latitude, longitude, elevation, city, state, and year, using substr() (see the sketch just after this list).
  3. Using the year obtained there, generate all the dates of that year and cbind them with the header information.
  4. Use a function that takes the day of the month and the month number and locates the corresponding precipitation value on the page. The line number is 14 + dayOfMonth; the horizontal offset comes from a vector of 12 numbers, one per month. Add that column to your page.
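For example, step 2 boils down to fixed-width slicing with substr(); using the bounds from the code further down (line 5 of a page, characters 60 to 63 hold the year), a single field lookup is just:

# page: character vector holding the lines of one report (page is a placeholder name)
year <- substr(page[5], 60, 63)   # "1980" for the sample page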

If you rbind each page as you go, you will end up with a long (!) tidy dataset. [edit] You will also spend an eternity on memory management if your dataset is large. Instead, you can create a list of data frames and bind them all at the end. See this question and this question for more information.
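As an aside, plyr is not strictly required for that final bind: because every page produces the same columns, a base-R do.call(rbind, ...) does the same job here (tiny illustration with made-up pages):

pages <- list(data.frame(station = "A", rainfall = 1.0),
              data.frame(station = "B", rainfall = 2.5))
do.call(rbind, pages)       # one data frame with the rows of all pages
# dplyr::bind_rows(pages)   # another common option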

Here is some code I came up with; you can test it on a short extract first.

library("lubridate")
raw2page <- function(rawdata) {
# Takes a vector of chars, one page of data, returns a tidy dataframe
# Template for the page header
yearbound <- c(5,60,63)
stationbound <- c(5,105,112)
latbound <- c(7,16,19)
longbound <- c(8,16,19)
deptobound <- c(7,81,101)
municipiobound <- c(8,81,101)

framebounds <- rbind(yearbound,stationbound,latbound,longbound,deptobound,municipiobound)
colnames(framebounds) <- c("line","start","end")
framebounds <- as.data.frame(framebounds)

framedata <- data.frame()
framedata <- as.data.frame(rbind(with(framebounds, substr(rawdata[line],start,end))))
colnames(framedata) <- c("year","station","latitude","longitude","depto","municipio")
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
framedata$depto <- trim(framedata$depto)
framedata$municipio <- trim(framedata$municipio)

# Make a column listing all dates of the year
st <- as.Date(paste(framedata[1]$year,"-01-01",sep=""))
en <- as.Date(paste(framedata[1]$year,"-12-31",sep=""))
date <- seq(as.Date(st),as.Date(en), by=1)
pagedata <- cbind(framedata,date)

# horizontal offsets for the last digit of each month (the last digit is aligned)
mboundaries<-c(25,34,43,52,61,70,79,88,97,106,115,124)
# now we can take the dates we generated before and use these coordinates to read the rainfall amount into a vector
rainfall <- as.numeric(substr(rawdata[14+mday(pagedata$date)],mboundaries[month(pagedata$date)]-6,mboundaries[month(pagedata$date)] ))
# and bind the vector to the page data to make a tidy data set 
page <- cbind(pagedata,rainfall)
page
}

raw <- readLines("area1.txt")   # read in all the data

# Get the line number of every page header
headers <- grep("HIDROLOGIA", raw)

# Page by page, fill a list with one data frame per report
listOfDataFrames <- vector(mode = "list", length = length(headers))
for (i in seq_along(headers)) {
  start <- headers[i]
  end <- start + 56
  listOfDataFrames[[i]] <- raw2page(raw[start:end])
}

# Bind them all together at the end
library("plyr")
output <- rbind.fill(listOfDataFrames)
print(summary(output))
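A quick sanity check after the run might look like this (the column names come from raw2page() above; cells that are blank in a report come back as NA, with a coercion warning from as.numeric()):

str(output)
head(output[, c("date", "station", "rainfall")])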
Derwin McGeary
  • @derwinmcgeary thank you very much! I have been trying the code you wrote, and I have some doubts about the extraction of the rain data. Can you explain a little more how you isolate the day column from the rain data columns? I am lost with the `mday` part. I was thinking of extracting just a day vector with something like `days <- as.numeric(substr(rawdata[dayboundaries], start, end))`, similar to the first part (the header data). – topcat Oct 02 '15 at 18:10
  • Of course! Actually, I took a long time thinking about how to do this because the lengths of the months are irregular and inconsistent (leap years). `mday(pagedata$date)` returns the day of the month when you give it a date (which is our **vertical** coordinate: day 2, the 2nd, is row 14 + 2 = 16). `month` returns the month number, and we use that to select the **horizontal** offset (see the short sketch below). To be clear, I'm completely ignoring the day numbers in the original file, because we can generate the days of the year automatically. – Derwin McGeary Oct 02 '15 at 18:17
  • You're welcome! Please consider marking best answer :-) – Derwin McGeary Oct 02 '15 at 19:30
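For illustration, the lubridate lookups described in the comments above behave like this (minimal sketch):

library("lubridate")
d <- as.Date("1980-02-02")
mday(d)    # 2 -> the value sits on page line 14 + 2 = 16
month(d)   # 2 -> mboundaries[2] is the column offset for February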