Read the data efficiently with multiple separating lines in R

Question

I have a sample dataset like this:

 8  02-Model (Minimum)
250.04167175293  17.4996566772461
250.08332824707  17.5000038146973
250.125  17.5008907318115
250.16667175293  17.5011672973633
250.20832824707  17.5013771057129
250.25   17.502140045166
250.29167175293  17.5025615692139
250.33332824707  17.5016822814941
 7  03 (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506

The first column of each header line gives the number of data rows in that block (e.g. 8 for 02-Model (Minimum)). After those 8 lines there is another header line, 7 03 (Maximum), which means that 03 (Maximum) has 7 lines of data.

The function I have written is as follows:

readts <- function(x)
{
  path <- x
  # Read the first line of the file
  hello1 <- read.table(path, header = F, nrows = 1,sep="\t")
  tmp1 <- hello1$V1
  # Read the data below first line
  hello2 <- read.table(path, header = F, nrows = (tmp1), skip = 1, 
                       col.names = c("Time", "value"))
  hello2$name <- c(as.character(hello1$V2))
  # Read data for the second chunk
  hello3 <- read.table(path, header = F, skip = (tmp1 + 1), 
                       nrows = 1,sep="\t")
  tmp2 <- hello3$V1
  hello4 <- read.table(path, header = F, skip = (tmp1 + 2), 
                       col.names = c("Time", "value"),nrows=tmp2)
  hello4$name <- c(as.character(hello3$V2))
  # Combine data to create a dataframe
  df <- rbind(hello2, hello4)
  return(df)
}

The output I get is as follows:

> readts("jdtrial.txt")
       Time    value               name
1  250.0417 17.49966 02-Model (Minimum)
2  250.0833 17.50000 02-Model (Minimum)
3  250.1250 17.50089 02-Model (Minimum)
4  250.1667 17.50117 02-Model (Minimum)
5  250.2083 17.50138 02-Model (Minimum)
6  250.2500 17.50214 02-Model (Minimum)
7  250.2917 17.50256 02-Model (Minimum)
8  250.3333 17.50168 02-Model (Minimum)
9  250.0417 17.50206       03 (Maximum)
10 250.0833 17.50115       03 (Maximum)
11 250.1250 17.50113       03 (Maximum)
12 250.1667 17.50124       03 (Maximum)
13 250.2083 17.50160       03 (Maximum)
14 250.2500 17.50247       03 (Maximum)
15 250.2917 17.50432       03 (Maximum)

jdtrial.txt is the data shown above. However, when I have a large file with many such separator lines, my function doesn't work, and I need to add more lines for each extra block, which makes the function messier. Is there an easier way to read a data file like this? Thanks.

The expected output has the same form as the output I got above. Here is some data you can try with:

 8  02-Model (Minimum)
250.04167175293  17.4996566772461
250.08332824707  17.5000038146973
250.125  17.5008907318115
250.16667175293  17.5011672973633
250.20832824707  17.5013771057129
250.25   17.502140045166
250.29167175293  17.5025615692139
250.33332824707  17.5016822814941
 7  03 (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506
 8  04-Model (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506
250.33332824707  17.5055828094482

G. Grothendieck · Accepted Answer · 2013-07-12T03:31:35.810

It's not clear what "multiple separators" refers to, but here is a solution that addresses the data you actually showed.

Read in the data using fill=TRUE to fill in empty fields. Keep track of which rows are headers using is.hdr. Convert V2 to numeric (replacing V2 with NA in the header rows so that they do not generate a warning). Then, in the next two columns, set the non-header rows to NA and use na.locf from the zoo package to fill in those NAs with the header values. Finally, keep only the non-header rows.

library(zoo)
DF <- read.table("jdtrial.txt", fill = TRUE, as.is = TRUE)

is.hdr <- DF$V3 != ""
transform(DF, 
    V2 = as.numeric(replace(V2, is.hdr, NA)),
    V3 = na.locf(ifelse(is.hdr, V2, NA)),
    name = na.locf(ifelse(is.hdr, V3, NA)))[!is.hdr, ]

The result of the last statement is:

         V1       V2       V3      name
2  250.0417 17.49966 02-Model (Minimum)
3  250.0833 17.50000 02-Model (Minimum)
4  250.1250 17.50089 02-Model (Minimum)
5  250.1667 17.50117 02-Model (Minimum)
6  250.2083 17.50138 02-Model (Minimum)
7  250.2500 17.50214 02-Model (Minimum)
8  250.2917 17.50256 02-Model (Minimum)
9  250.3333 17.50168 02-Model (Minimum)
11 250.0417 17.50206       03 (Maximum)
12 250.0833 17.50115       03 (Maximum)
13 250.1250 17.50113       03 (Maximum)
14 250.1667 17.50124       03 (Maximum)
15 250.2083 17.50160       03 (Maximum)
16 250.2500 17.50247       03 (Maximum)
17 250.2917 17.50432       03 (Maximum)
19 250.0417 17.50206 04-Model (Maximum)
20 250.0833 17.50115 04-Model (Maximum)
21 250.1250 17.50113 04-Model (Maximum)
22 250.1667 17.50124 04-Model (Maximum)
23 250.2083 17.50160 04-Model (Maximum)
24 250.2500 17.50247 04-Model (Maximum)
25 250.2917 17.50432 04-Model (Maximum)
26 250.3333 17.50558 04-Model (Maximum)
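
For readers who would rather not depend on zoo, the same carry-forward idea can probably be sketched in base R with cumsum. This is only an illustration, not part of the solution above: the shortened sample text and the names hdr.names, block and out are made up here, and it assumes, like the code above, that every series name consists of exactly two whitespace-separated words.

```r
# Shortened sample standing in for jdtrial.txt (illustrative data only)
txt <- " 2  02-Model (Minimum)
250.04167175293  17.4996566772461
250.08332824707  17.5000038146973
 1  03 (Maximum)
250.04167175293  17.5020561218262"

DF <- read.table(text = txt, fill = TRUE, as.is = TRUE)

# Header rows are the only rows with a non-empty third field
is.hdr <- DF$V3 != ""

# Rebuild the full series name from the two header fields
hdr.names <- paste(DF$V2[is.hdr], DF$V3[is.hdr])

# cumsum() gives every row the index of the most recent header row,
# which plays the role that na.locf() plays above
block <- cumsum(is.hdr)

out <- data.frame(
  Time  = DF$V1[!is.hdr],
  value = as.numeric(DF$V2[!is.hdr]),
  name  = hdr.names[block[!is.hdr]]
)
out
```

Here the two header fields are pasted back together, so the result has a single name column rather than the separate V3 and name columns shown above.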

Nice. This is by far the best option. Everyone has overlooked the `fill=TRUE` argument. — thelatemail, Jul 12 '13 at 03:12
It seems short and handy but I am not familiar with zoo package. Your explanation is helpful though. — Jd Baba, Jul 12 '13 at 03:25

A5C1D2H2I1M1N2O1R2T1 · Answer 2 · 2013-07-12T03:23:48.000

Here's a function that seems to work on your sample data. It returns a list of data.frames, but you can use do.call(rbind, ...) to get a single data.frame if you prefer.

myFun <- function(textfile) {
  # Read the lines of your text file
  x <- readLines(textfile)
  # Identify lines that start with space followed
  #  by numbers followed by space followed by
  #  numbers. By the looks of it, matching the
  #  space at the start of the line might be
  #  sufficient at this stage.
  myMatch <- grep("^\\s[0-9]+\\s+[0-9]+", x)
  # Extract the first number, which tells us how
  #  many values need to be read in.
  scanVals <- as.numeric(gsub("^\\s+([0-9]+)\\s+.*", 
                              "\\1", x[myMatch]))
  # Extract. I've used seq_along which is like 
  #  1:length(myMatch)
  temp <- lapply(seq_along(myMatch), function(y) {
    # scan will return just a single vector, but your
    #  data are in pairs, so we convert the vector to
    #  a matrix filled in by row
    t1 <- matrix(scan(textfile, skip = myMatch[y], 
                      n = scanVals[y]*2), ncol = 2, 
                 byrow = TRUE)
    # Add column names to the matrix
    colnames(t1) <- c("time", "value")
    # Convert the matrix to a data.frame and add the 
    #  name column using cbind.
    cbind(data.frame(t1), 
          name = gsub("^\\s+([0-9]+)\\s+(.*)", "\\2", 
                      x[myMatch])[y])
  })
  # Return the list we just created
  temp
}

Example usage would be:

myFun("mytest.txt")                  ## list output

or

do.call(rbind, myFun("mytest.txt"))  ## Single data.frame
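
Since the function leans on regular expressions with capture groups, here is a small stand-alone sketch of how the header lines are taken apart. The vector hdr and the names pattern, counts and labels are just for illustration; the pattern is the same idea as the gsub calls above, with an added $ anchor.

```r
# Two header lines copied from the sample data
hdr <- c(" 8  02-Model (Minimum)", " 7  03 (Maximum)")

# Group 1 captures the row count, group 2 captures the rest of the line
pattern <- "^\\s+([0-9]+)\\s+(.*)$"

counts <- as.numeric(sub(pattern, "\\1", hdr))
labels <- sub(pattern, "\\2", hdr)

counts  # 8 7
labels  # "02-Model (Minimum)" "03 (Maximum)"
```

The counts tell scan how many values to read (two per row), and the labels become the name column.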

Thank you so much. It works perfectly, but I am still trying to understand it. – Jd Baba, Jul 12 '13 at 02:43

Hong Ooi · Answer 3 · 2013-07-12T02:39:16.347

Read the data using readLines, and then do each chunk of data in sequence. This avoids having to make assumptions about the model name or fiddling with regexes. You do have to use a loop as opposed to [sl]apply, but really, there's nothing wrong with that.

readFile <- function(file)
{
    con <- readLines(file)
    i <- 1
    chunks <- list()
    while(i < length(con))
    {
        type <- scan(text=con[i], what=character(2), sep="\t")
        nlines <- as.numeric(type[1])
        dat <- cbind(read.delim(text=con[i+seq_len(nlines)], header=FALSE),
                     type=type[2])
        chunks <- c(chunks, list(dat))
        i <- i + nlines + 1
    }
    do.call(rbind, chunks)
}

edited Jul 12 '13 at 02:39

answered Jul 12 '13 at 02:31


I got error using your function: `> readFile("trial2.txt") Read 1 item Error in seq_len(nlines) : argument must be coercible to non-negative integer In addition: Warning message: NAs introduced by coercion ` – Jd Baba Jul 12 '13 at 02:46
I'm assuming your data has tab delimiters, as implied by your post. Are you running it on input where the tabs have been converted to spaces (eg by cutting and pasting from/to SO)? – Hong Ooi Jul 12 '13 at 02:55
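
If the file turns out to be space-delimited, as the error in the comment suggests, a whitespace-tolerant variant of the same loop might look like the sketch below. It is an assumption-laden illustration rather than the function above: readChunks is a hypothetical name, it takes the lines already read with readLines, and it assumes every block has at least one data row and that the file has no trailing blank lines.

```r
readChunks <- function(lines) {
  i <- 1
  chunks <- list()
  while (i <= length(lines)) {
    # Header line: optional leading blanks, a row count, then the name
    n    <- as.integer(sub("^\\s*([0-9]+)\\s+.*$", "\\1", lines[i]))
    name <- sub("^\\s*[0-9]+\\s+", "", lines[i])
    # read.table splits on any run of spaces or tabs by default
    dat <- read.table(text = lines[i + seq_len(n)], header = FALSE,
                      col.names = c("Time", "value"))
    dat$name <- name
    chunks[[length(chunks) + 1]] <- dat
    # Jump to the next header line
    i <- i + n + 1
  }
  do.call(rbind, chunks)
}

# Illustrative input; in practice this would be readLines("jdtrial.txt")
lines <- c(" 2  02-Model (Minimum)",
           "250.04167175293  17.4996566772461",
           "250.08332824707  17.5000038146973",
           " 1  03 (Maximum)",
           "250.04167175293  17.5020561218262")
res <- readChunks(lines)
res
```

Parsing the header with sub instead of scan(sep="\t") means the function no longer cares whether the separator is a tab or spaces.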

thelatemail · Answer 4 · 2013-07-12T03:25:50.090

Edit to replace my original answer in light of @G.Grothendieck's far better answer. This is largely a variation on that answer.

Another go, where for the purposes of demonstration, test is just the raw text like:

test <-" 1  02-Model (Minimum)
250.04167175293  17.4996566772461
 1  03 (Maximum)
250.04167175293  17.5020561218262
 1  04-Model (Maximum)
250.04167175293  17.5020561218262"

Process it:

interm <- read.table(
  text = test, fill = TRUE, as.is = TRUE,
  col.names=c("Time","Value","Name")
)

keys <- which(interm$Name != "")

interm$Name <- rep(
  apply(interm[keys,][-1],1,paste0,collapse=""), 
  diff(c(keys,nrow(interm)+1))
)

result <- interm[-(keys),]

Result:

      Time            Value              Name
2 250.0417 17.4996566772461 02-Model(Minimum)
4 250.0417 17.5020561218262       03(Maximum)
6 250.0417 17.5020561218262 04-Model(Maximum)
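
Note that in this result the name has lost its space and Value is still character, because the header rows share that column. A possible tweak, sketched here with a shortened test string and a hypothetical helper vector full.names, is to paste the two header fields with a space and convert Value afterwards:

```r
test <- " 1  02-Model (Minimum)
250.04167175293  17.4996566772461
 1  03 (Maximum)
250.04167175293  17.5020561218262"

interm <- read.table(text = test, fill = TRUE, as.is = TRUE,
                     col.names = c("Time", "Value", "Name"))
keys <- which(interm$Name != "")

# paste() inserts a space, so "02-Model (Minimum)" stays intact
full.names <- paste(interm$Value[keys], interm$Name[keys])
interm$Name <- rep(full.names, diff(c(keys, nrow(interm) + 1)))

result <- interm[-keys, ]
# Value was read as character because of the header rows
result$Value <- as.numeric(result$Value)
result
```

Like the code above, this assumes each name splits into exactly two fields.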
