Import unusually formatted text data using R

Question

I have output data from a piece of equipment. Unfortunately the output data is not organized very well, and I have been writing a code in R to break it down. Essentially the data is a separate list of information (basic descriptive information, and raw data for two different measurements A and B for each time interval) for each subject pasted into one long document. For example:

Date: 01/01/2016
Time: 12:00:00 
Subject: Subject1
A: 
1: 1  2  4  1 
2: 2  1  2  3
3: 1  0  2  7
B:
1: 2  3  0  1
2: 4  1  1  2 
3: 3  5  2  8

Date: 01/01/2016
Time: 12:00:00 
Subject: Subject2   
A: 
1: 8  2  0  1 
2: 9  1  2  7
3: 1  6  2  7
B:
1: 2  3  2  0
2: 6  7  1  2
3: 3  3  2  4

I have written a code in R that works, but is not very elegant, using split(seq_along), for-loops, and do.call (based primarily on this stack overflow question and this blog post).

# First read text file in as a character vector called ‘example’

    scan("example_file.txt", what="character", strip.white=T, sep="\n") -> example

# Separate the header text (before the colon) from the proceeding data
# and make that text name the components of the vector

    regmatches(example, regexpr(example, pattern="[[:alnum:]]+:", useBytes = F)) -> names(example)
    gsub(example, pattern="[[:print:]]+: ", replacement="", useBytes = F)-> example.2

# Then, split character vector into a list based on how many lines are
# dedicated to each subject (in this example, 11 lines); based on SE
# answer cited above

    strsplit(example.2, "([A-Z]:)") -> example.3
    split(as.list(example.3), ceiling(seq_along(example.2)/11)) -> example.4

# Use a for-loop to systematically add the data together for subjects 1
# and 2 for time interval 1, using the method detailed from a blog post
# (cited above)

    my.list <- list()

    for(i in 1:2){
            strsplit(as.character(example.4[[i]][5]), split="[[:blank:]]+") -> A
            strsplit(as.character(example.4[[i]][9]), split="[[:blank:]]+")-> B

            as.vector(c(as.character(example.4[[i]][3]), "A", unlist(A))) -> A_char
            as.vector(c(as.character(example.4[[i]][3]), "B", unlist(B))) -> B_char

            paste(as.character(example.4[[i]][3]), "Measure_A") -> a_name
            paste(as.character(example.4[[i]][3]), "Measure_B") -> b_name

            my.list[[a_name]] <- A_char
            my.list[[b_name]] <- B_char
    }

    final.data <- do.call(rbind, my.list)
    as.data.frame(final.data) -> final.data

    names(final.data) <- c("Subject", "Measure", "V1", "V2", "V3", "V4")

I can extract the data for a single time interval for A and B across all subjects using my code (for example, the lines "1: 1 2 4 1" and "1: 2 3 0 1" above) and put put all the information together in a data frame. Where is gets messy is when I want to do this for all of the time intervals, not just one time interval. I can't figure out how to do this without running separate for-loops for each time interval. I tried doing a for-loop within a for-loop, but that didn't work. I also couldn’t figure out how to do this with the apply()-type functions.

If I only had 3 time intervals, as per this example, this issue wouldn’t be so bad, but my actual data is a lot longer. Any suggestions for a more elegant and concise approach would be appreciated!

P.S. I am aware that the final data frame that the above code gives has redundant row names. However, this is a helpful way of making sure that the final data frame’s subject and measure information lines up with the labels I had applied to earlier R objects.

did you use `>`'s for the blockquote effect or does the data itself have `>`'s at the beginning of the lines? and, are there spaces between the lines? — hrbrmstr, Aug 22 '16 at 18:17
I used `>`'s only for the blockquote effect. The actual data does not have those. There are not spaces between the lines, but my code should still work with blank lines due to the `strip.white=T` argument. — GH28, Aug 22 '16 at 19:22
Don't use blockquote effect - format as code. It will be much clearer. — Gregor Thomas, Aug 22 '16 at 22:05

hrbrmstr · Accepted Answer · 2016-08-23T20:02:22.043

This does everything but the rownames:

lines <- readLines(textConnection("Date: 01/01/2016
Time: 12:00:00
Subject: Subject1
A:
1: 1 2 4 1
2: 2 1 2 3
3: 1 0 2 7
B:
1: 2 3 0 1
2: 4 1 1 2
3: 3 5 2 8
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2

3: 3 3 2 4
3: 3 3 2 4"))

Some libraries we'll need for the non-base R solution:

library(purrr)
library(tibble)
library(tidyr)
library(dplyr)

Trim whitespace and filter out blank lines:

trimws(lines) %>% discard(`==`, "") -> lines

This makes a vector of the indexes in lines where the records start (which is designated by finding Date: at the beginning of a line):

starts <- which(grepl("^Date:", lines))

Now, we take those starts and look for the next occurrence of Date: (i.e. the next record). It's going to find them all, so we only care about the first one. To calculate that index, we add the start index and subtract 1. In theory there will only be one NA (i.e. the last record) but we lazily use ifelse vs just change he last one.

ends <- map_dbl(starts, function(i) {
  which(grepl("^Date:", lines[(i+1):length(lines)]))[1]+i-1
})
ends <- ifelse(is.na(ends), length(lines), ends)

So, now starts contains the indexes of the start of each record and ends contains the indexes of the ends of each record.

The map2_df() is super handy pseudo-wrapper for mapply() & do.call(rbind,…). We use the fact that these are in DCF format (key: value) and use read.dcf(). That makes a matrix and we then re-orient it and turn it into a data.frame.

We then separate the values, add the row names to make a time_interval column, add in the date, time and subject and make sure the columns are the right type.

We also use the fact that map2_df() will use the named list "keys" as a column if we tell it to.

Finally, we reorder the columns.

So, this will iterate over starts and ends and pass each iteration into start and end:

map2_df(starts, ends, function(start, end) {

  # now, we extract just the current record into `record` by pulling
  # out lines by the indexes.

  record <- lines[start:end]

  # we then use `read.dcf` to read in the date/subject/time values:

  header <- as.data.frame(read.dcf(textConnection(record[1:3])))

  # Since we do not have blank lines and you said the records were
  # uniform we can use the fact that they'll be at known index
  # positions in this `record`. So, we make a list of two vectors
  # which are the indexes. Each becomes `i` (two total iterations)
  # and we use the value in `i` to extract out the three lines from
  # `record` and read those via `read.dcf`.

  # But that reads things into a matrix and in an unhelpful order
  # so we transpose it into shape and make it a data frame since
  # we'll ultimately need that.

  # We use `separate` to take the single character space-separated
  # `V1` column and turn it into 4 columns. `read.dcf` gave us
  # named rows for each time interval so we promote that to a 
  # full-on column and then add in date/time/subject, ensuring
  # they are characters and not factors, then ensure that the 
  # values we split out from `V1` are numeric and not character or
  # factor.

  # `map_df` can add in the `A` and `B` from the named list we passed
  # in for us and we have it call that column `measure`.

  # finally, we put the columns in a better order.

  map_df(list(A=5:7, B=9:11), function(i) {
    read.dcf(textConnection(record[i])) %>%  
      t() %>% as_data_frame() %>%
      separate(V1, sprintf("V%d", 1:4)) %>%
      rownames_to_column("time_interval") %>%
      mutate(date=as.character(header$Date),
             time=as.character(header$Time),
             subject=header$Subject) %>%
      mutate_at(vars(starts_with("V")), as.numeric)

  }, .id="measure")

}) %>% 
  select(date, time, subject, measure, time_interval, V1, V2, V3, V4)

That produces the following output:

## # A tibble: 18 x 9
##          date     time  subject measure time_interval    V1    V2    V3    V4
##         <chr>    <chr>    <chr>   <chr>         <chr> <dbl> <dbl> <dbl> <dbl>
## 1  01/01/2016 12:00:00 Subject1       A             1     1     2     4     1
## 2  01/01/2016 12:00:00 Subject1       A             2     2     1     2     3
## 3  01/01/2016 12:00:00 Subject1       A             3     1     0     2     7
## 4  01/01/2016 12:00:00 Subject1       B             1     2     3     0     1
## 5  01/01/2016 12:00:00 Subject1       B             2     4     1     1     2
## 6  01/01/2016 12:00:00 Subject1       B             3     3     5     2     8
## 7  01/01/2016 12:00:00        2       A             1     8     2     0     1
## 8  01/01/2016 12:00:00        2       A             2     9     1     2     7
## 9  01/01/2016 12:00:00        2       A             3     1     6     2     7
## 10 01/01/2016 12:00:00        2       B             1     2     3     2     0
## 11 01/01/2016 12:00:00        2       B             2     6     7     1     2
## 12 01/01/2016 12:00:00        2       B             3     3     3     2     4
## 13 01/01/2016 12:00:00        2       A             1     8     2     0     1
## 14 01/01/2016 12:00:00        2       A             2     9     1     2     7
## 15 01/01/2016 12:00:00        2       A             3     1     6     2     7
## 16 01/01/2016 12:00:00        2       B             1     2     3     2     0
## 17 01/01/2016 12:00:00        2       B             2     6     7     1     2
## 18 01/01/2016 12:00:00        2       B             3     3     3     2     4

If you really need a base R solution then:

do.call(rbind, mapply(function(start, end) {

  record <- lines[start:end]
  header <- as.data.frame(read.dcf(textConnection(record[1:3])))

  do.call(rbind, lapply(list(A=5:7, B=9:11), function(i) {
    mat <- as.data.frame(t(read.dcf(textConnection(record[i]))))
    mat <- matrix(unlist(apply(mat, 1, strsplit, split=" "), use.names=FALSE), ncol=4, byrow=TRUE)
    mat <- as.data.frame(mat)
    mat$time_interval <- 1:3
    mat$date <- as.character(header$Date)
    mat$time <- as.character(header$Time)
    mat$subject <- as.character(header$Subject)
    mat
  })) -> df

  df$measure <- gsub("\\..*$", "", rownames(df))
  rownames(df) <- NULL
  df

}, starts, ends, SIMPLIFY=FALSE)) -> out_df
out_df[,c("date", "time", "subject", "measure", "time_interval", "V1", "V2", "V3", "V4")]

##          date     time  subject measure time_interval V1 V2 V3 V4
## 1  01/01/2016 12:00:00 Subject1       A             1  1  2  4  1
## 2  01/01/2016 12:00:00 Subject1       A             2  2  1  2  3
## 3  01/01/2016 12:00:00 Subject1       A             3  1  0  2  7
## 4  01/01/2016 12:00:00 Subject1       B             1  1  2  4  1
## 5  01/01/2016 12:00:00 Subject1       B             2  2  1  2  3
## 6  01/01/2016 12:00:00 Subject1       B             3  1  0  2  7
## 7  01/01/2016 12:00:00        2       A             1  8  2  0  1
## 8  01/01/2016 12:00:00        2       A             2  9  1  2  7
## 9  01/01/2016 12:00:00        2       A             3  1  6  2  7
## 10 01/01/2016 12:00:00        2       B             1  8  2  0  1
## 11 01/01/2016 12:00:00        2       B             2  9  1  2  7
## 12 01/01/2016 12:00:00        2       B             3  1  6  2  7
## 13 01/01/2016 12:00:00        2       A             1  8  2  0  1
## 14 01/01/2016 12:00:00        2       A             2  9  1  2  7
## 15 01/01/2016 12:00:00        2       A             3  1  6  2  7
## 16 01/01/2016 12:00:00        2       B             1  8  2  0  1
## 17 01/01/2016 12:00:00        2       B             2  9  1  2  7
## 18 01/01/2016 12:00:00        2       B             3  1  6  2  7

Thanks, this looks very promising; I will check it out further. However, ideally I would end up with a data frame reminiscent of what I achieved with my original code above. I will tinker around with your code to see if I can figure out how to reform it into a data frame. Would do.call() be appropriate? — GH28, Aug 22 '16 at 20:56
I completed the data frame creation. If you really need the rownames that shld be trivial as shld removing the extra columns if you really don't want them. — hrbrmstr, Aug 22 '16 at 21:40
Your code requires the `tibble` and `tidyr` packages, right? — GH28, Aug 23 '16 at 16:44
`library()` calls added to original solution. base R solution added. — hrbrmstr, Aug 23 '16 at 17:24
Thanks for the additional solution! I am still a little confused about a couple of things. — GH28, Aug 23 '16 at 19:31
(1) Could you describe what exactly is happening with the two assignments to `ends` (especially `function(i) { which(grepl("^Date:", lines[(i+1):length(lines)]))[1]+i-1 }`)? I understand the purpose but not the scripting. Where is `i` defined? (2) What is the difference between `starts` and `ends` vs `start` and `end` in the last part of your first solution? Thanks! — GH28, Aug 23 '16 at 19:49

G. Grothendieck · Answer 2 · 2016-08-23T23:21:53.123

It is not clear that a data frame is the most convenient way to represent this data. The following shows three alternate outputs:

three arrays -- a matrix with one row per subject having Date time and subject columns, an A array such that A[,,i] is the A matrix for the ith subject and a B array such that B[,,i] is the B matrix for the ith subject. No packages are used.
a data frame in wide form
a data frame in long form

No packages are used.

For all three, read the file into a character vector Lines. Then remove any blank lines using grep -- we could omit this step if we knew there were no blank lines. Then split Lines into subject groups s. Then lapply over the subject groups and within each grab the date, time and subject from the first three lines and the two matrices from lines 5:7 and 9:11 in a list with one component per subject. The key code to produce a list L easily reworked into different formats is just this:

Lines <- readLines("example_file.txt")
Lines <- grep("^\\s*$", Lines, value = TRUE, invert = TRUE)
s <- split(Lines, cumsum(grepl("^Date:", Lines)))
L <- lapply(s, function(x) list(read.dcf(textConnection(x[1:3])), 
           A = as.matrix(read.table(text = sub(":", "", x[5:7]), row.names = 1)),
           B = as.matrix(read.table(text = sub(":", "", x[9:11]), row.names = 1))))
names(L) <- sapply(L, function(x) x[[1]][, "Subject"])

Given L we can easily create various output formats by using lapply over it. Each of the three formats is shown in a separate section below. The output is shown at the end so as not to break up the code.

three arrays

We could use L as is but may be more convenient to to convert L to three arrays: (1) ident which is a 3 column matrix having as many rows as subjects with the Date, time and Subject for each, (2) A which is a 3d array such that A[,,i] is the A matrix for the ith subject, (3) B which is a 3d array such that B[,,i] is the B matrix for the ith subject.

ident <- do.call(rbind, lapply(L, "[[", 1))
A <- simplify2array(lapply(L, "[[", 2))
B <- simplify2array(lapply(L, "[[", 3))

data.frame - wide form

DF <- do.call(rbind, lapply(L, function(x) data.frame(x[[1]], x[[2]], x[[3]])))
names(DF)[4:7] <- "A"
names(DF)[8:11] <- "B"
rownames(DF) <- NULL

data.frame - long form

DF2 <- do.call(rbind, lapply(L, function(x)
          data.frame(x[[1]], rbind(cbind(AB = "A", x[[2]]), cbind(AB = "B", x[[3]])))))
rownames(DF2) <- NULL

output -- three arrays

> ident
     Date         Time       Subject   
[1,] "01/01/2016" "12:00:00" "Subject1"
[2,] "01/01/2016" "12:00:00" "2"       
> A
, , Subject1

  V2 V3 V4 V5
1  1  2  4  1
2  2  1  2  3
3  1  0  2  7

, , 2

  V2 V3 V4 V5
1  8  2  0  1
2  9  1  2  7
3  1  6  2  7

> B
, , Subject1

  V2 V3 V4 V5
1  2  3  0  1
2  4  1  1  2
3  3  5  2  8

, , 2

  V2 V3 V4 V5
1  2  3  2  0
2  6  7  1  2
3  3  3  2  4

output -- data frame wide form

> DF

        Date     Time  Subject A A A A B B B B
1 01/01/2016 12:00:00 Subject1 1 2 4 1 2 3 0 1
2 01/01/2016 12:00:00 Subject1 2 1 2 3 4 1 1 2
3 01/01/2016 12:00:00 Subject1 1 0 2 7 3 5 2 8
4 01/01/2016 12:00:00        2 8 2 0 1 2 3 2 0
5 01/01/2016 12:00:00        2 9 1 2 7 6 7 1 2
6 01/01/2016 12:00:00        2 1 6 2 7 3 3 2 4

output - data frame long form

> DF2
         Date     Time  Subject AB V2 V3 V4 V5
1  01/01/2016 12:00:00 Subject1  A  1  2  4  1
2  01/01/2016 12:00:00 Subject1  A  2  1  2  3
3  01/01/2016 12:00:00 Subject1  A  1  0  2  7
4  01/01/2016 12:00:00 Subject1  B  2  3  0  1
5  01/01/2016 12:00:00 Subject1  B  4  1  1  2
6  01/01/2016 12:00:00 Subject1  B  3  5  2  8
7  01/01/2016 12:00:00        2  A  8  2  0  1
8  01/01/2016 12:00:00        2  A  9  1  2  7
9  01/01/2016 12:00:00        2  A  1  6  2  7
10 01/01/2016 12:00:00        2  B  2  3  2  0
11 01/01/2016 12:00:00        2  B  6  7  1  2
12 01/01/2016 12:00:00        2  B  3  3  2  4

Out of curiosity, how would you recommend transforming this output into a consolidated data frame? Especially if you use only base R? — GH28, Aug 23 '16 at 16:48

Import unusually formatted text data using R

2 Answers2

Linked