R or python regex for data extraction from text file

Question

I have a text file of the following format: //

DATASET

..... unnecessary lines.....

TIMEUNITS SECONDS

    TS 0  1.98849600e+08
        3.30000000e-03    1.25400000e-02    5.88000000e-03    0.00000000e+00    0.00000000e+00
        5.88000000e-03    3.33000000e-03    2.16000000e-03    0.00000000e+00    0.00000000e+00
    TS 0  1.98853209e+08
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
        1.25400000e-02    5.88000000e-03    3.33000000e-03    0.00000000e+00    0.00000000e+00
    TS 0  1.98860419e+08
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e+00    0.00000000e+00
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
    TS 0  1.98864081e+08
        1.08000000e-03    8.70000000e-04    7.20000000e-04    0.00000000e+00    0.00000000e+00
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
    TS 0  1.98867619e+08
        0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00    0.00000000e+00
        3.33000000e-03    2.16000000e-03    1.08000000e-03    0.00000000e+00    0.00000000e+00

I am also attaching the sample text file named "D50.bc" in this link: https://drive.google.com/file/d/1P5aFC0JsRLhwuUo7JENLg03DbDJ696lk/view?usp=sharing.

There is no column names but it is fine to add column names i.e., V1, V2 etc. In the real text file there are 14 columns and 1000s of lines/rows after each TS. Each row corresponds to a node and the columns corresponds to certain values(i.e., velocity/shear stress etc.) at that node

I want to extract all the data/tables under the lines "TS 0 XXX" into individual dfs based on their timestamps(TS) so that I can do column operations for each TS. The TS value XXXX which is in second can be added as a separate column in the dfs. Below is the regular expression I have tried in R but it returns nothing. Any help is highly appreciated.

  library(dplyr)
    rm(list=ls(all=TRUE))
    a <- paste0(readLines("D:/D50 python/D50.bc"), collapse = "\n")
    b <- regmatches(a, gregexpr("(?s)^TS[^\n]+0.*?\n\\K.*?\n\b", a,, perl = T))[[1]]

I don't see hints of column names. Does this mean that with this sample data (in the code block), it would produce a list of five data.frames, each with two rows, and columns named `V1` through `V5`? What happens to the numbers on the same line as `TS 0`? (It would help to see a literal structure of what you are trying to accomplish.) — r2evans, Nov 20 '21 at 00:18
Yes, There is no column names but it will be fine to add column names i.e., V1, V2 etc. In the real text file there are 14 columns and 1000s of lines/rows after each TS. Each row corresponds to a node and the columns corresponds to certain values(i.e., velocity/shear stress etc.) at that node. Thanks much. — ZVY545, Nov 20 '21 at 00:22

r2evans · Accepted Answer · 2021-11-20T01:23:50.637

spltxt <- split(txt, cumsum(grepl("^\\s*TS 0 ", txt)))[-1]
alldat <- Map(function(S, grp) {
  out <- read.table(text = S[-1], header = FALSE)
  out$grp <- grp
  out$node <- seq_len(nrow(out))
  TS <- trimws(strsplit(S[1], "\\s+")[[1]])
  out$TS <- as.numeric(TS[length(TS)])
  out$TS0 <- S[1]
  out
}, spltxt, seq_along(spltxt))
out <- do.call(rbind, alldat)
out
#          V1      V2      V3 V4 V5 grp node        TS                      TS0
# 1.1 0.00330 0.01254 0.00588  0  0   1    1 198849600     TS 0  1.98849600e+08
# 1.2 0.00588 0.00333 0.00216  0  0   1    2 198849600     TS 0  1.98849600e+08
# 2.1 0.00000 0.00000 0.00000  0  0   2    1 198853209     TS 0  1.98853209e+08
# 2.2 0.01254 0.00588 0.00333  0  0   2    2 198853209     TS 0  1.98853209e+08
# 3.1 0.00333 0.00216 0.00108  0  0   3    1 198860419     TS 0  1.98860419e+08
# 3.2 0.00000 0.00000 0.00000  0  0   3    2 198860419     TS 0  1.98860419e+08
# 4.1 0.00108 0.00087 0.00072  0  0   4    1 198864081     TS 0  1.98864081e+08
# 4.2 0.00000 0.00000 0.00000  0  0   4    2 198864081     TS 0  1.98864081e+08
# 5.1 0.00000 0.00000 0.00000  0  0   5    1 198867619     TS 0  1.98867619e+08
# 5.2 0.00333 0.00216 0.00108  0  0   5    2 198867619     TS 0  1.98867619e+08

Walk-through:

cumsum(grepl(...)) groups the data by leading TS 0, so that we can split them up into individual matrices:
```
cumsum(grepl("^\\s*TS 0 ", txt))
#  [1] 0 0 0 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
```
We split the text up by these numbers, so 0 contains all of the preamble; 1 includes the first TS 0 group, etc. In this case, the 0s correspond to all text before the first TS 0, so we can discard the first segment with [-1].
Map(... iterates our function(S, grp) over each of the numbered groups as well as counting along the groups, called five times in this case.
read.table(...) makes short work of converting blank-space delimited numbers into a data.frame without pre-defined column names. The normal frame $-assignment takes care of the group and node values, as well as parsing out the timestamp numeric in the TS 0 line.
do.call(rbind, alldat) is one way to combine all frames in a list into a single frame; alternatives include dplyr::bind_rows(alldat) and data.table::rbindlist(alldat). (All of these are significantly better/faster than iteratively building a frame in a for loop.)

Assumption:

all matrix groups have the same number of columns

r2evans, awesome! Would you please also tell me if I want to separate a certain node i.e., node 1 which is 1.1,2.1, 3.1, 4.1 in a df or a certain timestamp/TSO i.e., 1.98867619e+08 values then how to do that? Thanks much! — ZVY545, Nov 20 '21 at 01:01
I think you're asking how to subset the resulting frame based on the "node" (which are near-arbitrary row names assigned by `rbind`) and/or some value. If so, then that's simple R work in `data.frame`, perhaps using `base::subset` or `dplyr::filter` or similar. If the "node" (first group, second row within the group) is important, then make sure you collect the row names into a column variable since, in my opinion, row names are fragile in many R workflows (and, as already requested once, it would have helped to know your expected output structure). — r2evans, Nov 20 '21 at 01:04
r2evans, the output structure that you have given will serve my purpose, It will be good only to add another column for the node numbers, first row is node 1, second row is node 2. So, basically in my codeblock there are only two nodes for each of the TS. I would like to perform some column operations for certain nodes, for my desired/all-time steps. I am going to accept your answers. Brilliant! — ZVY545, Nov 20 '21 at 01:14
Thanks, much for your awesomeness, and for the explanation of the code will help me a lot to learn!!! — ZVY545, Nov 20 '21 at 01:43
r2evans, how do you read the .bc file into txt? Using readtext or readLines reads the file but spltxt is not returning anything? Thanks. — ZVY545, Nov 20 '21 at 02:24
readLines just worked without using collapse argumen. Thanks. — ZVY545, Nov 20 '21 at 02:28

R or python regex for data extraction from text file

1 Answers1