Is there a better way in R to split a file with multiple sections

Question

I am reading in a CSV file into R that contains multiple sections, the sections contain different headers, number of rows and columns. Example table below, I need to separate them into different data frames so I can process them based on the section type.

The number of sections can change and so far I have only figured out how to hard code them and use grep to split the different sections into different data frames.

The sections are all in the same format ==XY== where X is a letter and Y is a number

Is there a better way to split the data frame into a different data frame for each section no matter how many sections there are?

z1 <- structure(list(V1 = c("==C5===", "H1", "1", "3", "8", "==E5===", 
                            "H1", "10", "2", "==G6===", "H1", "5", "==H4===", "H1", "1", 
                            "==H6===", "H1", "10"), V2 = c("", "H2", "9", "8", "1", "", "H2", 
                                                           "4", "2", "", "", "", "", "H2", "8", "", "", ""), V3 = c("", 
                                                                                                                    "H3", "2", "5", "6", "", "", "", "", "", "", "", "", "", "", 
                                                                                                                    "", "", "")), class = "data.frame", row.names = c(NA, -18L))
DF1 <- z1[grep("==C5", z1$V1):grep("==E5", z1$V1),]
DF2 <- z1[grep("==E5", z1$V1):grep("==G6", z1$V1),]
DF3 <- z1[grep("==G6", z1$V1):grep("==H4", z1$V1),]
DF4 <- z1[grep("==H4", z1$V1):grep("==H6", z1$V1),]
DF5 <- z1[grep("==H6", z1$V1):nrow(z1),]

score 2 · Accepted Answer · answered Feb 21 '20 at 14:26

Assuming you are starting out with a csv, here is a way to read in each specific set of rows.

write.table(z1, "tmp.csv", na ="", row.names = FALSE, 
            col.names = FALSE, sep = ",")

tmp <- readLines("tmp.csv")

# start of each 'file'
sof <- grep("==", tmp)

# the actual start is one past that
real_start <- sof + 1

# figure out the end of each unique df
real_end <- c(sof[-1] - 1, length(tmp))

# the number of rows to read in
to_read <- real_end - real_start

# a list to store your data.frames
my_dfs <- vector("list", length = length(real_start))
for(i in 1:length(my_dfs)){
  # suppressing warnings, as there are a lot 
  #  that come up when a column header is not
  #  in a specific data.frame
  my_dfs[[i]] <- suppressWarnings(
    data.table::fread("tmp.csv",
                      skip = sof[i],
                      nrows = to_read[i],
                      fill = FALSE,
                      check.names = FALSE,
                      data.table = FALSE,
                      select = c("H1", "H2", "H3")
    )
  )
}

The output of this is:

[[1]]
  H1 H2 H3
1  1  9  2
2  3  8  5
3  8  1  6

[[2]]
  H1 H2
1 10  4
2  2  2

[[3]]
  H1
1  5

[[4]]
  H1 H2
1  1  8

[[5]]
  H1
1 10

If you do not know the column headers you could use read.csv instead, but you end up with some NA columns.

my_dfs2 <- vector("list", length = length(real_start))
for(i in 1:length(my_dfs2)){
  # suppressing warnings, as there are a lot 
  #  that come up when a column header is not
  #  in a specific data.frame
  my_dfs2[[i]] <- 
             read.csv("tmp.csv",
                      skip = sof[i],
                      nrows = to_read[i],
                      fill = FALSE
             )
}

The output from this is:

[[1]]
  H1 H2 H3
1  1  9  2
2  3  8  5
3  8  1  6

[[2]]
  H1 H2  X
1 10  4 NA
2  2  2 NA

[[3]]
  H1  X X.1
1  5 NA  NA

[[4]]
  H1 H2  X
1  1  8 NA

[[5]]
  H1  X X.1
1 10 NA  NA

You could then remove the NA columns as the next processing step.

score 1 · Answer 2 · answered Feb 21 '20 at 13:59

You can try:

lapply(split(z1, cumsum(grepl("=", z1$V1))), tail, -1)

$`1`
  V1 V2 V3
2 H1 H2 H3
3  1  9  2
4  3  8  5
5  8  1  6

$`2`
  V1 V2 V3
7 H1 H2   
8 10  4   
9  2  2   

$`3`
   V1 V2 V3
11 H1      
12  5      

$`4`
   V1 V2 V3
14 H1 H2   
15  1  8   

$`5`
   V1 V2 V3
17 H1      
18 10

G. Grothendieck · Answer 3 · 2020-02-21T16:17:49.253

Below we assume that what you really want is to have the H1, H2, ... as the column names of each data.frame created.

Paste the columns together giving p, remove the == elements giving p2, form a grouping variable g and split p2 on g using read.table to read in each component. No packages are used.

p <- do.call(paste, z1)
p2 <- p[!grepl("^==", p)]
g <- cumsum(grepl("^\\D", p2))
L <- lapply(split(p2, g), function(x) read.table(text = x, header = TRUE))
L

giving this list of data frames:

$`1`
  H1 H2 H3
1  1  9  2
2  3  8  5
3  8  1  6

$`2`
  H1 H2
1 10  4
2  2  2

...etc...

Below we discuss some additional things we can optionally do with L.

Loose data frames

Although not really recommended if you really want to create loose data frames hanging out in your global environment you could do this:

names(L) <- paste0("DF", names(L))
list2env(L, .GlobalEnv)

Using title

If the title on the == lines has meaning we could use them instead of 1, 2, 3, ... like this:

names(L) <- trimws(gsub("=", "", grep("^==", p, value = TRUE)))
L

giving:

$C5
  H1 H2 H3
1  1  9  2
2  3  8  5
3  8  1  6

$E5
  H1 H2
1 10  4
2  2  2

...etc...

Creating a data frame from the list

We could use rbindlist in data.table to create a data frame/data.table from L like this:

library(data.table)
rbindlist(L, fill = TRUE, id = "id")

giving the following where id identifies the component number.

   id H1 H2 H3
1:  1  1  9  2
2:  1  3  8  5
3:  1  8  1  6
4:  2 10  4 NA
5:  2  2  2 NA
6:  3  5 NA NA
7:  4  1  8 NA
8:  5 10 NA NA

or this if you had set the titles on the == lines as the names:

     id H1 H2 H3
1: C5    1  9  2
2: C5    3  8  5
3: C5    8  1  6
4: E5   10  4 NA
5: E5    2  2 NA
6: G6    5 NA NA
7: H4    1  8 NA
8: H6   10 NA NA

Is there a better way in R to split a file with multiple sections

3 Answers3

Loose data frames

Using title

Creating a data frame from the list

Linked