How to split a file in several files by a condition in R

Question

I have an excel file (my.file) with a situation similar to this:

TABLE   1           
DATE    2/15/18         
Amount  36          
Nº  Tag Sex IN WEIGHT   OUT WEIGHT
1   178 F     28.2        399.0
2   315 F     30.1        399.0
3   346 F     30.6        399.0
4   379 F     37.0        399.0
              31.5        399.0

TABLE   2           
DATE    5/20/18         
Amount  25          
Nº  Tag Sex IN WEIGHT   OUT WEIGHT
1   40  F      35.0        0.0
2   231 F      50.3        0.0
3   243 F      38.0        0.0
4   306 F      35.8        0.0
5   424 F      48.3        0.0
               41.5        0.0

An image of my file:

So, the last row for each table is the Mean, and I would like to split this file in two, like this:

file 1

Nº  Tag Sex IN WEIGHT   OUT WEIGHT
1   178 F     28.2        399.0
2   315 F     30.1        399.0
3   346 F     30.6        399.0
4   379 F     37.0        399.0

file 2

Nº  Tag Sex IN WEIGHT   OUT WEIGHT
1   40  F      35.0        0.0
2   231 F      50.3        0.0
3   243 F      38.0        0.0
4   306 F      35.8        0.0
5   424 F      48.3        0.0

I tried this:

split(my.file, with(my.file, interaction("Nº","Tag","Sex","IN WEIGHT","OUT WEIGHT")), drop = TRUE)

But didn't work.

OBS: my real file has much more tables, columns and rows. Thanks.

*"But didn't work"* is vague, can you be more clear? If there are errors or warnings, please include the exact text. If it just doesn't look right, well, we still need to know what doesn't work about it. Since we don't have `my.file`, it's difficult to speculate on this. — r2evans, Oct 08 '19 at 00:10
Sorry, I used the example like `my.file`, I tried that code, but I didn't get what I wanted. — Curious G., Oct 08 '19 at 00:21
(1) is `my.file` a string indicating the file name? or is it the result of some command that reads it in as a too-big `data.frame`? (2) *"didn't get what I wanted"* is **still too vague**. Perhaps #2 will become moot if we find out how you read it in and can find a way to replicate what you are doing. Perhaps instead (since I suspect the "excel" part of this is not important): can you paste the output of `dput(head(my.file))` into your question? We don't need everything, just as much as you've provided above ... but it is an unambiguous way for us to play with your data. — r2evans, Oct 08 '19 at 00:25
Sorry again, but I thought a small example could be used in a larger one, because the logic is the same. — Curious G., Oct 08 '19 at 00:31
The issue is this: you start by saying there is a single Excel files with two tables. Then you use `split` (typically on lists) and `interaction` (no idea how that is coming into play), I still don't know what `my.file` is. Nor how you read in from the Excel file. So I don't know how to help. — r2evans, Oct 08 '19 at 05:05
@r2evans, I used this because I'm not an expert in R, so I made a research and I used `split`, because I imagined that I could to use to solve my problem. But I couldn't. And to show that I tried something before to post my question, and was a form to know if I was in the correct way. But I don't, and it's not necessary to use this approach, was just an attempt. Basically, I just want to solve the simple example, `my.file`, a file with only two tables. I think if I can understand this example, I will be able to solve my real problem. — Curious G., Oct 08 '19 at 12:01
Or I'm not a native english speaker, maybe the idea of split be different, I would like to separate this file in another two, one for each table. — Curious G., Oct 08 '19 at 12:10
`split` is typically used on lists or vectors, for example `x <- 1:10; split(x, x%%2==0)` (split on whether a number is odd or even) will return a list of length two. `interaction` is used for representing interaction of the given factors, I don't know how it could be applied here. (I'm not knocking that you don't understand those functions -- you have to start somewhere -- and your english is good! I would not have guessed. My comments are based on vague points of your question.) — r2evans, Oct 08 '19 at 14:48
I see, and I appreciate your time and support, =D, I will try to improve my questions next time. But I thinks is that, I would like to separate the tables and after save in different files. And again thanks for your considerations. =) — Curious G., Oct 08 '19 at 15:12

score 1 · Accepted Answer · answered Oct 08 '19 at 15:52

We can use a relative of split (the by function), but not in the way you think.

First, for a reproducible answer, I'll generate an XLSX:

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "MySheet")
openxlsx::writeData(wb, 1, data.frame(x=c("TABLE", "DATE", "Amount"), y=c("1", "2/15/18", "36")),
                    startCol=1, startRow=1, colNames=FALSE, rowNames=FALSE)
openxlsx::writeData(wb, 1, mtcars[1:3,1:5],
                    startCol=1, startRow=4, colNames=TRUE, rowNames=FALSE)
openxlsx::writeData(wb, 1, data.frame(x=c("TABLE", "DATE", "Amount"), y=c("2", "5/20/18", "25")),
                    startCol=1, startRow=9, colNames=FALSE, rowNames=FALSE)
openxlsx::writeData(wb, 1, mtcars[11:13,1:5],
                    startCol=1, startRow=12, colNames=TRUE, rowNames=FALSE)
openxlsx::saveWorkbook(wb, "quux.xlsx")

Which generates a single worksheet with the following:

From here, I'll first read in the whole page, knowing that this will be "not usable the way we want":

my.file <- openxlsx::read.xlsx("quux.xlsx", skipEmptyRows = FALSE, colNames = FALSE)
my.file
#        X1      X2    X3   X4   X5
# 1   TABLE       1  <NA> <NA> <NA>
# 2    DATE 2/15/18  <NA> <NA> <NA>
# 3  Amount      36  <NA> <NA> <NA>
# 4     mpg     cyl  disp   hp drat
# 5      21       6   160  110  3.9
# 6      21       6   160  110  3.9
# 7    22.8       4   108   93 3.85
# 8    <NA>    <NA>  <NA> <NA> <NA>
# 9   TABLE       2  <NA> <NA> <NA>
# 10   DATE 5/20/18  <NA> <NA> <NA>
# 11 Amount      25  <NA> <NA> <NA>
# 12    mpg     cyl  disp   hp drat
# 13   17.8       6 167.6  123 3.92
# 14   16.4       8 275.8  180 3.07
# 15   17.3       8 275.8  180 3.07

(I printed out the whole table, because (1) I wanted to demonstrate the table groups, and (2) because I know it's not huge. In your case, it will likely be rather large, so feel free to practice judicious use of head or related views on your larger frame.)

What we're going to do is use the first column (X1) of this wrong-full-frame and find which blocks of rows start with a known header. I'm using the first one, but really any known column can be used to determine when a "real data table" starts post-header.

cumsum(is.na(my.file[[1]]) | my.file[[1]] == "mpg")
#  [1] 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

This tells us that rows 4-7 of this wrong-full-frame are likely grouped together. It also thinks that rows 1-3 are grouped, but that's because every row is in a group, whether a usable-group or not-usable-group. Because of this, when we grab the data for real, we'll need to check if the group we're working on is "usable" or "not-usable" based on whether the known column name is present.

From here, we're going to repeatedly read from the file based on the group type and location.

alltables <- by(
  seq_len(nrow(my.file)),
  cumsum(is.na(my.file[[1]]) | my.file[[1]] == "mpg"),
  function(ind) {
    v <- my.file[[1]][ind[1]]
    if (!is.na(v) && v == "mpg")
      openxlsx::read.xlsx("quux.xlsx", rows=ind)
  })

alltables <- Filter(Negate(is.null), alltables)
alltables
# $`1`
#    mpg cyl disp  hp drat
# 1 21.0   6  160 110 3.90
# 2 21.0   6  160 110 3.90
# 3 22.8   4  108  93 3.85
# $`3`
#    mpg cyl  disp  hp drat
# 1 17.8   6 167.6 123 3.92
# 2 16.4   8 275.8 180 3.07
# 3 17.3   8 275.8 180 3.07

While we take a performance hit for repeatedly reading the xlsx file, this ensures that we get the full gamut of formatting and classification of columns; had we extracted it directly from my.file, everything would likely have been character, and we would need to do all class conversions manually. (I wish there were an easy way to openxlsx::loadWorkbook and repeatedly read from the same handle.)

In your example, you have a summary row. Since it does not have anything in the first column, it is interpreted in my above code as "an empty row" and will not be imported into the frames.

If you really want this row, then you might need to include more logic into the cumsum to locate the additional row; or you can include the first row after each group, assuming that there will be something usable. Note that in the latter case here, you'll be responsible for fixing NA values ... and frankly, storing summary statistics in the same frame as the data is not really a "best practice".
If you get these extra rows and don't want them, you can choose to filter them post-read, or you can selectively remove the last row from ind based on criteria specific to your data.

Follow-up: you didn't ask, but there are three things you can now do with allframes:

If they are all the same format (column names), and ultimately they need to be grouped together, then you can combine them into one frame with one of the following, depending on packages you may be using:
- (base R) do.call(rbind, alltables)
- dplyr::bind_rows(alltables)
- data.table::rbindlist(alltables)
Note that the latter two will take more effort to align columns if some tables have extra/missing columns and/or columns in a different order; the base-R version is not tolerant of any column-differences.
If they are all the same format but you need to keep them independent, then I recommend you keep them in this list: it is highly likely that when you do something to one frame you will be doing the identical task to all other frames in the list, so it is recommended that you do this "thing" within lapply. (See How do I make a list of data frames?).
You can assign them to individual frames within the environment. (If they are the same format, then I strongly urge #1 or #2 above. If they are different formats, then this one might really make sense.)

Perfect, that is it. And I really appreciate your patience and help. You are the best. — Curious G., Oct 08 '19 at 17:01

How to split a file in several files by a condition in R

1 Answers1

Linked