Split dynamic CSV file into 3 separate files using Bash script with either Awk, Sed, Grep, etc

Question

I have seen similar questions all over, but none seems close to what I'm trying to achieve.

I have a dynamic CSV file (tab-delimited) that gets appended to each hour. Note that only the number of rows underneath HEADER 1 and HEADER 2 increases every hour. Please see the two examples below for reference.

Example of FileA.csv at the 3rd hour

HEADER 1 NUM
hour 1   5
hour 2   10
hour 3   15
HEADER 2 NUM
hour 1   3
hour 2   6
hour 3   9
HEADER 3 NUM
age      23
bus      21
pig      07
dog      40

Example of FileA.csv at the 7th hour

HEADER 1 NUM
hour 1   5
hour 2   10
hour 3   15
hour 4   20
hour 5   25
hour 6   30
hour 7   35
HEADER 2 NUM
hour 1   3
hour 2   6
hour 3   9
hour 4   12
hour 5   15
hour 6   18
hour 7   21
HEADER 3 NUM
age      13
bus      28
pig      85
dog      55

The rows underneath Header 1 and Header 2 increase each hour. Header 3 and below is the only thing that remains constant

So what I'm trying to achieve is simply to separate FileA.csv into ABC.csv, DEF.csv, and GHI.csv.

Using the 3rd hour example, this is the result I'm trying to achieve:

ABC.csv

HEADER 1 NUM
hour 1   5
hour 2   10
hour 3   15

DEF.csv

HEADER 2 NUM
hour 1   3
hour 2   6
hour 3   9

GHI.csv

HEADER 3 NUM
age      23
bus      21
pig      07
dog      40

Below is what I tried to do using grep, but I can't combine grep and cut to achieve this. I've also tried sed, but I'm not sure how to cut and move the text after searching for it. I know this can be achieved with awk, but I'm not strong in awk.

First, cut out HEADER 3 and all the rows below it and put them into GHI.csv, since that block will always be constant. That way we are left with HEADER 1 and HEADER 2.
Then, cut out HEADER 2 and everything below it by searching for the header name, and move it with all the rows beneath it into DEF.csv.
Finally, we are left with HEADER 1, which we either leave in FileA.csv or move to ABC.csv.

Please help. Thanks.

_Header 3 and below is the only thing that remains constant_: The examples you show contradict this statement. After 7 hours the `HEADER 3 NUM` block is not the same as after 3 hours. — Renaud Pacalet, Aug 02 '23 at 05:42
From the answers you received, were you able to do what you wanted to do? I am surprised that this question was closed. It seemed like a straightforward and clear question to me. — zedfoxus, Aug 02 '23 at 16:31
@zedfoxus not yet. i'm currently editing the question and i will re-post it. Hopefully it's more straightforward and focused — igbins09, Aug 02 '23 at 18:13

Renaud Pacalet · Answer 1 · 2023-08-02T08:30:49.650

With any awk and any number of text blocks:

awk '/^HEADER/ {n++} {print>("File" n ".csv")}' FileA.csv

We simply increment the variable n (initialized to 0 by default) each time a line starts with HEADER, and we print every line with a redirection to a file named Filen.csv, where n is the current value of the counter.

Note: if other lines can also start with HEADER you can be more specific about the header regex (e.g., /^HEADER [[:digit:]]+ NUM$/).
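As a rough sketch of that stricter match, the snippet below builds a small made-up sample in a temporary directory, including a data row that happens to start with HEADER, and splits it with the more specific regex. The sample rows and the temporary directory are assumptions for illustration only. The regex expects single spaces inside the header line; if the real file has a tab before NUM, the pattern would need [[:space:]] there instead.

```shell
# Work in a scratch directory so no real files are touched (illustrative only).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample: the third line is data that also starts with "HEADER".
printf 'HEADER 1 NUM\nhour 1\t5\nHEADER notes\t1\nHEADER 2 NUM\nhour 1\t3\n' > FileA.csv

# Only lines of the exact form "HEADER <digits> NUM" start a new block.
awk '/^HEADER [[:digit:]]+ NUM$/ {n++} {print > ("File" n ".csv")}' FileA.csv

# Show what ended up in each output file.
head File1.csv File2.csv
```

With the looser /^HEADER/ pattern the "HEADER notes" row would have opened a third file; with the stricter one it stays in File1.csv as ordinary data.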

The output file names are File1.csv, File2.csv, ... If you absolutely want ABC.csv, DEF.csv, GHI.csv you can use:

awk -v f="ABC.csv DEF.csv GHI.csv" '
  BEGIN {split(f,files)} /^HEADER/ {n++} {print>files[n]}' FileA.csv

Explanations:

We pass the space-separated list of file names as variable f.
We split it on spaces and store it in array files.
When printing, instead of redirecting to file Filen.csv we redirect to entry number n of files array.

Note that if you have more text blocks than listed files, files[n] is empty for the extra blocks and awk will fail with a redirection error.
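One possible way to guard against that case is to fall back to a generated name once the list is exhausted. This is only a sketch: the extraN.csv fallback name, the count and out variables, and the four-block sample input are all made up for the illustration.

```shell
# Work in a scratch directory so no real files are touched (illustrative only).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample with four blocks but only three file names available.
printf 'HEADER 1 NUM\na\t1\nHEADER 2 NUM\nb\t2\nHEADER 3 NUM\nc\t3\nHEADER 4 NUM\nd\t4\n' > FileA.csv

awk -v f="ABC.csv DEF.csv GHI.csv" '
  BEGIN { count = split(f, files) }   # split() returns the number of names
  /^HEADER/ {
    n++
    # Use the listed name if there is one, else a numbered fallback name.
    out = (n <= count) ? files[n] : ("extra" n ".csv")
  }
  out != "" { print > out }           # skip lines that precede the first header
' FileA.csv

ls
```

Here the fourth block goes to extra4.csv instead of aborting the run. Whether a fallback file or a hard error is preferable depends on how strictly the input format is expected to hold.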

score 1 · Answer 2 · answered Aug 02 '23 at 04:23

Assuming the headers literally have "HEADER ..." lines as described, would you please try:

awk '
    BEGIN {                     # define filenames to write
        fname[1] = "ABC.csv"; fname[2] = "DEF.csv"; fname[3] = "GHI.csv"
    }
    /^HEADER/ {                 # reached the header line
        if (c >= 1) close(file) # close the previous file, if opened
        file = fname[++c]       # update the filename to write
    }
    {
        print > file            # write the line to the current file
    }
' FileA.csv

Btw the fact the file is growing seems to be unrelated with file splitting.

zedfoxus · Answer 3 · 2023-08-02T04:40:28.610

You can write something like this. You won't need awk, sed, or grep. Bash itself can do this for you.

test.sh

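#!/bin/bash

FILE=FileA.csv

OUTPUT=ABC.csv
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r CMD; do

    # switch the output file whenever a header line is seen
    if [[ "$CMD" == HEADER*1*NUM ]]; then
        OUTPUT=ABC.csv
    elif [[ "$CMD" == HEADER*2*NUM ]]; then
        OUTPUT=DEF.csv
    elif [[ "$CMD" == HEADER*3*NUM ]]; then
        OUTPUT=GHI.csv
    fi

    # append the current line to whichever file is selected
    echo "$CMD" >> "$OUTPUT"

done < "$FILE"

echo "Done"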

Let's run it

chmod 755 test.sh
./test.sh

Resulting files

ABC.csv

HEADER 1 NUM
hour 1   5
hour 2   10
hour 3   15

DEF.csv

HEADER 2 NUM
hour 1   3
hour 2   6
hour 3   9

GHI.csv

HEADER 3 NUM
age      23
bus      21
pig      07
dog      40

Explanation

We loop through each line of the file. If we see HEADER 1 NUM, we say that the lines should be written to ABC.csv. If the line has HEADER 2 NUM, we say that the lines should be written to DEF.csv, and so on.

Then we write the lines to the respective file.

For example

We read the first line. It has HEADER 1 NUM, which matches the glob pattern HEADER*1*NUM. So, we say that the output file should be ABC.csv
Then, we echo the line (stored in CMD variable) and send it to the output file, which we said was ABC.csv. The >> means append to ABC.csv file. So, HEADER 1 NUM gets written to that file
Then, we read the 2nd line. None of the if..elif..elif..fi statements match the next line. So, the next line is echo'ed and appended to ABC.csv
3rd line - same thing
When the line with HEADER 2 NUM appears, the 1st elif matches and the output file is changed to DEF.csv
The HEADER 2 NUM line is written to DEF.csv
The line following that is written to DEF.csv
That keeps on going until the HEADER 3 NUM line matches the 2nd elif. That's when the output file changes to GHI.csv
HEADER 3 NUM is written to GHI.csv
Subsequent lines are also written to GHI.csv

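If you want the ABC, DEF, and GHI files to be recreated from scratch on each run, you can write rm -f ABC.csv DEF.csv GHI.csv right before or after the FILE=FileA.csv line in the script. Because the script appends with >>, rerunning it without this step would add the same lines to the existing files again. That way, you are always getting brand new files.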
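To illustrate why the cleanup matters, here is a small sketch that runs the same splitting loop twice over a made-up sample in a temporary directory. It uses a POSIX case statement in place of the [[ ... ]] tests so it also runs under plain sh; the sample data, the temporary directory, and the two-pass loop are assumptions for the demonstration only.

```shell
# Work in a scratch directory so no real files are touched (illustrative only).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical, shortened sample input.
printf 'HEADER 1 NUM\nhour 1\t5\nHEADER 2 NUM\nhour 1\t3\nHEADER 3 NUM\nage\t23\n' > FileA.csv

# Run the split twice to mimic an hourly rerun.
for run in 1 2; do
    # Remove old outputs first; -f keeps rm quiet when they do not exist yet.
    rm -f ABC.csv DEF.csv GHI.csv

    OUTPUT=ABC.csv
    while IFS= read -r CMD; do
        case $CMD in
            HEADER*1*NUM) OUTPUT=ABC.csv ;;
            HEADER*2*NUM) OUTPUT=DEF.csv ;;
            HEADER*3*NUM) OUTPUT=GHI.csv ;;
        esac
        printf '%s\n' "$CMD" >> "$OUTPUT"
    done < FileA.csv
done

# Each output file holds one copy of its block, not two.
cat ABC.csv
```

Without the rm -f line, the second pass would append a second copy of every block to the files left over from the first pass.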

Please copy/paste that script into http://shellcheck.net and read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) and [correct-bash-and-shell-script-variable-capitalization](https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization) — Ed Morton, Aug 02 '23 at 10:12
