
I have seen similar questions all over, but none seems close to what I'm trying to achieve.

I have a dynamic CSV file (tab-delimited) that gets appended to each hour. Note that only the number of rows underneath HEADER 1 and HEADER 2 increases every hour. Please see the two examples below for reference.

Example of FileA.csv at the 3rd hour

HEADER 1 NUM
hour 1   5
hour 2   10
hour 3   15
HEADER 2 NUM
hour 1   3
hour 2   6
hour 3   9
HEADER 3 NUM
age      23
bus      21
pig      07
dog      40

Example of FileA.csv at the 7th hour

HEADER 1 NUM
hour 1   5
hour 2   10
hour 3   15
hour 4   20
hour 5   25
hour 6   30
hour 7   35
HEADER 2 NUM
hour 1   3
hour 2   6
hour 3   9
hour 4   12
hour 5   15
hour 6   18
hour 7   21
HEADER 3 NUM
age      13
bus      28
pig      85
dog      55

The rows underneath Header 1 and Header 2 increase each hour. Header 3 and below is the only thing that remains constant.

So what I'm trying to achieve is simply to separate FileA.csv into ABC.csv, DEF.csv, and GHI.csv.

Using the 3rd-hour example as a reference for what I'm trying to achieve:

ABC.csv

HEADER 1 NUM
hour 1   5
hour 2   10
hour 3   15

DEF.csv

HEADER 2 NUM
hour 1   3
hour 2   6
hour 3   9

GHI.csv

HEADER 3 NUM
age      23
bus      21
pig      07
dog      40

Below is what I tried to do using grep, but I couldn't combine grep and cut to achieve this. I've also tried sed, but I'm not sure how to cut and move the text after searching. I know this can be achieved with awk, but I'm not strong in awk.

  1. First, cut out HEADER 3 and all subsequent rows below it and put them into GHI.csv, since that will always be constant. That way we are left with HEADER 1 and HEADER 2.
  2. Then cut out HEADER 2 and everything below it by searching for the header name and cutting it out together with all subsequent rows beneath it.
  3. Finally, we are left with HEADER 1, which we either leave in FileA.csv or move to ABC.csv.

Please help. Thanks.

igbins09
  • _Header 3 and below is the only thing that remains constant_: The examples you show contradict this statement. After 7 hours the `HEADER 3 NUM` block is not the same as after 3 hours. – Renaud Pacalet Aug 02 '23 at 05:42
  • From the answers you received, were you able to do what you wanted to do? I am surprised that this question was closed. It seemed like a straightforward and clear question to me. – zedfoxus Aug 02 '23 at 16:31
  • @zedfoxus not yet. i'm currently editing the question and i will re-post it. Hopefully it's more straightforward and focused – igbins09 Aug 02 '23 at 18:13

3 Answers


With any awk and any number of text blocks:

awk '/^HEADER/ {n++} {print>("File" n ".csv")}' FileA.csv

We simply increment variable n (initialized to 0 by default) each time a line starts with HEADER, and we print every line with redirection to a file named Filen.csv.

Note: if other lines can also start with HEADER you can be more specific about the header regex (e.g., /^HEADER [[:digit:]]+ NUM$/).
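As a minimal sketch of that stricter header match, the snippet below builds a small, made-up input in a hypothetical split_demo_regex directory and splits it. The dir variable, the directory name, and the HEADER LIGHTS row are assumptions added for illustration. It uses [0-9] and [ \t] rather than POSIX classes only because some older awk builds lack them, and it accepts either spaces or tabs between the fields.

```shell
# Hypothetical demo directory and sample data (not from the original question).
demo=split_demo_regex
mkdir -p "$demo"
printf 'HEADER 1\tNUM\nhour 1\t5\nhour 2\t10\nHEADER 2\tNUM\nhour 1\t3\nHEADER 3\tNUM\nHEADER LIGHTS\t4\nage\t23\n' > "$demo/FileA.csv"

# Only lines of the form "HEADER <digits> NUM" start a new block.
# The data row "HEADER LIGHTS 4" therefore stays inside block 3.
awk -v dir="$demo" '
  /^HEADER[ \t]+[0-9]+[ \t]+NUM$/ { n++ }
  { print > (dir "/File" n ".csv") }
' "$demo/FileA.csv"
```

With the looser /^HEADER/ pattern, the HEADER LIGHTS row would have started a fourth file.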

The output file names are File1.csv, File2.csv, ... If you absolutely want ABC.csv, DEF.csv, GHI.csv you can use:

awk -v f="ABC.csv DEF.csv GHI.csv" '
  BEGIN {split(f,files)} /^HEADER/ {n++} {print>files[n]}' FileA.csv

Explanations:

  • We pass the space-separated list of file names as variable f.
  • We split it on spaces and store it in array files.
  • When printing, instead of redirecting to file Filen.csv we redirect to entry number n of files array.

Note that if you have more text blocks than listed file names, files[n] is empty for the extra blocks and awk fails with an error on the redirection.
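One possible way to avoid that error, sketched here under assumptions, is to fall back to a numbered name once the list runs out. The nfiles, out, and dir variables, the split_demo_names directory, and the four-block sample input are all hypothetical and not part of the answer above.

```shell
# Hypothetical demo directory and sample data with four blocks but only three names.
demo=split_demo_names
mkdir -p "$demo"
printf 'HEADER 1\tNUM\nhour 1\t5\nHEADER 2\tNUM\nhour 1\t3\nHEADER 3\tNUM\nage\t23\nHEADER 4\tNUM\ncat\t9\n' > "$demo/FileA.csv"

awk -v f="ABC.csv DEF.csv GHI.csv" -v dir="$demo" '
  BEGIN { nfiles = split(f, files) }          # split() returns the number of names
  /^HEADER/ {
      n++
      # Use the listed name while one exists, otherwise fall back to File<n>.csv.
      out = dir "/" ((n <= nfiles) ? files[n] : "File" n ".csv")
  }
  out != "" { print > out }                   # lines before the first header are skipped
' "$demo/FileA.csv"
```

Here the fourth block ends up in File4.csv instead of aborting the run. Whether a fallback or a hard failure is preferable depends on how strictly the input format is expected to hold.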

Renaud Pacalet

Assuming the headers literally have "HEADER ..." lines as described, would you please try:

awk '
    BEGIN {                     # define filenames to write
        fname[1] = "ABC.csv"; fname[2] = "DEF.csv"; fname[3] = "GHI.csv"
    }
    /^HEADER/ {                 # reached the header line
        if (c >= 1) close(file) # close the previous file, if opened
        file = fname[++c]       # update the filename to write
    }
    {
        print > file            # write the line to the current file
    }
' FileA.csv

By the way, the fact that the file is growing seems to be unrelated to the file splitting.

tshiono

You can write something like this. You won't need awk, sed, or grep. Bash itself can do this for you.

test.sh

#!/bin/bash

FILE=FileA.csv

OUTPUT=ABC.csv
while IFS= read -r CMD; do    # IFS= and -r keep whitespace and backslashes intact

    if [[ "$CMD" == HEADER*1*NUM ]]; then
        OUTPUT=ABC.csv
    elif [[ "$CMD" == HEADER*2*NUM ]]; then
        OUTPUT=DEF.csv
    elif [[ "$CMD" == HEADER*3*NUM ]]; then
        OUTPUT=GHI.csv
    fi

    echo "$CMD" >> "$OUTPUT"

done < "$FILE"

echo "Done"

Let's run it

chmod 755 test.sh
./test.sh

Resulting files

ABC.csv

HEADER 1 NUM
hour 1   5
hour 2   10
hour 3   15

DEF.csv

HEADER 2 NUM
hour 1   3
hour 2   6
hour 3   9

GHI.csv

HEADER 3 NUM
age      23
bus      21
pig      07
dog      40

Explanation

We loop through each line of the file. If we see HEADER 1 NUM, we say that the lines should be written to ABC.csv. If the line has HEADER 2 NUM, we say that the lines should be written to DEF.csv, and so on.

Then we write the lines to the respective file.

For example

  • We read the first line. It has HEADER 1 NUM, which matches the glob pattern HEADER*1*NUM. So, we say that the output file should be ABC.csv
  • Then, we echo the line (stored in CMD variable) and send it to the output file, which we said was ABC.csv. The >> means append to ABC.csv file. So, HEADER 1 NUM gets written to that file
  • Then, we read the 2nd line. None of the if..elif..elif..fi statements match the next line. So, the next line is echo'ed and appended to ABC.csv
  • 3rd line - same thing
  • When the line with HEADER 2 NUM appears, the 1st elif meets that criteria and the output file is changed to DEF.csv
  • The HEADER 2 NUM line is written to DEF.csv
  • The line following that is written to DEF.csv
  • That keeps on going until the HEADER 3 NUM line matches the 2nd elif. That's when the output file changes to GHI.csv
  • HEADER 3 NUM is written to GHI.csv
  • Subsequent lines are also written to GHI.csv

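Because the script appends with >>, running it again adds the same lines to the existing files. If you want the ABC, DEF, and GHI files to be removed first, you can write rm -f ABC.csv DEF.csv GHI.csv right before or after the FILE=FileA.csv line in the script. That way, you always get brand new files.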
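The effect of removing the old output first can be sketched as follows. This is a POSIX sh variant that uses case instead of [[ ]], so it is a swapped-in form of the loop and not the script above verbatim. The split_blocks function, the split_demo_rerun directory, and the sample data are assumptions made for the demonstration.

```shell
# Hypothetical demo directory and sample data.
demo=split_demo_rerun
mkdir -p "$demo"
printf 'HEADER 1\tNUM\nhour 1\t5\nHEADER 2\tNUM\nhour 1\t3\nHEADER 3\tNUM\nage\t23\n' > "$demo/FileA.csv"

split_blocks() {
    # Remove old output so a re-run does not append duplicate lines.
    rm -f "$demo/ABC.csv" "$demo/DEF.csv" "$demo/GHI.csv"
    output=$demo/ABC.csv
    while IFS= read -r line; do
        # Switch the target file whenever a header line is seen.
        case $line in
            HEADER*1*NUM) output=$demo/ABC.csv ;;
            HEADER*2*NUM) output=$demo/DEF.csv ;;
            HEADER*3*NUM) output=$demo/GHI.csv ;;
        esac
        printf '%s\n' "$line" >> "$output"
    done < "$demo/FileA.csv"
}

split_blocks
split_blocks   # simulate the next hourly run on the same input
```

After both calls each output file still holds a single copy of its block. Without the rm -f line, the second call would double every file.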

zedfoxus
  • Please copy/paste that script into http://shellcheck.net and read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) and [correct-bash-and-shell-script-variable-capitalization](https://stackoverflow.com/questions/673055/correct-bash-and-shell-script-variable-capitalization) – Ed Morton Aug 02 '23 at 10:12