Split dynamic TSV/CSV file into 3 separate files using Bash script with Awk

Question

I have a dynamic csv/tsv file (tab delimiter) where a new row is added underneath Debit Count and Score Count every hour. No new rows will be added under Receipt Count, only the value changes each hour. Please see two examples below for reference.

Example of FileA.csv at the 3rd hour

Debit Count     VALUE
hour 1          5
hour 2          81
hour 3          15
Score Count
hour 1          31
hour 2          66
hour 3          9
Receipt Count
age logs        23
bus logs        21
pig logs        7
dog logs        40

Example of FileA.csv at the 7th hour

Debit Count     VALUE
hour 1          5
hour 2          81
hour 3          15
hour 4          20
hour 5          52
hour 6          33
hour 7          35
Score Count    
hour 1          31
hour 2          66
hour 3          9
hour 4          112
hour 5          15
hour 6          38
hour 7          21
Receipt Count  
age logs        13
bus logs        28
pig logs        85
dog logs        55

So what i'm trying to achieve is separate FileA.csv into ABC.csv , DEF.csv and GHI.csv keeping in mind that the rows underneath Debit Count and Score Count increases every hour. The new files ABC.csv,DEF.csv,GHI.csv will be REPLACED every hour

Using the 3rd hour example for reference to what i'm trying to achieve

ABC.csv

Debit Count     VALUE
hour 1          5
hour 2          81
hour 3          15

DEF.csv

Score Count  
hour 1          31
hour 2          66
hour 3          9

GHI.csv

Receipt Count
age logs        23
bus logs        21
pig logs        7
dog logs        40

What I tried doing (Edited)

awk f="ABC.csv DEF.csv GHI.csv" '
  BEGIN {split(f,files)} /^Debit/ /^Score/ /^Receipt/ {n++} {print>files[n]}' FileA.csv

This question was closed for lack of focus in a previous post and i was given the option to either edit or re-post the question. I decided to re-post the question with better clarity so others that may have seen it before could see it again. Thanks

`'lack of focus'` (aka `'needs more focus'`) is a catchall that includes ... show the code you've attempted and the (wrong) output generated by your code; and in this case it's not clear where the `7th hour` version of the file comes into play (eg, do we update/replace the ABC/DEF/GHI.csv files? do we create a new set of ABC/DEF/GHI.csv files? something else?) — markp-fuso, Aug 02 '23 at 19:10
It'd make more sense for whatever is producing this to use a database like SQLite instead of a tsv file (you could then readily export the data as tsv on demand) — Shawn, Aug 02 '23 at 19:12
That isn't a CSV. First, there are no comma separators (what is the field separator, TAB?). Second, a CSV should normally have the same columns in every row. — Barmar, Aug 02 '23 at 19:17
the question may be `'clearer'` to you because you are intimately familiar with the data and what you wish to accomplish; for the rest of us we are limited to what you put in the question (ie, we can't read your mind); the latest edit is less than 'clear' — markp-fuso, Aug 02 '23 at 19:19
@markp-fuso it has been updated. The 7th hour version is just an example of how the file will look like after 7 hours. and YES, the set of ABC/DEF/GHI.csv files will be REPLACED each hour — igbins09, Aug 02 '23 at 19:22
@Barmar I already stated in the question that its a TAB DELIMITED file — igbins09, Aug 02 '23 at 19:25
It's still too broad. Please read [ask]. You might also want to read [Why is "Can someone help me?" not an actual question?](//meta.stackoverflow.com/q/284236/4518341) But you could *extract* a useful question(s) out of this problem, for example, "How do I use sed to get lines starting from one spot up until another?" though I bet that's already been asked; from a quick search I found something similar: [How to select first occurrence between two patterns including them - Unix & Linux](//unix.stackexchange.com/q/180663/117037). — wjandrea, Aug 02 '23 at 19:26
Also, the AWK answers you got on the last question look good (though I don't know enough AWK to say for sure), so what was the problem you had trying to use them? — wjandrea, Aug 02 '23 at 19:26
BTW, that's not a CSV; CSVs have the headers only at the top of the file. — wjandrea, Aug 02 '23 at 19:26
Honestly, I think all in all (and esp with the comments), the question is clear enough to be answered. If an assumption needs to be made, the answerer can do so (but that sort of thing is normal). I'd stop the downvoting. — Lover of Structure, Aug 02 '23 at 19:31
Hi. From my read of the question, the circumstance of the file changing every hour seems irrelevant: the program (you seek) doesn't need to know what happened before, it just needs to work on the input in the here-and-now. — Zach Young, Aug 02 '23 at 19:31
@wjandrea The header at the top of the file is ```Debit Count VALUE``` . if you notice there's no ```VALUE``` next to ```score count``` and ```receipt count``` . Also, my question is basically asking how to use awk/sed/grep to split the csv file into 3 separate files — igbins09, Aug 02 '23 at 19:33
@igbins09 I'm saying *only* at the top of the file. `Score Count` and `Receipt Count` are additional headers. — wjandrea, Aug 02 '23 at 19:34
*But* limiting yourself to one tool (say, `awk`, since the first answer below uses it) is better – otherwise the question can be regarded as "too broad". — Lover of Structure, Aug 02 '23 at 19:35
you've mentioned attempts at using `grep` and `sed` (along with `cut`) sooooo, include your coding attempts (and a description of how they failed to generate the desired results) — markp-fuso, Aug 02 '23 at 19:36

score 5 · Accepted Answer · answered Aug 02 '23 at 19:22

5

Use awk to change the output file when you get to each header line.

awk '/Debit Count/ { of="ABC.csv" }
     /Score Count/ { of="DEF.csv" }
     /Receipt Count/ { of="GHI.csv" }
     {print >of}' FileA.csv

answered Aug 02 '23 at 19:22

Barmar

741,623
53
500
612

score 3 · Answer 2 · answered Aug 03 '23 at 07:48

This might work for you (GNU csplit):

csplit -szfX -n1 file '/^[DSR]/' '{*}' && mv X0 ABC.csv && mv X1 DEF.csv && mv X2 GHI.csv

Split file based on the first character of the headers. The -z option elides the empty first file and the -s keeps the output silent. The remaining options are unnecessary but make for a neat one-liner.

Ed Morton · Answer 3 · 2023-08-03T23:53:54.217

Your code under "What I tried doing (Edited)":

awk f="ABC.csv DEF.csv GHI.csv" '
  BEGIN {split(f,files)} /^Debit/ /^Score/ /^Receipt/ {n++} {print>files[n]}' FileA.csv

was close, it should have been:

awk -v f="ABC.csv DEF.csv GHI.csv" '
  BEGIN {split(f,files)} /^Debit/ || /^Score/ || /^Receipt/ {n++} {print>files[n]}' FileA.csv

or:

awk -v f="ABC.csv DEF.csv GHI.csv" '
  BEGIN {split(f,files)} /^(Debit|Score|Receipt)/ {n++} {print>files[n]}' FileA.csv

By the way, your input file is neither CSV, nor TSV. It's not CSV as it's not comma-separated, it's tab-separated, and it's neither because both formats only have 1 header row as the first line, they don't have multiple header rows within the file. So it's just a plain old text file, with some headers sprinkled throughout and tab-separated name-value pairs under the headers and so your input file should be named FileA.txt, not FileA.csv nor FileA.tsv.

Even your 2nd 2 output files aren't TSV since the header line is just a string about the file, not the names of the 2 columns in the file - if you changed Receipt<blank>Count to Receipt<tab>Count or Receipt Count<tab>Value or similar to have column names THEN it'd be valid TSV and you could name your output files GHI.tsv, etc. The file suffixes matter for other people or subsequent tools understanding their contents.

You also don't actually need to test for the strings Debit, etc. you just need to change output file every time $2 is empty and I don't see any benefit to passing the output file names in with a variable rather than just making them a string in the script.

So your code should really be:

$ awk '
    BEGIN {
        split("ABC.tsv DEF.tsv GHI.tsv",files)
        FS = OFS = "\t"
    }
    NR == 1  { hd=$2; n++ }
    $2 == "" { $2=hd; n++ }
    { print > files[n] }
' FileA.txt

$ head *.tsv
==> ABC.tsv <==
Debit Count     VALUE
hour 1  5
hour 2  81
hour 3  15
hour 4  20
hour 5  52
hour 6  33
hour 7  35

==> DEF.tsv <==
Score Count     VALUE
hour 1  31
hour 2  66
hour 3  9
hour 4  112
hour 5  15
hour 6  38
hour 7  21

==> GHI.tsv <==
Receipt Count   VALUE
age logs        13
bus logs        28
pig logs        85
dog logs        55

thanks for this. this clears up the whole TSV/CSV issue. i should have just copied and pasted from my editor but i manually typed it all out hence why it appeared as a .txt form and i could see how that was misleading as well — igbins09, Aug 03 '23 at 20:13

Split dynamic TSV/CSV file into 3 separate files using Bash script with Awk

3 Answers3