2

I have a big CSV file that I need to cut into different pieces based on the value in one of the columns. My input file dataset.csv is something like this:

NOTE: edited to clarify that data is ,data, no spaces.

action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC

So, to split by action_type I simply do (I need the whole matching line in the resulting file):

awk -F, '$2 ~ /^1$/ {print}' dataset.csv >> 1_dataset.csv
awk -F, '$2 ~ /^2$/ {print}' dataset.csv >> 2_dataset.csv

This works as expected but I am basicaly travesing my original dataset twice. My original dataset is about 5GB and I have 30 action_type categories. I need to do this everyday, so, I need to script the thing to run on its own efficiently.

I tried the following but it does not work:

# This is a file called myFilter.awk

{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}

Then I run it as:

awk -f myFilter.awk dataset.csv

But I get nothing. Literally nothing, no even errors. Which sort of tell me that my code is simply not matching anything or my print / pipe statement is wrong.

Wilmar
  • 558
  • 1
  • 5
  • 16
  • 2
    Is your field separator one `,` or one `,` followed by a space? – Cyrus Oct 20 '20 at 21:10
  • 2
    if you have a space after the comma and before the data (eg, `1` or `2`), and your `awk` input delimiter is just a comma, then your tests become `1` == `1`, which is 'false'; see [this](https://stackoverflow.com/a/20601021) and [this](https://stackoverflow.com/q/9175801) for ideas on trimming leading/trailing whitespace – markp-fuso Oct 20 '20 at 21:13
  • @Cyrus, it is a `,`. Thanks! – Wilmar Oct 20 '20 at 21:14
  • @markp-fuso, it does not have spaces. I just edited the question to make it clear. Thanks! – Wilmar Oct 20 '20 at 21:17
  • 1
    Do you want the header line included in every output file? Do you have GNU awk (`awk --version`)? – Ed Morton Oct 20 '20 at 22:06
  • @EdMorton, I don't mind about the header line. If I have it is OK, otherwise it is not a big deal. I do have GNU awk: 5.1.0 – Wilmar Oct 21 '20 at 01:20

3 Answers3

4

You may try this awk to do this in a single command:

awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
anubhava
  • 761,203
  • 64
  • 569
  • 643
  • 2
    Holy cow! This is brilliant @anubhava! It totally worked! Thank you! – Wilmar Oct 20 '20 at 21:23
  • 1
    That was an awesome way of contents extraction. The trick happens because of `print >> fn` part, which continuously appends the existed `fn` files as the original `file` progress. – George Vasiliou Oct 20 '20 at 21:28
  • 1
    Just make sure that any existing output files are deleted before running or `>>` will append to that existing file... – dawg Oct 20 '20 at 21:42
  • 1
    Hi @dawg. Yes, thanks for the reminder. I am also going to test other methods to find out which one is more efficient. In my 5GB file, this has been running for about 2 hours and I do not see it even close to be done (based on the size of the output files than at then their sum should match the size of the original file). – Wilmar Oct 20 '20 at 23:28
  • 1
    @Wilmar if you're looking for the fastest solution then update your question to say that. So far no-one has posted the fastest solution and chances are no-one will since you already accepted an answer. Also see my comment under your question and update your question to show the expected output given your posted sample input. – Ed Morton Oct 21 '20 at 00:48
  • 1
    Thank you @EdMorton. I am adding that as a separate question for, in fact, my objective was not speed in the initial question. – Wilmar Oct 21 '20 at 01:18
3

With GNU awk to handle many concurrently open files and without replicating the header line in each output file:

awk -F',' '{print > ($2 "_dataset.csv")}' dataset.csv

or if you also want the header line to show up in each output file then with GNU awk:

awk -F',' '
    NR==1 { hdr = $0; next }
    !seen[$2]++ { print hdr > ($2 "_dataset.csv") }
    { print > ($2 "_dataset.csv") }
' dataset.csv

or the same with any awk:

awk -F',' '
    NR==1 { hdr = $0; next }
    { out = $2 "_dataset.csv" }
    !seen[$2]++ { print hdr > out }
    { print >> out; close(out) }
' dataset.csv
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
1

As currently coded the input field separator has not been defined.

Current:

$ cat myfilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}

Invocation:

$ awk -f myfilter.awk dataset.csv

There are a couple ways to address this:

$ awk -v FS="," -f myfilter.awk dataset.csv

or

$ cat myfilter.awk
BEGIN {FS=","}
{
action_type=$2
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}

$ awk -f myfilter.awk dataset.csv
markp-fuso
  • 28,790
  • 4
  • 16
  • 36
  • 1
    Thank you very much @markp-fuso. This indeed works. I anubhava 'anubhava' answer as correct for it was beautifully concise and straightforward. – Wilmar Oct 20 '20 at 21:26
  • 1
    @Wilmar not a problem, anubhava is just too quick on that keyboard of his! :-) and I was more interested in pointing out the issue with the original code – markp-fuso Oct 20 '20 at 21:27
  • @mark-fuse. Absolutely, that helps a lot! Thanks again. – Wilmar Oct 20 '20 at 21:32