
I have a CSV file which contains data like the following:-

1,275,,,275,17.3,0,"2011-05-09 20:21:45"
2,279,,,279,17.3,0,"2011-05-10 20:21:52"
3,276,,,276,17.3,0,"2011-05-11 20:21:58"
4,272,,,272,17.3,0,"2011-05-12 20:22:04"
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"

This file contains 4432986 rows of data.

I wish to split the file out basing the new file name on the date in the last column.

Therefore, based on the data above, I would want 6 new files, with the rows for each day in each file.

I would like the files named in YYYY_MM_DD format.

I would also like to ignore the first column in the output data.

So file 2011_05_13 would contain the following rows, with the first column excluded:-

272,,,272,17.3,0,"2011-05-13 20:22:10"
278,,,278,17.3,0,"2011-05-13 20:24:08"
270,,,270,17.3,0,"2011-05-13 20:24:14"

I am planning on doing this on a Linux box, so anything using standard Linux utilities (sed, awk, etc.) would be cool.

general exception

6 Answers


Here's a one-liner for you in awk:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

Desired output achieved, although perhaps some of this code could be made more succinct. HTH.

EDIT:

Read code from left to right:

  • -F ","
    This sets the field delimiter to a comma.

  • split ($8,array," ")
    This splits the eighth column on the space and puts this info in an array called array.

  • sub ("\"","",array[1])
    We take the first array element (the piece that's going to become our output file name) and substitute out the leading " symbol (we need to escape the " symbol, so we put the \ character in front).

  • sub (NR,"",$0)
    This conveniently removes the line number from the beginning of each line (NR is the record number and $0 is of course the whole line of input before delimitation). Note this only works because the first column happens to equal NR; sub() removes the first occurrence of that number anywhere in the line.

  • sub (",","",$0)
    This removes the comma after the row number.

  • Now that we have a clean filename and a clean row of data we can write $0 to array[1]: print $0 > array[1].

FIX:

So if you'd prefer an underscore instead of a hyphen, all we need to fix is array[1]. I've just added in a global substitution: gsub ("-","_",array[1]).

The updated code is:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); gsub ("-","_",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

HTH.
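As a quick sanity check, here is the updated one-liner run over the nine sample rows from the question (file.txt and the 2011_05_13 output name come straight from the data; nothing else is assumed):

```shell
# write the nine sample rows from the question to file.txt
printf '%s\n' \
  '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' \
  '2,279,,,279,17.3,0,"2011-05-10 20:21:52"' \
  '3,276,,,276,17.3,0,"2011-05-11 20:21:58"' \
  '4,272,,,272,17.3,0,"2011-05-12 20:22:04"' \
  '5,272,,,272,17.3,0,"2011-05-13 20:22:10"' \
  '6,278,,,278,17.3,0,"2011-05-13 20:24:08"' \
  '7,270,,,270,17.3,0,"2011-05-13 20:24:14"' \
  '8,269,,,269,17.3,0,"2011-05-14 20:24:20"' \
  '9,278,,,278,17.3,0,"2011-05-14 20:24:26"' > file.txt

# split into one file per day, named YYYY_MM_DD, first column stripped
awk -F "," '{ split($8,array," "); sub("\"","",array[1]); gsub("-","_",array[1]); sub(NR,"",$0); sub(",","",$0); print $0 > array[1] }' file.txt

cat 2011_05_13
# 272,,,272,17.3,0,"2011-05-13 20:22:10"
# 278,,,278,17.3,0,"2011-05-13 20:24:08"
# 270,,,270,17.3,0,"2011-05-13 20:24:14"
```

One thing to keep in mind on the real 4432986-row file: awk keeps each output file open, so with many distinct dates you may want to close(array[1]) after each print, or use gawk, which manages its file descriptors for you.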

Steve

You can use this awk command:

awk -F, 'BEGIN{OFS=",";} {dt=$8; gsub(/^"| .*"$/,"", dt);
$1=""; sub(/^,/, "", $0); print $0 > dt}' input.txt
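Note this writes the output files with hyphenated names (2011-05-13 and so on); adding a gsub("-","_",dt) would give the underscore format asked for. A quick demo on three of the sample rows:

```shell
# three sample rows for 2011-05-13, written to input.txt
printf '%s\n' \
  '5,272,,,272,17.3,0,"2011-05-13 20:22:10"' \
  '6,278,,,278,17.3,0,"2011-05-13 20:24:08"' \
  '7,270,,,270,17.3,0,"2011-05-13 20:24:14"' > input.txt

# dt keeps only the date; blanking $1 and stripping the leading comma drops the id column
awk -F, 'BEGIN{OFS=",";} {dt=$8; gsub(/^"| .*"$/,"", dt);
$1=""; sub(/^,/, "", $0); print $0 > dt}' input.txt

cat 2011-05-13
# 272,,,272,17.3,0,"2011-05-13 20:22:10"
# 278,,,278,17.3,0,"2011-05-13 20:24:08"
# 270,,,270,17.3,0,"2011-05-13 20:24:14"
```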
anubhava

A scripting language (perl/python) is likely your best choice here, but I liked the challenge of doing this in bash, so here it is.

 cat bigfile.txt | while read LINE; do
   echo "$LINE" >> `echo "$LINE" | cut -d, -f8 | cut -c2-11`.txt
 done

Basically, this reads the file line by line in the while loop, then appends each line to a file based on its date.

The date is pulled out with a combination of two cut commands. The first cut pulls the last column (column 8) off using a comma delimiter (-d,), then the second cut pulls just the date by skipping the leading " and keeping up through character 11.
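For example, on the first sample row the two cuts peel the date out like this:

```shell
LINE='1,275,,,275,17.3,0,"2011-05-09 20:21:45"'

# eighth comma-separated field, quotes still attached
echo "$LINE" | cut -d, -f8
# "2011-05-09 20:21:45"

# characters 2-11 skip the opening quote and keep just the date
echo "$LINE" | cut -d, -f8 | cut -c2-11
# 2011-05-09
```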


Now, to tackle the removal of the first column:

cat bigfile.txt | sed 's/^[^,]*,//'

This removes everything up to and including the first comma (sed has no non-greedy .*?, so we match a run of non-comma characters instead).

So, we'll replace the beginning of our while loop with this (note the date is now column 7, since the first column has been stripped), leaving us with:

 cat bigfile.txt | sed 's/^[^,]*,//' | while read LINE; do
   echo "$LINE" >> `echo "$LINE" | cut -d, -f7 | cut -c2-11`.txt
 done
Donald Miner
  • Nice answer, but I prefer cut over sed for removing that first column. – Demosthenex Apr 18 '12 at 21:01
  • Based on this, have managed to do it like this:- head -10 out.csv | sed 1d | cut -d, -f2- | while read LINE; do echo $LINE >> `echo $LINE | cut -d, -f7 | cut -c2-11 | sed s/-/_/g`; done any improvements ??? – general exception Apr 18 '12 at 21:17
  • Ignore the head -10, only for testing purposes. Had to use sed 1d to remove the first row of file, and sed at the end to replace dashes with underscore. Anyone suggest a more elegant solution ? – general exception Apr 18 '12 at 21:17
  • 1
    This solution opens the result files once per written line, but only reads the input file once. If your input file read speed is somewhat decent (KB range or higher), you'll want to go with a sequence of grep's like @Demosthenex suggested. – thiton Apr 18 '12 at 21:56
  • Although this method works, it is quite slow as per @thiton explains. I have marked the awk solution by steve as the answer as this seems closer to what i was actually looking for. – general exception Apr 19 '12 at 09:04

This monstrosity grabs all the unique dates and then greps for those keys in the original file saving them to files named by that key. Yes, useless use of cat, but trying to atomize the actions.

cat records.txt \
| cut -f8 -d, \
| cut -f1 -d ' ' \
| tr -d '"' \
| sort -u \
| while read DATE ; do \
    cat records.txt \
    | cut -f2- -d, \
    | egrep ",\"${DATE} [0-9]{2}:[0-9]{2}:[0-9]{2}\"" \
    > ${DATE}.txt
done
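To see what the first half of the pipeline (everything before the while) produces, here it is on a few of the sample rows, written to records.txt for the demo:

```shell
printf '%s\n' \
  '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' \
  '5,272,,,272,17.3,0,"2011-05-13 20:22:10"' \
  '6,278,,,278,17.3,0,"2011-05-13 20:24:08"' \
  '8,269,,,269,17.3,0,"2011-05-14 20:24:20"' > records.txt

# timestamp field -> date part -> strip quotes -> unique, sorted date keys
cut -f8 -d, records.txt | cut -f1 -d ' ' | tr -d '"' | sort -u
# 2011-05-09
# 2011-05-13
# 2011-05-14
```

The while loop then re-reads records.txt once per date key, which is why this approach trades extra input reads for writing each output file in one sequential pass.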
Demosthenex
  • 1
    No it doesn't match your date format, but that'd be trivial to rename. However given the number of files you may have I'd suggest you reconsider staying in YYYY/MM/DD format. – Demosthenex Apr 18 '12 at 21:00
  • 1
    ah, i overlooked that too. i prefer the year first anyways so it sorts nicely. – Donald Miner Apr 18 '12 at 21:02

It can be as simple as:

$ sed 's/^[0-9]*,//' your_gigantic_data.csv
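This strips the leading id column but doesn't do the per-day split, so it composes with the other answers rather than replacing them. On a sample row:

```shell
echo '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' | sed 's/^[0-9]*,//'
# 275,,,275,17.3,0,"2011-05-09 20:21:45"
```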
allenhwkim

This might work for you:

sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/' file | sh

or GNU sed:

sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/e' file
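The substitution rewrites each row into an echo … >>YYYY_MM_DD.csv command, which the trailing sh (or GNU sed's e flag) then executes. One caveat worth knowing (my note, not part of the answer): the shell strips the double quotes around the timestamp when it re-echoes the row:

```shell
printf '%s\n' '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' > file

# each row becomes a command like: echo 275,... >>2011_05_09.csv
sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/' file | sh

cat 2011_05_09.csv
# 275,,,275,17.3,0,2011-05-09 20:21:45
```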
potong