
I have a CSV file which contains data like the following:-

1,275,,,275,17.3,0,"2011-05-09 20:21:45"
2,279,,,279,17.3,0,"2011-05-10 20:21:52"
3,276,,,276,17.3,0,"2011-05-11 20:21:58"
4,272,,,272,17.3,0,"2011-05-12 20:22:04"
5,272,,,272,17.3,0,"2011-05-13 20:22:10"
6,278,,,278,17.3,0,"2011-05-13 20:24:08"
7,270,,,270,17.3,0,"2011-05-13 20:24:14"
8,269,,,269,17.3,0,"2011-05-14 20:24:20"
9,278,,,278,17.3,0,"2011-05-14 20:24:26"

This file contains 4432986 rows of data.

I wish to split the file out basing the new file name on the date in the last column.

Therefore, based on the data above, I would want 6 new files, with the rows for each day in each file.

I would like the files named in YYYY_MM_DD format.

I would also like to ignore the first column in the output data.

So file 2011_05_13 would contain the following rows, with the first column excluded:-

272,,,272,17.3,0,"2011-05-13 20:22:10"
278,,,278,17.3,0,"2011-05-13 20:24:08"
270,,,270,17.3,0,"2011-05-13 20:24:14"

I am planning on doing this on a Linux box, so anything using standard Linux utilities (sed, awk, etc.) would be cool.

general exception

6 Answers


Here's a one-liner for you in awk:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

Desired output achieved, although perhaps some of this code could be made more succinct. HTH.

EDIT:

Read code from left to right:

  • -F ","
    This sets the field delimiter to a comma.

  • split ($8,array," ")
    This splits the eighth column on the space and puts this info in an array called array.

  • sub ("\"","",array[1])
    We take the first array element (the piece that's going to become our output file name) and substitute out the leading " symbol (we need to escape the " symbol, so we put the \ character in front).

  • sub (NR,"",$0)
    This conveniently removes the line number from the beginning of each line (NR is the record number and $0 is of course the whole line of input before delimitation). Note this only works because the first column happens to equal NR; sub() removes the first occurrence of that number anywhere in the line.

  • sub (",","",$0)
    This removes the comma after the row number.

  • Now that we have a clean filename and a clean row of data we can write $0 to array[1]: print $0 > array[1].

FIX:

So if you'd prefer an underscore instead of a hyphen, all we need to fix is array[1]. I've just added in a global substitution: gsub ("-","_",array[1]).

The updated code is:

awk -F "," '{ split ($8,array," "); sub ("\"","",array[1]); gsub ("-","_",array[1]); sub (NR,"",$0); sub (",","",$0); print $0 > array[1] }' file.txt

HTH.
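As a quick sanity check, here is the updated one-liner run over the nine sample rows from the question (file.txt and the 2011_05_13 output name come straight from the data; nothing else is assumed):

```shell
# write the nine sample rows from the question to file.txt
printf '%s\n' \
  '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' \
  '2,279,,,279,17.3,0,"2011-05-10 20:21:52"' \
  '3,276,,,276,17.3,0,"2011-05-11 20:21:58"' \
  '4,272,,,272,17.3,0,"2011-05-12 20:22:04"' \
  '5,272,,,272,17.3,0,"2011-05-13 20:22:10"' \
  '6,278,,,278,17.3,0,"2011-05-13 20:24:08"' \
  '7,270,,,270,17.3,0,"2011-05-13 20:24:14"' \
  '8,269,,,269,17.3,0,"2011-05-14 20:24:20"' \
  '9,278,,,278,17.3,0,"2011-05-14 20:24:26"' > file.txt

# split into one file per day, named YYYY_MM_DD, first column stripped
awk -F "," '{ split($8,array," "); sub("\"","",array[1]); gsub("-","_",array[1]); sub(NR,"",$0); sub(",","",$0); print $0 > array[1] }' file.txt

cat 2011_05_13
# 272,,,272,17.3,0,"2011-05-13 20:22:10"
# 278,,,278,17.3,0,"2011-05-13 20:24:08"
# 270,,,270,17.3,0,"2011-05-13 20:24:14"
```

One thing to keep in mind on the real 4432986-row file: awk keeps each output file open, so with many distinct dates you may want to close(array[1]) after each print, or use gawk, which manages its file descriptors for you.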

Steve

You can use this awk command:

awk -F, 'BEGIN{OFS=",";} {dt=$8; gsub(/^"| .*"$/,"", dt);
$1=""; sub(/^,/, "", $0); print $0 > dt}' input.txt
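Note this writes the output files with hyphenated names (2011-05-13 and so on); adding a gsub("-","_",dt) would give the underscore format asked for. A quick demo on three of the sample rows:

```shell
# three sample rows for 2011-05-13, written to input.txt
printf '%s\n' \
  '5,272,,,272,17.3,0,"2011-05-13 20:22:10"' \
  '6,278,,,278,17.3,0,"2011-05-13 20:24:08"' \
  '7,270,,,270,17.3,0,"2011-05-13 20:24:14"' > input.txt

# dt keeps only the date; blanking $1 and stripping the leading comma drops the id column
awk -F, 'BEGIN{OFS=",";} {dt=$8; gsub(/^"| .*"$/,"", dt);
$1=""; sub(/^,/, "", $0); print $0 > dt}' input.txt

cat 2011-05-13
# 272,,,272,17.3,0,"2011-05-13 20:22:10"
# 278,,,278,17.3,0,"2011-05-13 20:24:08"
# 270,,,270,17.3,0,"2011-05-13 20:24:14"
```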
anubhava

A scripting language (perl/python) is likely your best choice here, but I liked the challenge of doing this in bash, so here it is.

 cat bigfile.txt | while read LINE; do
   echo "$LINE" >> `echo "$LINE" | cut -d, -f8 | cut -c2-11`.txt
 done

Basically, this reads the file line by line in the while loop, then appends each line to a file based on its date.

The date is pulled out with a combination of two cut commands. The first cut pulls the last column (column 8) off using a comma delimiter (-d,), then the second cut pulls just the date by skipping the leading " and keeping up through character 11.
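For example, on the first sample row the two cuts peel the date out like this:

```shell
LINE='1,275,,,275,17.3,0,"2011-05-09 20:21:45"'

# eighth comma-separated field, quotes still attached
echo "$LINE" | cut -d, -f8
# "2011-05-09 20:21:45"

# characters 2-11 skip the opening quote and keep just the date
echo "$LINE" | cut -d, -f8 | cut -c2-11
# 2011-05-09
```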


Now, to tackle the removal of the first column:

cat bigfile.txt | sed 's/^[^,]*,//'

This removes everything up to and including the first comma (sed has no non-greedy .*?, so we match a run of non-comma characters instead).

So, we'll replace the beginning of our while loop with this (note the date is now column 7, since the first column has been stripped), leaving us with:

 cat bigfile.txt | sed 's/^[^,]*,//' | while read LINE; do
   echo "$LINE" >> `echo "$LINE" | cut -d, -f7 | cut -c2-11`.txt
 done
Donald Miner
  • Nice answer, but I prefer cut over sed for removing that first column. – Demosthenex Apr 18 '12 at 21:01
  • Based on this, have managed to do it like this:- head -10 out.csv | sed 1d | cut -d, -f2- | while read LINE; do echo $LINE >> `echo $LINE | cut -d, -f7 | cut -c2-11 | sed s/-/_/g`; done any improvements ??? – general exception Apr 18 '12 at 21:17
  • Ignore the head -10, only for testing purposes. Had to use sed 1d to remove the first row of file, and sed at the end to replace dashes with underscore. Anyone suggest a more elegant solution ? – general exception Apr 18 '12 at 21:17
  • 1
    This solution opens the result files once per written line, but only reads the input file once. If your input file read speed is somewhat decent (KB range or higher), you'll want to go with a sequence of grep's like @Demosthenex suggested. – thiton Apr 18 '12 at 21:56
  • Although this method works, it is quite slow as per @thiton explains. I have marked the awk solution by steve as the answer as this seems closer to what i was actually looking for. – general exception Apr 19 '12 at 09:04

This monstrosity grabs all the unique dates and then greps for those keys in the original file saving them to files named by that key. Yes, useless use of cat, but trying to atomize the actions.

cat records.txt \
| cut -f8 -d, \
| cut -f1 -d ' ' \
| tr -d '"' \
| sort -u \
| while read DATE ; do \
    cat records.txt \
    | cut -f2- -d, \
    | egrep ",\"${DATE} [0-9]{2}:[0-9]{2}:[0-9]{2}\"" \
    > ${DATE}.txt
done
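To see what the first half of the pipeline (everything before the while) produces, here it is on a few of the sample rows, written to records.txt for the demo:

```shell
printf '%s\n' \
  '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' \
  '5,272,,,272,17.3,0,"2011-05-13 20:22:10"' \
  '6,278,,,278,17.3,0,"2011-05-13 20:24:08"' \
  '8,269,,,269,17.3,0,"2011-05-14 20:24:20"' > records.txt

# timestamp field -> date part -> strip quotes -> unique, sorted date keys
cut -f8 -d, records.txt | cut -f1 -d ' ' | tr -d '"' | sort -u
# 2011-05-09
# 2011-05-13
# 2011-05-14
```

The while loop then re-reads records.txt once per date key, which is why this approach trades extra input reads for writing each output file in one sequential pass.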
Demosthenex
  • 1
    No it doesn't match your date format, but that'd be trivial to rename. However given the number of files you may have I'd suggest you reconsider staying in YYYY/MM/DD format. – Demosthenex Apr 18 '12 at 21:00
  • 1
    ah, i overlooked that too. i prefer the year first anyways so it sorts nicely. – Donald Miner Apr 18 '12 at 21:02

It can be as simple as:

$ sed 's/^[0-9]*,//' your_gigantic_data.csv
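This strips the leading id column but doesn't do the per-day split, so it composes with the other answers rather than replacing them. On a sample row:

```shell
echo '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' | sed 's/^[0-9]*,//'
# 275,,,275,17.3,0,"2011-05-09 20:21:45"
```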
allenhwkim

This might work for you:

sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/' file | sh

or GNU sed:

sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/e' file
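The substitution rewrites each row into an echo … >>YYYY_MM_DD.csv command, which the trailing sh (or GNU sed's e flag) then executes. One caveat worth knowing (my note, not part of the answer): the shell strips the double quotes around the timestamp when it re-echoes the row:

```shell
printf '%s\n' '1,275,,,275,17.3,0,"2011-05-09 20:21:45"' > file

# each row becomes a command like: echo 275,... >>2011_05_09.csv
sed 's/^[^,]*,\(.*"\(....\)-\(..\)-\(..\).*\)/echo \1 >>\2_\3_\4.csv/' file | sh

cat 2011_05_09.csv
# 275,,,275,17.3,0,2011-05-09 20:21:45
```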
potong