4

I have a regular expression which I tested successfully on http://regexpal.com/ :

^(\".+?\"),\d.+?,"X",-99,-99,-99,-99,-99,-99,-99,(\d*),(\d*)

Where my test data looks like:

"AB101AA",10,"X",-99,-99,-99,-99,-99,-99,-99,394251,806376,179,"S00","SN9","00","QA","MH","X"
"AB101AF",10,"X",-99,-99,-99,-99,-99,-99,-99,394181,806429,179,"S00","SN9","00","QA","MH","X"
"AB101AG",10,"X",-99,-99,-99,-99,-99,-99,-99,394251,806376,179,"S00","SN9","00","QA","MH","X"
"AB101AH",10,"X",-99,-99,-99,-99,-99,-99,-99,394371,806359,179,"S00","SN9","00","QA","MH","X"
"AB101AJ",10,"X",-99,-99,-99,-99,-99,-99,-99,394171,806398,179,"S00","SN9","00","QA","MH","X"
"AB101AL",10,"X",-99,-99,-99,-99,-99,-99,-99,394331,806530,179,"S00","SN9","00","QA","MH","X"

I want to replace it with \1,\2,\3 on each line so for example line 1 would give

"AB101AA",394251,806376

How can I run this regex search & replace against all csv files in my folder in osx? I tried using sed but that complains with a syntax error (plus I'm unsure it will support this regex?). Additionaly, will the ^ (begining of line) and $ (end of line) anchors work line by line, or will they match the begin and end of the file?

UPDATE: Some good responses with cut, awk ect that get specific fields from the csv, but I've recently learnt I need to take the numbers from that list and chop them into 2 sub-values, so my example output from above would need to look like:

"AB101AA",3,94251,8,06376

As far as I know, I need to use a regex for this.

Matt Roberts
  • 26,371
  • 31
  • 103
  • 180
  • For your second question, the answer is that in Javascript you have to activate the meaning of ^ and $ as matching at begining and ends of LINES, that's not automatic: http://www.regular-expressions.info/anchors.html – eyquem Jun 26 '11 at 20:47

3 Answers3

5

You would like to extract field 1, 11 and 12? For a task like this, awk or cut really excells! E.g.

awk -F, '{print $1, $11, $12}' input

using cut:

cut -d, -f1,11,12 input 

using perl. -a turns on autosplit mode – perl will automatically split input lines on whitespace into the @F array. -F is used in conjunction with -a, to choose the delimiter on which to split lines.

perl -F, -lane 'printf "%s, %d, %d\n", $F[0], $F[10], $F[11]' input 

...and finally, a pure bash solution

#!/bin/bash
IFS=,
while read -ra ARRAY;
do
    echo ${ARRAY[0]}, ${ARRAY[10]}, ${ARRAY[11]}
done < input
Fredrik Pihl
  • 44,604
  • 7
  • 83
  • 130
  • Wow thanks, I had no idea you could do that stuff with awk and cut. I've just found that I need to chop up my number fields so I think I need to stick with a regex, but this is good info. – Matt Roberts Jun 27 '11 at 12:03
3
for file in *csv; do
    cp $file "${file}.bak && \
    awk -F "," 'BEGIN OFS=","} {print $1,$11,$12}' ${file}.bak > ${file}
done

Or

sed -i.bak 's/^\("[^"]\+"\),\d\+,"X",-99,-99,-99,-99,-99,-99,-99,\([0-9]\+\),\([0-9]\+\)/\1,\2,\3/' FILE(S)

E.g:

$ sed 's/^\("[^"]\+"\),[0-9]\+,"X",-99,-99,-99,-99,-99,-99,-99,\([0-9]\+\),\([0-9]\+\).*/\1,\2,\3/' <<EOF                                                   
"AB101AA",10,"X",-99,-99,-99,-99,-99,-99,-99,394251,806376,179,"S00","SN9","00","QA","MH","X"       
"AB101AF",10,"X",-99,-99,-99,-99,-99,-99,-99,394181,806429,179,"S00","SN9","00","QA","MH","X"
"AB101AG",10,"X",-99,-99,-99,-99,-99,-99,-99,394251,806376,179,"S00","SN9","00","QA","MH","X"
"AB101AH",10,"X",-99,-99,-99,-99,-99,-99,-99,394371,806359,179,"S00","SN9","00","QA","MH","X"
"AB101AJ",10,"X",-99,-99,-99,-99,-99,-99,-99,394171,806398,179,"S00","SN9","00","QA","MH","X"
"AB101AL",10,"X",-99,-99,-99,-99,-99,-99,-99,394331,806530,179,"S00","SN9","00","QA","MH","X"
EOF   
"AB101AA",394251,806376
"AB101AF",394181,806429
"AB101AG",394251,806376
"AB101AH",394371,806359
"AB101AJ",394171,806398
"AB101AL",394331,806530
$   

HTH

Zsolt Botykai
  • 50,406
  • 14
  • 85
  • 110
  • Thanks - I wasn't wrapping my sed command in quotes which was giving my initial error. Trying again with the command sed -i.bak 's/^(\".+?\"),\d.+?,"X",-99,-99,-99,-99,-99,-99,-99,(\d*),(\d*)/\1,\2,\3/' dh.csv - gives sed: 1: "s/^(\".+?\"),\d.+?,"X", ...": \1 not defined in the RE – Matt Roberts Jun 27 '11 at 12:06
  • You did not used my regex, that's start with: `'s/^\("[^`, and you had used (at least in your comment): `s/^(\".+?\`. In `sed` regexes you must escape `(` pairs for backrefferncing to work. – Zsolt Botykai Jun 27 '11 at 12:20
  • Ah, you're correct, I didn't ;) I tried it but the output was the same - I am assuming this is a fault in my regex, although I'm not sure why yet.. Did you find it worked for you with the test data I posted? – Matt Roberts Jun 27 '11 at 12:41
  • I made a mistake too (forgot that sed doesn't handle `\d` ad `[0-9]`), but updated the solution. So now it works for me. – Zsolt Botykai Jun 27 '11 at 12:59
  • Gah! That doesn't work for me! Running exactly what you run is giving me the same output as the input. I'm on osx which uses the BSD version of sed - looks like there are some differences in how this works.. – Matt Roberts Jun 27 '11 at 19:53
  • GNU-Version of sed installed. All works. Great stuff - thanks for your help on that one! – Matt Roberts Jun 27 '11 at 20:09
  • Also, see this: http://stackoverflow.com/questions/2320564/variations-of-sed-between-osx-and-gnu-linux – Zsolt Botykai Jun 27 '11 at 20:30
0
cd folder
for file in $(find . -type f -name '*.csv')
do
    echo $file
    awk -F"," '{printf("%s,%s,%s\n", $1, $11, $12)}' $file > /tmp/${file}.$$
    #awk -F"," '/^(\".+?\"),[0-9]+?,"X",-99,-99,-99,-99,-99,-99,-99,([0-9]+),([0-9]+)/ {printf("%s,%s,%s\n", $1, $11, $12)}' $file > /tmp/${file}.$$
    #mv /tmp/${file}.$$ ${file}
done

Comment first awk and uncomment second awk, if you need the regular exp. Uncomment last mv after testing.

amit_g
  • 30,880
  • 8
  • 61
  • 118