I am somewhat new to the Linux environment. I looked all over for an answer to this -- apologies if this has been asked before.
I wrote an awk script that operates on a big text file (11 GB, 40 columns, 48M rows). The script is called "cycle.awk". It replaces a column with a new version of it, and it requires the data to be sorted by that particular column first. To run the script on all the columns, I wrote a bash command like this:
cat input.csv |
sort -k 22 -t "," | awk -v val=22 -f cycle.awk |
sort -k 23 -t "," | awk -v val=23 -f cycle.awk |
sort -k 24 -t "," | awk -v val=24 -f cycle.awk |
sort -k 25 -t "," | awk -v val=25 -f cycle.awk |
sort -k 26 -t "," | awk -v val=26 -f cycle.awk |
sort -k 27 -t "," | awk -v val=27 -f cycle.awk |
sort -k 28 -t "," | awk -v val=28 -f cycle.awk |
sort -k 29 -t "," | awk -v val=29 -f cycle.awk |
sort -k 30 -t "," | awk -v val=30 -f cycle.awk |
sort -k 31 -t "," | awk -v val=31 -f cycle.awk |
sort -k 32 -t "," | awk -v val=32 -f cycle.awk |
sort -k 33 -t "," | awk -v val=33 -f cycle.awk |
sort -k 34 -t "," | awk -v val=34 -f cycle.awk |
sort -k 35 -t "," | awk -v val=35 -f cycle.awk |
sort -k 36 -t "," | awk -v val=36 -f cycle.awk |
sort -k 37 -t "," | awk -v val=37 -f cycle.awk |
sort -k 38 -t "," | awk -v val=38 -f cycle.awk |
sort -k 39 -t "," | awk -v val=39 -f cycle.awk |
sort -k 40 -t "," | awk -v val=40 -f cycle.awk |
sort -k 41 -t "," | awk -v val=41 -f cycle.awk > output.csv
I figure there must be a more elegant way to do this. How can I write a bash script that lets me pass in the columns I want to apply my awk script to, and then runs this kind of piping procedure without producing any temporary data files? I am avoiding temporary files because the input file is so large and I want the best performance I can get.
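To make the request concrete, here is a rough sketch of the kind of thing I am imagining. The function name cycle_columns is hypothetical, and it assumes cycle.awk is in the current directory. It takes the column numbers as arguments and chains one sort | awk stage per column by calling itself on the remaining columns, so the whole thing stays one pipeline. Note that it uses -k N,N so the sort key is only that one field, whereas -k N on its own runs from field N to the end of the line.

```shell
#!/usr/bin/env bash
# Sketch only: cycle_columns is a hypothetical name, not an existing tool.
# It reads CSV on stdin and, for each column number given as an argument,
# sorts on that column and pipes the result through cycle.awk.
cycle_columns() {
    if [ "$#" -eq 0 ]; then
        cat    # no columns left: pass the stream through unchanged
        return
    fi
    local col=$1
    shift
    # -k "$col,$col" limits the sort key to that single field
    sort -t ',' -k "$col,$col" |
        awk -v val="$col" -f cycle.awk |
        cycle_columns "$@"
}

# Intended use (not run here):
#   cycle_columns {22..41} < input.csv > output.csv
```

I have not timed this against the hand-written pipeline; it should start the same sort and awk processes, plus one trailing cat.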
BTW, the script is as follows. It basically shortens the values of some columns for the purpose of compressing the text file. Any pointers on how to tighten it up? This procedure takes about 10 hours to run.
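# val is the 1-based column number, passed in with -v val=N
BEGIN { FS = ","; OFS = ","; count = 1 }
NR == 1 { temp = $val }          # remember the first row's value
{
    if (temp != $val) {          # value changed: start a new group
        temp = $val
        count++
    }
    $val = count                 # replace the value with its group number
    print $0
}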
Input typically looks something like this:
id,c1
1,abcd
2,efgh
3,abcd
4,abcd
5,efgh
where the corresponding output would be:
id,c1
1,1
2,2
3,1
4,1
5,2
Technically, the output rows would come out sorted by c1, but that's not the point.
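For completeness, here is a small sketch of what the script actually emits for the sample once it is sorted by c1. It assumes the header line is dropped first (the tail -n +2 step is my addition for this demo only), and run_sample is just a hypothetical wrapper name. The awk program is the same as cycle.awk, inlined so the snippet runs on its own.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the small example with the header line removed.
run_sample() {
    printf '%s\n' 'id,c1' '1,abcd' '2,efgh' '3,abcd' '4,abcd' '5,efgh' |
        tail -n +2 |                 # assumption: skip the header row
        sort -t ',' -k 2,2 |         # sort on column 2 (c1) only
        awk -v val=2 '
            BEGIN { FS = ","; OFS = ","; count = 1 }
            NR == 1 { temp = $val }
            {
                if (temp != $val) { temp = $val; count++ }
                $val = count
                print $0
            }'
}

run_sample
# prints:
# 1,1
# 3,1
# 4,1
# 2,2
# 5,2
```

The ids come out in sorted-by-c1 order (1, 3, 4, 2, 5), but each id gets the same number as in the output shown above.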