I am attempting to rewrite some of my old bash scripts that I think are very inefficient (not to mention inelegant) and use some horrid piping... Perhaps somebody with real Python skills can give me some pointers...
The script makes use of multiple temp files... another thing I consider bad style and that can probably be avoided...
It essentially manipulates INPUT-FILE by first cutting a certain number of lines from the top (discarding the heading). Then it pulls out one of the columns and:
- calculates the number of rows, N;
- throws out all duplicate entries from this single-column file (I use sort -u -n FILE > S-FILE).

After that, I create a sequential integer index from 1 to N and paste this new index column into the original INPUT-FILE using the paste command.
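For reference, here is roughly how I imagine this first stage could look in Python: a minimal sketch, assuming whitespace-delimited columns, where HEADER_LINES, COL and the literal filename are my placeholders for whatever the real script uses:

    HEADER_LINES = 2   # placeholder: number of heading lines to discard
    COL = 3            # placeholder: zero-based index of the column to pull out

    with open('INPUT-FILE') as f:
        lines = f.readlines()[HEADER_LINES:]   # cut the heading off the top

    rows = [line.split() for line in lines]
    n = len(rows)                              # N = number of data rows

    # Unique values of the chosen column in numeric order:
    # the in-memory equivalent of: sort -u -n FILE > S-FILE
    unique_vals = sorted({float(r[COL]) for r in rows})

    # Paste a sequential 1..N index onto the original rows; no paste
    # command and no temp files needed, everything stays in memory.
    indexed_rows = [[str(i)] + r for i, r in enumerate(rows, start=1)]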
My bash script then generates percentile ranks for the values we wrote into S-FILE. I believe Python could leverage scipy.stats for this, while in bash I determine the number of duplicate lines (dupline) for each unique entry in S-FILE and then calculate

    per-rank=$((100*($counter+$dupline/2)/$length))

where $length is the length of FILE, not of S-FILE. I then print the results into a separate one-column file (repeating the same per-rank as many times as there are duplines). I would then paste this new column with the percentile ranks back into INPUT-FILE (since I sort INPUT-FILE by the column used for calculating the percentile ranks, everything lines up perfectly in the result).
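If I read my own formula right, it is exactly what scipy.stats.percentileofscore computes with kind='mean' (ties count half), so the whole rank-and-paste dance might collapse to something like this: a sketch continuing from the variables above, where col_vals holds the full column from INPUT-FILE, duplicates included:

    from scipy import stats

    # The full column, duplicates included: $length is len(col_vals),
    # i.e. the length of FILE and not of S-FILE.
    col_vals = [float(r[COL]) for r in rows]

    # kind='mean' computes 100*(count_below + count_equal/2)/length,
    # the same as per-rank=$((100*($counter+$dupline/2)/$length))
    pct_ranks = [stats.percentileofscore(col_vals, v, kind='mean')
                 for v in col_vals]

    # Tag each row with its rank directly; since the rank is computed
    # per row, no sorting and pasting is needed to line things up.
    ranked_rows = [r + [f'{p:.2f}'] for r, p in zip(indexed_rows, pct_ranks)]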
After this, it goes into the ugliness below...
    sort -o $INPUT-FILE $INPUT-FILE
    awk 'int($4)>2000' $INPUT-FILE | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 2000-$INPUT-FILE
    diff $INPUT-FILE 2000-$INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>1000' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 1000-$INPUT-FILE
    cat 2000-$INPUT-FILE 1000-$INPUT-FILE | sort > merge-$INPUT-FILE
    diff merge-$INPUT-FILE $INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>500' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 500-$INPUT-FILE
    rm merge-$INPUT-FILE
Essentially, this is a very inelegant bash way of doing the following (see the Python sketch after this list):

1. RANDOMLY select 500 lines from $INPUT-FILE where the value in column 4 is greater than 2000 and write them out to file 2000-$INPUT-FILE.
2. From all REMAINING lines in $INPUT-FILE, randomly select 500 lines where the value in column 4 is greater than 1000 and write them out to file 1000-$INPUT-FILE.
3. From all lines REMAINING in $INPUT-FILE after 1) and 2), randomly select 500 lines where the value in column 4 is greater than 500 and write them out to file 500-$INPUT-FILE.
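In Python I imagine the whole tiered selection reducing to one loop; a sketch continuing from ranked_rows above. random.sample replaces the rand()/sort/head shuffle in one call, and COL4 is my assumption for where awk's $4 ends up (adjust it if the pasted index column shifted the layout):

    import random

    COL4 = 3   # assumed zero-based position of awk's $4; adjust to your layout

    def write_rows(path, selected):
        with open(path, 'w') as f:
            for r in selected:
                f.write('\t'.join(r) + '\n')

    remaining = list(ranked_rows)
    for threshold in (2000, 1000, 500):
        # indices of still-unpicked lines whose column 4 exceeds the threshold
        eligible = [i for i, r in enumerate(remaining)
                    if int(float(r[COL4])) > threshold]
        # min() guards against having fewer than 500 eligible lines
        chosen = set(random.sample(eligible, min(500, len(eligible))))
        write_rows(f'{threshold}-INPUT-FILE', [remaining[i] for i in sorted(chosen)])
        # later tiers only ever see the REMAINING (unpicked) lines
        remaining = [r for i, r in enumerate(remaining) if i not in chosen]

Each tier filters from what is left over, so a line picked at the 2000 level can never reappear at the 1000 or 500 level, which is what all the diff/merge juggling in the bash was for.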
Again, I am hoping somebody can help me rework this ugly piping into a thing of Python beauty! :) Thanks!