
I am attempting to rewrite some of my old bash scripts that I think are very inefficient (not to mention inelegant) and use some horrid piping...Perhaps somebody with real Python skills can give me some pointers...

The script makes use of multiple temp files... another thing I think is bad style and can probably be avoided...

It essentially manipulates INPUT-FILE by first cutting a certain number of lines from the top (discarding the heading).
Then it pulls out one of the columns and:

  • calculates the number of rows = N;
  • throws out all duplicate entries from this single-column file (I use sort -u -n FILE > S-FILE).

After that, I create a sequential integer index from 1 to N and paste this new index column into the original INPUT-FILE using the paste command.
My bash script then generates Percentile Ranks for the values we wrote into S-FILE.
I believe Python could leverage scipy.stats for this; in bash I determine the number of duplicate lines (dupline) for each unique entry in S-FILE, and then calculate per-rank=$((100*($counter+$dupline/2)/$length)), where $length is the length of FILE, not S-FILE. I then print the results into a separate one-column file (repeating the same per-rank as many times as there are duplines).
I would then paste this new column with percentile ranks back into INPUT-FILE (since I sort INPUT-FILE by the column used for the calculation of percentile ranks, everything lines up perfectly in the result).
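
For the percentile-rank step I imagine something like the rough sketch below (untested; it assumes tab-delimited input, a fixed number of heading lines, and that the ranking column is the second one -- HEADER_LINES, VALUE_COLUMN and the file names are just placeholders). scipy.stats.percentileofscore with kind="mean" should give the same 100*(counter + dupline/2)/length value as my bash formula, with no temp files or sorting needed:

import csv
from scipy import stats

HEADER_LINES = 2     # number of heading lines to discard (placeholder)
VALUE_COLUMN = 1     # 0-based index of the column ranks are computed from (placeholder)

with open("INPUT-FILE") as f:
    rows = list(csv.reader(f, delimiter="\t"))[HEADER_LINES:]

values = [float(row[VALUE_COLUMN]) for row in rows]

with open("ranked-INPUT-FILE", "w") as out:
    for index, (row, v) in enumerate(zip(rows, values), start=1):
        # kind="mean" averages the strict and weak counts,
        # i.e. 100 * (count_below + count_equal/2) / N
        prank = stats.percentileofscore(values, v, kind="mean")
        out.write("\t".join(row + [str(index), str(prank)]) + "\n")

(Calling percentileofscore once per row is O(N^2); if that turns out to be too slow for ~50k lines, scipy.stats.rankdata can produce the same ranking information in a single pass.)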

After this, it goes into the ugliness below...

sort -o $INPUT-FILE $INPUT-FILE

awk 'int($4)>2000' $INPUT-FILE | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 2000-$INPUT-FILE

diff $INPUT-FILE 2000-$INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>1000' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 1000-$INPUT-FILE

cat 2000-$INPUT-FILE 1000-$INPUT-FILE | sort > merge-$INPUT-FILE

diff merge-$INPUT-FILE $INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>500' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 500-$INPUT-FILE

rm merge-$INPUT-FILE

Essentially, this is a very inelegant bash way of doing the following:

  1. RANDOMLY select 500 lines from $INPUT-FILE where the value in column 4 is greater than 2000 and write them out to file 2000-$INPUT-FILE
  2. For all REMAINING lines in $INPUT-FILE, randomly select 500 lines where the value in column 4 is greater than 1000 and write them out to file 1000-$INPUT-FILE
  3. For all lines REMAINING in $INPUT-FILE after 1) and 2), randomly select 500 lines where the value in column 4 is greater than 500 and write them out to file 500-$INPUT-FILE

Again, I am hoping somebody can help me in reworking this ugly piping thing into a thing of python beauty! :) Thanks!

  • Can you add one example line from the file and an example of you invoking this script from the command line? – Jason Sperske Mar 25 '13 at 22:14
  • Please look at the post preview before posting and make sure to avoid walls of text. It will only ever get your question ignored because it's too hard to decipher what's being said. – Serdalis Mar 25 '13 at 22:14
  • Some libraries to look at: `tempfile`, `difflib`, `random` – monkut Mar 25 '13 at 22:16
  • As Serdalis mentioned, this is quite extensive. You may be better off getting started and breaking this up into parts, asking questions as you hit problems. – monkut Mar 25 '13 at 22:18
  • @Serdalis Thank you for moderating my post, Sir! I agree, it's somewhat wordy...just trying to give a general background to the actual meat of the question, which starts with "After this, it goes into the ugliness below..". That's the main part I am curious to re-write. Other things are reasonably trivial... – John Durand Mar 25 '13 at 22:43
  • @monkut I am mainly interested in re-writing the block listed after the phrase "After this, it goes into the ugliness below...". Thanks, and point about denseness of the post well taken! :) – John Durand Mar 25 '13 at 22:44
  • @JasonSperske sure, Jason...so, the input file will have about 50k lines that looks like this: name 123456 765432 345676 (i.e. first column is a text string, the rest are number strings, total of 9 columns). After my calculation of percentile rank and index, you'll have a file with the same 9 columns of data + 2 additional columns of index and p-rank = total of 11 columns – John Durand Mar 25 '13 at 22:48
  • I'm looking for things like separating characters, inputs passed from the command line, stuff like that. It would be helpful to add this to the question text (you have more room for formatting) – Jason Sperske Mar 25 '13 at 22:52
  • Here is a Python script that will read random lines from a file: http://stackoverflow.com/a/3010061/16959 it would be much extra work to adapt this to your needs – Jason Sperske Mar 25 '13 at 22:57
  • @JasonSperske original file is tab delimited. the bash script itself is significantly larger than the section i described above... it goes into more manipulations and also calls out certain command line executables that aren't part of python. in fact, some of the awk manipulations i was doing end up making the file space delimited, but I then make it tab delimited again by running sed, i.e. `sed --posix -i -e 's/ /\t/g' FILE` – John Durand Mar 25 '13 at 22:58
  • @JasonSperske right... this script can be adapted... what i am having difficulty with is dealing with random selection out of the REMAINING lines after a new comparison parameter is used (>1000 vs >2000), i.e. the transition from step 1 to step 2 in my list. – John Durand Mar 25 '13 at 23:04

2 Answers


Two crucial points in the comments:

(A) The file is ~50k lines of ~100 characters. Small enough to comfortably fit in memory on modern desktop/server/laptop systems.

(B) The author's main question is about how to keep track of lines that have already been chosen, so that they are not chosen again.

I suggest three steps.

(1) Go through the file, making three separate lists -- call them u, v, w -- of the line numbers which satisfy each of the criteria. These lists may hold more than 500 line numbers, and the same line number can land in more than one list (a line whose column 4 is above 2000 also satisfies the other two criteria), but we will deal with both issues in step (2).

u = []
v = []
w = []

with open(filename, "r") as f:
    for linenum, line in enumerate(f):
        x = int(line.split()[3])   # value in column 4 (whitespace-delimited)
        if x > 2000:
            u.append(linenum)
        if x > 1000:
            v.append(linenum)
        if x > 500:
            w.append(linenum)

(2) Choose line numbers. You can use the built-in random.sample() to pick a sample of k elements from a population. We want to exclude line numbers that have already been chosen, so keep track of those in a set. (The "chosen" collection is a set instead of a list because the test "if x not in chosen" is O(1) on average for a set but O(n) for a list. Change it to a list and you'll see a slowdown if you measure the timings precisely, though it might not be a noticeable delay for a data set of "only" 50k data points / 500 samples / 3 categories.)

import random
rand = random.Random()       # change to random.Random(1234) for repeatable results

chosen = set()
s0 = rand.sample(u, 500)
chosen.update(s0)
s1 = rand.sample([x for x in v if x not in chosen], 500)
chosen.update(s1)
s2 = rand.sample([x for x in w if x not in chosen], 500)
chosen.update(s2)

(3) Do another pass through the input file, putting lines whose numbers are s0 into your first output file, lines whose numbers are in s1 into your second output file, and lines whose numbers are in s2 into your third output file. It's pretty trivial in any language, but here's an implementation which uses Python "idioms":

linenum2sample = dict([(x, 0) for x in s0]+[(x, 1) for x in s1]+[(x, 2) for x in s2])

outfile = [open("-".join([x, filename]), "w") for x in ["2000", "1000", "500"]]

try:
    with open(filename, "r") as f:
        for linenum, line in enumerate(f):
            s = linenum2sample.get(linenum)
            if s is not None:
                outfile[s].write(line)
finally:
    for f in outfile:
        f.close()
picomancer
  • thank you for a detailed answer! couple of comments... 1) in the third chunk of code the join() seems to work only when i put it in as join([x, filename]) 2) I've changed the sample size from 500 to 100 for testing and the resulting three files seem to fluctuate in size, i.e. the first run on a big file with over 76k lines produced three files with 98, 100, and 100 lines respectively. A second run on the SAME file produced three files with 100, 98, and 94 lines each. A third run turned up files with 99, 98, and 99 lines. I am not quite following as to why... any ideas?? – John Durand Mar 27 '13 at 03:50
  • apparently, there is a possibility of duplicates showing up in s0, s1, and s2...and dict() doesn't create duplicate entries - i.e. the number of lines in the output files gets reduced by the number of duplicates found in each of the sets... – John Durand Mar 27 '13 at 08:40
  • ok, i bypassed this issue by using the random.sample(set(u), 500) construct, etc. Thank You! Looks like the solution holds up with small corrections! :) – John Durand Mar 27 '13 at 08:50

Break it up into easy pieces.

  1. Read the file using csv.DictReader, or csv.reader if the headers are unusable. As you're iterating through the lines, check the value of column 4 and insert the lines into a dictionary of lists where the dictionary keys are something like 'gt_2000', 'gt_1000', 'gt_500'.

  2. Iterate through your dictionary keys; for each one, create a file and loop 500 times, on each iteration using random.randint(0, len(the_list)-1) to get a random index into the list, writing that line to the file, and then deleting the item at that index from the list so it can't be picked again. If there could ever be fewer than 500 items in any bucket then this will require a tiny bit more. A rough sketch of the idea follows below.
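
Something like this (untested; it assumes tab-delimited input with no usable header, and that "column 4" means index 3 -- the bucket names and file names are just illustrative):

import csv
import random

thresholds = [("gt_2000", 2000), ("gt_1000", 1000), ("gt_500", 500)]
buckets = dict((name, []) for name, _ in thresholds)

with open("INPUT-FILE") as f:
    for row in csv.reader(f, delimiter="\t"):
        value = int(row[3])
        # file each line under the highest threshold it clears
        for name, limit in thresholds:
            if value > limit:
                buckets[name].append(row)
                break

for name, _ in thresholds:
    lines = buckets[name]
    with open(name + "-INPUT-FILE", "w") as out:
        for _ in range(min(500, len(lines))):
            # pick a random remaining line and remove it so it can't be chosen twice
            index = random.randint(0, len(lines) - 1)
            out.write("\t".join(lines.pop(index)) + "\n")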

ShawnMilo