I'm trying to build a few sets of int pairs from a huge file. Each set in a typical file is built from about a few million lines. I wrote the code below, but it takes over 36 hours for just one set built from 2 million lines!
Input file (a few million lines like this) starts with:

```
*|NET 2 0.000295965PF
... //unwanted sections
R2_42 2:1 2:2 3.43756e-05 $a=2.909040 $lvl=99 $llx=15.449 $lly=9.679 $urx=17.309 $ury=11.243
R2_43 2:2 2:3 0.805627 $l=0.180 $w=1.564 $lvl=71 $llx=16.199 $lly=9.679 $urx=16.379 $ury=11.243 $dir=0
R2_44 2:2 2:4 4.16241 $l=0.930 $w=1.564 $lvl=71 $llx=16.379 $lly=9.679 $urx=17.309 $ury=11.243 $dir=0
R2_45 2:3 2:5 0.568889 $a=0.360000 $lvl=96 $llx=15.899 $lly=10.185 $urx=16.499 $ury=10.785
R2_46 2:3 2:6 3.35678 $l=0.750 $w=1.564 $lvl=71 $llx=15.449 $lly=9.679 $urx=16.199 $ury=11.243 $dir=0
R2_47 2:5 2:7 0.0381267 $l=0.301 $w=0.600 $lvl=8 $llx=16.199 $lly=10.200 $urx=16.500 $ury=10.800 $dir=0
R2_48 2:5 2:8 0.0378733 $l=0.299 $w=0.600 $lvl=8 $llx=15.900 $lly=10.200 $urx=16.199 $ury=10.800 $dir=0
*|NET OUT 0.000895965PF
...etc
```
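For reference, the two node names I want are just the second and third whitespace-separated fields of each `R` line, so a plain `str.split` could replace the regex. A minimal sketch (the helper name `parse_r_line` is made up, it's not in my code):

```python
def parse_r_line(line):
    # Columns are: name, node1, node2, value, then $key=value attributes.
    fields = line.split()
    return fields[1], fields[2]
```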
Finally, I need to build a set of integer pairs from the above, where the integers are indexes into a list built from columns 2 and 3 of the file. For example, [(2:1,2:2), (2:2,2:3), (2:2,2:4), (2:3,2:5), (2:3,2:6), (2:5,2:7), (2:5,2:8)] becomes [(0,1),(1,2),(1,3),(2,4),(2,5),(4,6),(4,7)].
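For the record, that mapping can be expressed with a single dict that assigns indexes in first-seen order, so each lookup is O(1). This is a sketch, not my actual code (note it indexes in encounter order, unlike my unordered `f6` below), but it reproduces the example above exactly:

```python
def pairs_to_indices(allspf):
    # Assign each distinct node name a stable integer index on first sight,
    # then translate every (left, right) name pair into an index pair.
    index = {}
    result = []
    for left, right in allspf:
        for name in (left, right):
            if name not in index:
                index[name] = len(index)
        result.append((index[left], index[right]))
    return result
```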
I coded this:
```python
import ast
import itertools
import json
import re

# construct_trees_by_TingYu comes from
# http://xahlee.info/python/python_construct_tree_from_edge.html

def f6(seq):
    # Not order preserving
    myset = set(seq)
    return list(myset)

if __name__ == '__main__':
    with open('myspf') as infile, open('tmp', 'w') as outfile:
        copy = False
        allspf = []
        for line in infile:
            if line.startswith("*|NET 2"):
                copy = True
            elif line.strip() == "":
                copy = False
            elif copy:
                # capture col2 and col3
                if line.startswith("R"):
                    allspf.extend(re.findall(r'^R.*?\s(.*?)\s(.*?)\s', line))
        final = f6(list(itertools.chain(*allspf)))  # to get a unique list
        # build the finalpairs again by index: I've found this is the bottleneck
        for x in allspf:
            left, right = x
            outfile.write("({},{}),".format(final.index(left), final.index(right)))

    pair = []
    with open('tmp') as f:
        pair = list(ast.literal_eval(f.read()))

    with open('hopespringseternal.txt', 'w') as fopen:
        fopen.write(json.dumps(construct_trees_by_TingYu(pair), indent=1))
```
The bottleneck is the `for x in allspf` loop, and the `construct_trees_by_TingYu` procedure itself also ran out of memory after I fed it the set of millions of items. That procedure requires the entire set all at once: http://xahlee.info/python/python_construct_tree_from_edge.html
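One likely culprit in that loop is `list.index`, which is an O(n) scan, so every pair costs two passes over a multi-million-entry list. A way to keep the surrounding code shape while making each lookup O(1) is to build a value-to-index dict once. A self-contained sketch (it uses `sorted` instead of my unordered `f6` so the indexes are deterministic, which my code does not guarantee):

```python
import itertools

def index_pairs(allspf):
    # Build the unique name list once, then a name -> position dict,
    # so each lookup is O(1) instead of an O(n) list.index scan.
    final = sorted(set(itertools.chain.from_iterable(allspf)))
    pos = {value: i for i, value in enumerate(final)}
    return [(pos[left], pos[right]) for left, right in allspf]
```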
The final output is a tree from parent to child:
```json
{
 "3": {"1": {"0": {}}},
 "5": {"2": {"1": {"0": {}}}},
 "6": {"4": {"2": {"1": {"0": {}}}}},
 "7": {"4": {"2": {"1": {"0": {}}}}}
}
```
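For comparison, the sample tree above can be rebuilt from the (parent, child) index pairs with nothing but a child-to-parent dict, walking from each leaf up to the root. This is only an illustrative sketch under my reading of the output format (leaves as top-level keys, paths descending to the root), not the `construct_trees_by_TingYu` procedure:

```python
def build_tree(pairs):
    # pairs are (parent, child) index tuples.
    parent = {child: par for par, child in pairs}   # child -> parent lookup
    parents = {par for par, _ in pairs}
    leaves = [c for c in parent if c not in parents]  # nodes that are never parents
    tree = {}
    for leaf in leaves:
        # Nest each leaf's chain of ancestors: leaf -> ... -> root.
        node = tree.setdefault(str(leaf), {})
        cur = leaf
        while cur in parent:
            cur = parent[cur]
            node = node.setdefault(str(cur), {})
    return tree
```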