
I'm trying to build a few sets of int-pairs from a huge file. In a typical file, each set is built by parsing a few million lines. I wrote the code below, but it takes more than 36 hours to build just one set from 2 million lines!

The input file (a few million lines like this) starts with:

*|NET 2 0.000295965PF
... //unwanted sections
R2_42 2:1 2:2 3.43756e-05 $a=2.909040 $lvl=99 $llx=15.449 $lly=9.679 $urx=17.309 $ury=11.243
R2_43 2:2 2:3 0.805627 $l=0.180 $w=1.564 $lvl=71 $llx=16.199 $lly=9.679 $urx=16.379 $ury=11.243 $dir=0
R2_44 2:2 2:4 4.16241 $l=0.930 $w=1.564 $lvl=71 $llx=16.379 $lly=9.679 $urx=17.309 $ury=11.243 $dir=0
R2_45 2:3 2:5 0.568889 $a=0.360000 $lvl=96 $llx=15.899 $lly=10.185 $urx=16.499 $ury=10.785
R2_46 2:3 2:6 3.35678 $l=0.750 $w=1.564 $lvl=71 $llx=15.449 $lly=9.679 $urx=16.199 $ury=11.243 $dir=0
R2_47 2:5 2:7 0.0381267 $l=0.301 $w=0.600 $lvl=8 $llx=16.199 $lly=10.200 $urx=16.500 $ury=10.800 $dir=0
R2_48 2:5 2:8 0.0378733 $l=0.299 $w=0.600 $lvl=8 $llx=15.900 $lly=10.200 $urx=16.199 $ury=10.800 $dir=0

*|NET OUT 0.000895965PF
...etc

Finally, I need to build a set of integer pairs from the above, where the integers are the indexes of a unique list built from columns 2 and 3 of the file: [(2:1,2:2), (2:2,2:3), (2:2,2:4), (2:3,2:5), (2:3,2:6), (2:5,2:7), (2:5,2:8)] becomes [(0,1),(1,2),(1,3),(2,4),(2,5),(4,6),(4,7)].
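
To illustrate just the mapping I mean, here is a toy example with a hand-made node list (not my real code, only to show the expected output):

nodes = ["2:1", "2:2", "2:3", "2:4", "2:5", "2:6", "2:7", "2:8"]  # unique values from columns 2 and 3
pairs = [("2:1", "2:2"), ("2:2", "2:3"), ("2:5", "2:8")]
print([(nodes.index(a), nodes.index(b)) for a, b in pairs])
# [(0, 1), (1, 2), (4, 7)]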

I coded this:

import re
import itertools
import ast
import json

# construct_trees_by_TingYu comes from
# http://xahlee.info/python/python_construct_tree_from_edge.html

def f6(seq):
    # Not order preserving
    return list(set(seq))

if __name__ == '__main__':
    with open('myspf') as infile, open('tmp', 'w') as outfile:
        copy = False
        allspf = []
        for line in infile:
            if line.startswith("*|NET 2"):
                copy = True
            elif line.strip() == "":
                copy = False
            elif copy:
                # capture col2 and col3
                if line.startswith("R"):
                    allspf.extend(re.findall(r'^R.*?\s(.*?)\s(.*?)\s', line))
        final = f6(list(itertools.chain(*allspf)))  # to get a unique list
        # build the final pairs again by index: I've found this is the bottleneck
        for x in allspf:
            left, right = x
            outfile.write("({},{}),".format(final.index(left), final.index(right)))

    with open('tmp') as f:
        pair = list(ast.literal_eval(f.read()))

    with open('hopespringseternal.txt', 'w') as fopen:
        fopen.write(json.dumps(construct_trees_by_TingYu(pair), indent=1))

The bottleneck is in the 'for x in allspf' loop, and the construct_trees_by_TingYu procedure itself also ran out of memory once I passed it a set with millions of items. That procedure requires the entire set all at once: http://xahlee.info/python/python_construct_tree_from_edge.html

The final output is a tree from parent to child:

{
 "3": {
  "1": {
   "0": {}
  }
 },
 "5": {
  "2": {
   "1": {
    "0": {}
   }
  }
 },
 "6": {
  "4": {
   "2": {
    "1": {
     "0": {}
    }
   }
  }
 },
 "7": {
  "4": {
   "2": {
    "1": {
     "0": {}
    }
   }
  }
 }
}

1 Answer


Building a set is always O(n). You need to traverse the entire list to add each item to your set.

However, it does not look like you're even using the set operation in the code excerpt above.

If you are running out of memory, you probably want to iterate over the huge set, rather than wait for the entire set to be created and then pass it to construct_trees_by_TingYu (I have no idea what this is by the way). Also, you can create a generator to yield each item from the set, which will decrease your memory footprint. Whether 'construct_trees_by_TingYu' will handle a generator passed to it, I do not know.
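
For example, here is a rough, untested sketch of that idea, assuming the 'R...' line format shown in the question; the iter_index_pairs name is made up, and it also swaps the list.index lookups for a dict (which is not in the original code), so each pair can be produced one at a time:

def iter_index_pairs(path):
    """Yield (left_index, right_index) pairs one at a time."""
    index = {}  # node name -> index of first appearance
    with open(path) as infile:
        copy = False
        for line in infile:
            if line.startswith("*|NET 2"):
                copy = True
            elif line.strip() == "":
                copy = False
            elif copy and line.startswith("R"):
                cols = line.split()
                left, right = cols[1], cols[2]
                for name in (left, right):
                    index.setdefault(name, len(index))
                yield index[left], index[right]

If construct_trees_by_TingYu really does need a list, you can still materialize it with pairs = list(iter_index_pairs('myspf')) and at least avoid the repeated linear scans.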

Scott Skiles
  • The procedure came from this guy: http://xahlee.info/python/python_construct_tree_from_edge.html I have to send it the entire set. Yes, I'm a noob so I wasn't sure how to turn this into a set. – user2241323 Jun 29 '18 at 12:34
  • you should definitely link this in your question, and clarify what it is you are trying to do.... – Scott Skiles Jun 29 '18 at 12:36