
I have a file, dataset.nt, which isn't too large (300 MB). I also have a list containing around 500 elements. For each element of the list, I want to count the number of lines in the file that contain it, and add that key/value pair to a dictionary (the key being the name of the list element, and the value the number of times this element appears in the file).

This is the first thing I tried to achieve that result:

import re

mydict = {}

for i in mylist:
    regex = re.compile(r"/Main/"+re.escape(i))
    total = 0
    with open("dataset.nt", "rb") as input:
        for line in input:
            if regex.search(line):
                total = total+1
    mydict[i] = total

It didn't work (as in, it ran indefinitely), and I figured I should find a way not to read each line 500 times. So I tried this:

mydict = {}

with open("dataset.nt", "rb") as input:
    for line in input:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

Performance didn't improve; the script still ran indefinitely. So I googled around, and I tried this:

mydict = {}

file = open("dataset.nt", "rb")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        for i in mylist:
            regex = re.compile(r"/Main/"+re.escape(i))
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

That one has been running for the last 30 minutes, so I'm assuming it's not any better.

How should I structure this code so that it completes in a reasonable amount of time?

kormak
    With your last one, you _definitely_ want to move your regex creation out of the loops. Build them into a list first, then look up or something. – matsjoyce Oct 29 '14 at 17:31
  • this can be useful for part of it https://docs.python.org/2/library/collections.html#collections.Counter – Anentropic Oct 29 '14 at 17:38
  • Try to use `in` instead of regex and see if that helps. If there's any structure to the file such that those words only appear in certain positions, it may cut down the search too. – loopbackbee Oct 29 '14 at 17:38
  • @goncalopp Could you give me a documentation link for that 'in' function? – kormak Oct 29 '14 at 17:42
  • agree with @matsjoyce and goncalopp... it doesn't look like you need a regex, and if you did you should move the regex compilation out of the loop – Anentropic Oct 29 '14 at 17:43
  • @kormak Docs: https://docs.python.org/3.4/tutorial/datastructures.html#more-on-conditions – matsjoyce Oct 29 '14 at 17:46
  • possible duplicate of [In vs regular expressions with list of words](http://stackoverflow.com/questions/24611784/in-vs-regular-expressions-with-list-of-words). The case is almost the same, with the same methods applicable. – ivan_pozdeev Oct 29 '14 at 17:51

3 Answers


I'd favor a slight modification of your second version:

mydict = {}

re_list = [re.compile(r"/Main/"+re.escape(i)) for i in mylist]
with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue

        # do the regex-part
        for i, regex in zip(mylist, re_list):
            total = 0
            if regex.search(line):
                total = total+1
            mydict[i] = total

As @matsjoyce already suggested, this avoids re-compiling the regex on each iteration. If you really need that many different regex patterns then I don't think there's much you can do.

Maybe it's worth checking if you can regex-capture whatever comes after "/Main/" and then compare this to your list. That may help reduce the number of "real" regex searches.
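
For instance, here is a minimal, untested sketch of that idea. It assumes the names after "/Main/" consist of word characters, that one match per line is enough, and that mylist is already defined; the capturing pattern would have to be adjusted to your actual .nt format:

import re

# assumed: mylist is the list of ~500 names
targets = set(mylist)
# one regex that captures whatever follows "/Main/"
capture = re.compile(r"/Main/(\w+)")

mydict = dict.fromkeys(mylist, 0)
with open("dataset.nt", "rb") as input:
    for line in input:
        m = capture.search(line)
        # one set lookup instead of ~500 regex searches per line
        if m and m.group(1) in targets:
            mydict[m.group(1)] += 1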

sebastian
  • Since `mylist` and `re_list` are both invariant you could extract the `zip()` call out of the loop too. Also, for each line, you discard the previous total from `mydict[i]` – bruno desthuilliers Oct 29 '14 at 17:55

Looks like a good candidate for some map/reduce-like parallelisation... You could split your dataset file into N chunks (where N = how many processors you have), launch N subprocesses each scanning one chunk, then sum the results.
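
Here is a rough, untested sketch of that parallel part, reusing the optimized scan below as the per-chunk worker. Assumptions to flag: Python 2 (matching the question's use of "rb" with string patterns), a Unix-like OS so the workers inherit the globals via fork, mylist already defined, and simple line-based chunking:

import re
from multiprocessing import Pool

# mylist (the ~500 names) is assumed to be defined already
targets = [(i, re.compile(r"/Main/" + re.escape(i))) for i in mylist]

def count_chunk(lines):
    # count, for each name, how many lines of this chunk match it
    counts = dict.fromkeys(mylist, 0)
    for line in lines:
        if '/Main/' not in line:
            continue
        for i, regex in targets:
            if regex.search(line):
                counts[i] += 1
    return counts

if __name__ == '__main__':
    with open("dataset.nt", "rb") as input:
        all_lines = input.readlines()
    n = 4  # number of worker processes
    pool = Pool(n)
    # deal the lines round-robin into n chunks and scan them in parallel
    partials = pool.map(count_chunk, [all_lines[k::n] for k in range(n)])
    results = dict.fromkeys(mylist, 0)
    for counts in partials:
        for i in counts:
            results[i] += counts[i]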

This of course doesn't prevent you from first optimizing the scan, i.e. (based on sebastian's code):

targets = [(i, re.compile(r"/Main/"+re.escape(i))) for i in mylist]
results = dict.fromkeys(mylist, 0)

with open("dataset.nt", "rb") as input:
    for line in input:
        # any match has to contain the "/Main/" part
        # -> check it's there
        # that may help a lot or not at all
        # depending on what's in your file
        if '/Main/' not in line:
            continue 

        # do the regex-part
        for i, regex in targets:
            if regex.search(line):
                results[i] += 1

Note that this could be better optimized if you posted a sample from your dataset. If for example your dataset can be sorted on "/Main/{i}" (using the system sort program, for example), you wouldn't have to check each line for each value of i. Or if the position of "/Main/" in the line is known and fixed, you could use a simple string comparison on the relevant part of the string (which can be faster than a regexp).
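
As an illustration of that last point, a hypothetical sketch; the offset is made up and would have to match your actual file layout:

# hypothetical: assume "/Main/" always starts at column 8 of each line
OFFSET = 8

prefixes = [(i, "/Main/" + i) for i in mylist]
results = dict.fromkeys(mylist, 0)

with open("dataset.nt", "rb") as input:
    for line in input:
        rest = line[OFFSET:]
        for i, prefix in prefixes:
            # plain string comparison, no regex machinery involved
            if rest.startswith(prefix):
                results[i] += 1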

bruno desthuilliers
  • Oh and yes: optimization is a difficult art and we are often bad at guessing where the real bottlenecks are, so once you're done with the obvious (factoring invariants out of the loop etc.) your best bet is still to use a profiler. – bruno desthuilliers Oct 29 '14 at 18:52
  • This answer is a bit confusing to me - I don't have a very advanced knowledge of Python. 1) By "sorting on Main", do you mean sorting each line of the file alphabetically? 2) If the file was sorted, what would I have to change in order to avoid checking each line? – kormak Oct 29 '14 at 20:00
  • @kormak I meant "sorting lines according to the target values" - what is supposed to come after "/Main/". If you can easily sort the input (lines) on this you avoid the inner loop (=~500 iterations per line). – bruno desthuilliers Oct 29 '14 at 20:33

The other solutions are very good. But since there is a regex for each element, and it doesn't matter whether the element appears more than once per line, you could count the lines containing the target expression using re.findall.

Also, past a certain number of lines it is better to read the whole file into memory (if you have enough memory and it isn't a design restriction).

import re

mydict = {}
mylist = [...] # A list with 500 items

# Optimizing calls: bind the functions to local names so Python
# doesn't have to resolve them on every call
findall = re.findall
escape = re.escape

with open("dataset.nt", "rb") as input:
    # Read the file once and keep it in memory instead of accessing
    # it to read each line. If the number of lines is big, this is faster.
    text = input.read()
    for elem in mylist:
        # Count the lines where the target regex matches.
        mydict[elem] = len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text))

I tested this with an 800 MB file (I wanted to see how long it takes to load a file that big into memory; it's faster than you would think).

I didn't test the whole code with real data, just the findall part.

Raydel Miranda
  • I think you meant: len(findall(".*/Main/{0}.*\n+".format(escape(elem)), text)) Also, I'm not sure using * in a regex is a good idea when trying to improve performance? – kormak Oct 29 '14 at 19:06
  • You were right about the `, text))` missing part. But about the regular expression, well, you can modify it so as not to use `.*`; perhaps a positive lookahead would do the work. But I think you get the idea: load the entire file once, and optimize function calls using statements like `findall = re.findall`. – Raydel Miranda Oct 29 '14 at 20:35