10
  • I have a file of 300 million lines (inputFile), each with 2 columns separated by a tab.
  • I also have a list of 1000 unique items (vals).

I want to create a dictionary with column 1 as key and column 2 as value for all lines in inputFile where the first column occurs in vals. A few items in vals do not occur in the file; these values have to be saved in a new list. I can use up to 20 threads to speed up this process.

What is the fastest way to achieve this?

My best attempt so far:

import subprocess

newDict = {}
foundVals = []
# Build a single grep command whose pattern matches any of the wanted keys
cmd = "grep \"" + vals[0]
for val in vals:
    cmd = cmd + "\|^" + val + "[[:space:]]"
cmd = cmd + "\" " + self.inputFile
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
for line in iter(p.stdout.readline, ''):
    info = line.split()
    foundVals.append(info[0])
    newDict[info[0]] = info[1]
p.wait()
found = set(foundVals)  # build the set once, not on every comparison
notFound = [x for x in vals if x not in found]

Example inputFile:

2       9913
3       9913
4       9646
...
594592886       32630
594592888       32630
594592890       32630

vals:

[1,2,594592888]

wanted dictionary:

{2:9913,594592888:32630}

And in notFound:

[1]
Jetse
    Can you provide a sample of the input files and the desired output ? Just to make things clear. – jrjc Mar 18 '14 at 13:25
  • maybe try using Python's `open(file, 'r')` and see if that gives you a performance boost. – sshashank124 Mar 18 '14 at 13:28
  • @sshashank124 that was my first try, but this approach is much faster... – Jetse Mar 18 '14 at 13:35
  • @jeanrjc added an example – Jetse Mar 18 '14 at 13:36
  • @sshashank124, grep is incredibly well optimized code; using it to filter out the lines of interest will cut down on Python's load quite a bit (assuming only a small portion of the lines in the file are of interest). It has the added benefit of running in its own thread too. – Mark Ransom Mar 18 '14 at 13:59
  • @MarkRansom was a typo indeed, fixed – Jetse Mar 18 '14 at 14:01
  • If `val` is found on more than one line in your tsv file, you overwrite the earlier hits and end up storing only the most recent hit. That doesn't sound like what you want (going by "a dictionary with column 1 as key and column 2 as value for all lines in inputFile where the first columns occurs in vals"). – Alp Mar 18 '14 at 14:14
  • @Alp I didn't mention it, but the values in column 1 are ordered and unique. Your solution makes the grep command like this: grep "\|val1\|val2" file.txt instead of grep "val1\|val1\|val2" file.txt – Jetse Mar 18 '14 at 14:17
  • @Alp I'd change the `for` loop to a join with a generator: `cmd += "\\|".join("^" + val + "[[:space:]]" for val in vals)` then leave the `vals[0]` out. – Mark Ransom Mar 18 '14 at 15:09
  • 1
    @MarkRansom Just in the process of writing that! Thanks @Jetse: the crucial point is that if you include `vals[0]` without the anchor and the space character class you might get a false hit. – Alp Mar 18 '14 at 15:11
  • How long does `grep` take on its own to process the file, without Python involved? How long does it take just to copy the file to null? This will be your lower bound. – Mark Ransom Mar 18 '14 at 15:11
  • @MarkRansom Almost all the time is spent in `grep`. Since each key occurs only once in the data and there are only a 1000 of them, the amount of work being done in Python is trivial. – Alp Mar 18 '14 at 15:12
  • So the real question is whether there's any way to improve on `grep`, which is why I asked you to compare it to a simple copy. Grep is well optimized, but I don't know how efficient it is with such a complicated expression. It's important when you do the comparison to make sure the file isn't in the OS cache or it will produce overly optimistic times. – Mark Ransom Mar 18 '14 at 15:16

3 Answers

9

You clarified in a comment that each key occurs at most once in the data. It follows from that and the fact that there are only 1000 keys that the amount of work being done in Python is trivial; almost all your time is spent waiting for output from grep. Which is fine; your strategy of delegating line extraction to a specialized utility remains sound. But it means that performance gains have to be found on the line-extraction side.

You can speed things up some by optimizing your regex. E.g., instead of

^266[[:space:]]\|^801[[:space:]]\|^810[[:space:]]

you could use:

^\(266\|801\|810\)[[:space:]]

so that the anchor doesn't have to be separately matched for each alternative. I see about a 15% improvement on test data (10 million rows, 25 keys) with that change.
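
As an illustration (not from the answer itself; the key list and file name are placeholders), building that grouped, single-anchor pattern in Python and handing it to grep as an argument list might look like this:

import subprocess

# Placeholder keys and file name; in the question these would come from vals
# and self.inputFile.
keys = ["266", "801", "810"]
pattern = r"^\(" + r"\|".join(keys) + r"\)[[:space:]]"   # one anchor for all alternatives

proc = subprocess.Popen(["grep", pattern, "input.tsv"],
                        stdout=subprocess.PIPE, universal_newlines=True)
newDict = {}
for line in proc.stdout:
    key, value = line.split()
    newDict[key] = value
proc.wait()

Passing the pattern as its own argv element keeps the basic-regex escapes (\( \| \)) intact without any shell quoting games.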

A further optimization is to unify common prefixes in the alternation: 266\|801\|810 can be replaced with the equivalent 266\|8\(01\|10\). Rewriting the 25-key regex in this way gives close to a 50% speedup on test data.
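
The answer doesn't show how to automate the prefix factoring, but one possible sketch (hypothetical helper names; it assumes no key is a prefix of another, which holds for unique IDs) is to load the keys into a character trie and emit a basic-regex alternation from it:

def build_trie(keys):
    # One nested dict per character; an empty string marks the end of a key.
    trie = {}
    for key in keys:
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_bre(node):
    # Emit a basic-regex alternation with shared prefixes factored out.
    branches = []
    for ch, child in sorted(node.items()):
        branches.append(ch + (trie_to_bre(child) if child else ""))
    if len(branches) == 1:
        return branches[0]
    return r"\(" + r"\|".join(branches) + r"\)"

keys = ["266", "801", "810"]
pattern = "^" + trie_to_bre(build_trie(keys)) + "[[:space:]]"
print(pattern)   # ^\(266\|8\(01\|10\)\)[[:space:]]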

At this point grep is starting to show its limits. It seems that it's CPU-bound: iostat reveals that each successive improvement in the regex increases the number of IO requests per second while grep is running. And re-running grep with a warmed page cache and the --mmap option doesn't speed things up (as it likely would if file IO were a bottleneck). Greater speed therefore probably requires a utility with a faster regex engine.

One such is ag (source here), whose regex implementation also performs automatic optimization, so you needn't do much hand-tuning. While I haven't been able to get grep to process the test data in less than ~12s on my machine, ag finishes in ~0.5s for all of the regex variants described above.
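
As a syntax aside (an illustration, not part of the answer): ag takes PCRE-style patterns, so the grouped pattern loses its backslash escapes and [[:space:]] becomes \s. Python's re syntax is close enough to sanity-check the translated pattern before pointing ag at the full file:

import re

# PCRE-style equivalent of ^\(266\|801\|810\)[[:space:]]
pattern = re.compile(r"^(266|801|810)\s")

for sample in ["266\t9913", "801\t9646", "2660\t1"]:
    print(repr(sample), bool(pattern.match(sample)))
# '266\t9913' True, '801\t9646' True, '2660\t1' False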

Alp
  • What does your ag command look like? I installed ag, but when using exactly the same regex as in grep, ag finds nothing while grep does... I used grep "^\(266\|801\|810\)[[:space:]]" file.txt and: ag "^\(266\|801\|810\)[[:space:]]" file.txt – Jetse Mar 18 '14 at 16:14
  • 2
    `ag` uses PCRE syntax for regular expressions, so you shouldn't escape the alternation operator (the `|`). Also, I'm not sure if it supports named character classes; you might need to replace `[[:space:]]` with `\s`. Btw, for `grep` you need to escape the parentheses as well. – Alp Mar 18 '14 at 16:28
0

This is not terribly memory efficient (for a file of 300 million lines, that may be a problem). I can't think of a way to save the not-found values within a comprehension except by saving all the values (or reading the file twice). I don't think threads will help much, since file I/O is likely to be the performance bottleneck. I'm assuming that a tab is the delimiting character, as the question and example data indicate.

vals = [1, 2, 594592888]
vals_set = set(vals)  # set membership test instead of scanning the list

with open(self.inputfile, 'r') as i_file:
    # Load the whole file into one dict (this is the memory-hungry part)
    all_vals = {
        int(t[0]): int(t[1])
        for t in (
                line.strip().split('\t')
                for line in i_file
        )
    }

newDict = {
    t[0]: t[1] for t in filter(lambda t: t[0] in vals_set, all_vals.items())
}

notFound = list(vals_set.difference(newDict.keys()))
mojo
  • IO bottlenecks are precisely the sort of situation where threading does help in Python. – Alp Mar 18 '14 at 14:05
  • This method is indeed memory inefficient (over 30G), but that caused no problems. Still, grep is much faster... – Jetse Mar 18 '14 at 14:47
0

If I understand you correctly, you don't want any file row that doesn't match your vals. Since you're talking about a huge file and a much smaller number of wanted values, I would go for something like:

vals_set = set(str(v) for v in vals)  # keys read from the file are strings, so compare as strings
found_vals = {}

with open(inputfile, "r") as in_file:
    for line in in_file:
        line = line.split()             # assuming tabs or whitespace as the separator
        if line and line[0] in vals_set:
            found_vals[line[0]] = line[1]

not_found_vals = vals_set.difference(found_vals)

It will be memory conservative, and you'll have your dict in found_vals and your missing values in not_found_vals. In fact, memory usage, AFAIK, will depend only on the number of vals you search for, not on the size of the file.

EDIT:

I think the easiest way to parallelize this task is to split the file and search each piece with a different process. That way you don't need to deal with communication between threads (easier and faster, I think).

A good way to do it, since I deduce you're using Bash (you used grep :P ), is what is mentioned in this answer:

split -l 1000000 filename

will generate files with 1000000 lines each.

You could easily modify your script to save its matches into a new file for each process, and then merge the different output files.
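
A rough sketch of that split-and-merge idea (an illustration, not the answer author's code; the chunk glob and key list are placeholders), using multiprocessing rather than separate shell jobs:

import glob
from multiprocessing import Pool

# Placeholder keys; in practice these come from vals. Compared as strings,
# since split() yields strings.
vals_set = set(str(v) for v in [1, 2, 594592888])

def scan_chunk(path):
    """Return {key: value} for the wanted keys found in one chunk file."""
    hits = {}
    with open(path) as chunk:
        for line in chunk:
            fields = line.split()
            if fields and fields[0] in vals_set:
                hits[fields[0]] = fields[1]
    return hits

if __name__ == "__main__":
    chunks = sorted(glob.glob("x??"))     # default names produced by `split -l 1000000`
    pool = Pool(processes=20)             # the question allows up to 20 workers
    partials = pool.map(scan_chunk, chunks)
    pool.close()
    pool.join()

    found_vals = {}
    for part in partials:                 # merge the per-chunk dictionaries
        found_vals.update(part)
    not_found_vals = vals_set.difference(found_vals)

Each worker reads its own chunk, so the processes don't coordinate at all; whether this actually beats a single grep pass depends on how CPU-bound the matching really is.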

mgab
  • I see... uhm... I'll look into it and edit my answer. However, I already edited it suggesting an easy way to use your multiple CPUs – mgab Mar 18 '14 at 14:57