I'm trying to compare two large files: b.txt contains over 20 million lines, a.txt contains 50,000 lines. Comparing them the traditional way takes too long. For example, the following code didn't finish in 5 hours:

b = [line.rstrip('\n') for line in open("b.txt")]
a = [line.rstrip('\n') for line in open("a.txt")]
for i in a:
    for j in b:
        if i in j:
            print(j)

I saw the following suggestion for a similar question on Stack Overflow:

with open('b.txt') as b:
    blines = set(b)
    # blines[0] raises TypeError: a set doesn't support indexing
with open('a.txt') as a:
    for line in a:
        if line in blines:
            print(line)

The set lookup is very fast and the code terminates in seconds. However, I need to find the exact string in blines that contains the line variable. Since a set is not accessible by index, I couldn't achieve that. Is there a way to find that matching string, or do you have any other suggestions to make this comparison faster than the first code?

Robert Zunr

3 Answers

From your comment, you say you need to handle (slightly edited for clarity):

line = 'abc' with blines = {'abcd', 'fg'}; if line in blines returns True, but I need the 'abcd' string, or its index, to find it

What you're doing can't properly be done with a set without combinatoric explosion. You want to match arbitrary substrings, not just whole lines or words, which means blines would need to contain every substring of every line for the lookup to succeed (so instead of just 'abcd' you'd also need to store 'a', 'b', 'c', 'd', 'ab', 'bc', 'cd', 'abc', and 'bcd', and that's just for one short line, not 20M longer lines).
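
To make that blow-up concrete, here is a minimal sketch (a toy illustration, not part of any solution): a line of length k has k*(k+1)/2 non-empty substrings, so the storage grows quadratically per line.

# enumerate every non-empty substring of one short line
line = 'abcd'
subs = {line[i:j] for i in range(len(line)) for j in range(i + 1, len(line) + 1)}
print(len(subs))     # 10
print(sorted(subs))  # ['a', 'ab', 'abc', 'abcd', 'b', 'bc', 'bcd', 'c', 'cd', 'd']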

A better solution is a data structure that can find all your target words in a given string without suffering combinatoric explosion, e.g. Aho-Corasick, for which a Python package, pyahocorasick, already exists and implements it efficiently.

Instead of loading all 20 million lines of b (and who knows how many substrings of each line) into memory, just build an automaton from the 50,000 strings in a, then check each line of b against that automaton:

import ahocorasick

auto = ahocorasick.Automaton()
with open("a.txt") as needles:
    for needle in needles:
        needle = needle.rstrip('\n')
        auto.add_word(needle, needle)  # associate each needle with itself
auto.make_automaton()

with open("b.txt") as haystacks:
    for lineno, line in enumerate(haystacks):
        for end_ind, found in auto.iter(line):
            print(f"Found {found!r} on line {lineno} at index {end_ind - len(found) + 1}: {line!r}")

That builds a single Aho-Corasick automaton in O(n) time (relative to the size of a.txt), then scans in O(n) time (relative to the size of b.txt). Memory usage is roughly proportional to the size of a.txt (in prior tests with 50,000 random needles 3-12 characters long, the automaton used roughly 10-20 MB), and there is no combinatoric explosion like a set of all substrings would suffer. It will find all elements of a wherever they occur in b (even in the middle of a word, as in your example of needing to find abc in abcd) without additional memory overhead.
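
As a quick sanity check of that claim (toy strings rather than the real files), the automaton does report matches that start mid-word:

import ahocorasick

auto = ahocorasick.Automaton()
auto.add_word('abc', 'abc')  # the needle from the question's example
auto.make_automaton()

# 'abc' occurs inside 'xxabcdxx' even though it isn't a whole word
for end_ind, found in auto.iter('xxabcdxx'):
    print(found, end_ind - len(found) + 1)  # -> abc 2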

If you need to know the line number in a.txt, not just the line number in b.txt, just change the automaton building process to store the line number as well as the needle itself (you can associate anything with each word added, so a tuple is just as good as a str):

for lineno, needle in enumerate(needles):
    needle = needle.rstrip('\n')
    auto.add_word(needle, (lineno, needle))

then iterate later with:

for blineno, bline in enumerate(haystacks):
    for end_ind, (alineno, found) in auto.iter(bline):

adjusting the output as desired.
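
Putting the pieces together, the adjusted scan might look like this (the print format here is just one possibility):

with open("b.txt") as haystacks:
    for blineno, bline in enumerate(haystacks):
        for end_ind, (alineno, found) in auto.iter(bline):
            print(f"a.txt line {alineno} ({found!r}) matched "
                  f"b.txt line {blineno}: {bline!r}")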

ShadowRanger
  • This seems like a correct answer. TIL about `pyahocorasick` module. Thanks, upvoted! – Andrej Kesely Jul 22 '19 at 18:37
  • @AndrejKesely: Yeah, this problem is similar to the one I answered years ago with [a similar answer](https://stackoverflow.com/a/34820772/364696). Only reason I didn't mark it as a duplicate is that I needed to address the OP's mistaken impression that a `set` (or `dict`, or any simple related structure) was a viable approach. `pyahocorasick` is a niche tool, but when it's useful, it's *really* useful, with performance scaling sublinearly with the number of words to search for (so searching for 10x as many strings increases runtime by less than 2x), and memory scaling (roughly) linearly. – ShadowRanger Jul 22 '19 at 18:43

If lines in b.txt were unique, you could use a lookup table to find the position of any given line. (EDIT: lines don't actually have to be unique when using a defaultdict. Thanks @ShadowRanger.)

This means you can create a dictionary with the line as the key and its line number(s) as the value, then test line in blookup to check whether a line from a.txt exists in b.txt.

This should give you similar performance, since both set and dict use hash lookups, but at the expense of extra memory.

That would look something like this:

from io import StringIO
from collections import defaultdict

b = StringIO('''321
543
654
123
123
234''')

a = StringIO('''123
234
345''')

with b:
    lines = (line.rstrip('\n') for line in b)
    blookup = defaultdict(set)
    for i, line in enumerate(lines):
        blookup[line].add(i)

with a:    
    for line in a:
        line = line.rstrip('\n')
        if line in blookup:
            line_nos = blookup[line]
            print(line, line_nos)

output:

123 {3, 4}
234 {5}

Note: Keep in mind this approach only finds exact line matches; it does not search for substrings.
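
For reference, applied to the actual files instead of the StringIO demo (file names taken from the question), the same approach might look like:

from collections import defaultdict

# map each line of b.txt to the set of line numbers where it appears
blookup = defaultdict(set)
with open('b.txt') as b:
    for i, line in enumerate(b):
        blookup[line.rstrip('\n')].add(i)

with open('a.txt') as a:
    for line in a:
        line = line.rstrip('\n')
        if line in blookup:
            print(line, blookup[line])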

abdusco
  • The lines needn't be unique if the `dict` maps each line to a `list` of line numbers. `collections.defaultdict(list)` makes this easy to build. – ShadowRanger Jul 22 '19 at 18:05
  • Good idea, let me take a look at it. Edit: Done – abdusco Jul 22 '19 at 18:07
  • Thank you, that works, but can I also print the string that 'blookup[line]' holds? – Robert Zunr Jul 22 '19 at 18:20
  • It's the `line` variable? You already know it by definition if you're matching a line in a to a line in b. – abdusco Jul 22 '19 at 18:21
  • I think I explained my question badly. Actually I'm trying to find lines in b which 'contain' lines from a. – Robert Zunr Jul 22 '19 at 18:25
  • Hmm. On second thought, @ShadowRanger's answer touches on an important point. You seem to want substring lookups between two sets of lines. This answer will give you exact matches. – abdusco Jul 22 '19 at 18:33

Note: Keep in mind this approach considers only exact matches; it does not search for arbitrary substrings. For substring searching, check @ShadowRanger's answer.

This script first builds an index from the large file (word → {set of line numbers where the word occurs in the large file}):

data_large_file = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.'''

data_search = '''
ad minim veniam
non proident
non NOTFOUND proident'''

data = [d.rstrip() for d in data_large_file.splitlines() if d.rstrip()]
data_s = [d.rstrip() for d in data_search.splitlines() if d.rstrip()]

# build index
from collections import defaultdict
index = defaultdict(set)
for row_number, row in enumerate(data):
    for word in row.split():
        index[word].add(row_number)

# search: only check lines that contain the needle's first word
for line in data_s:
    first_word = line.split(maxsplit=1)[0]
    for line_num in index[first_word]:
        if line in data[line_num]:
            print(data[line_num])

Prints:

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia

Andrej Kesely
  • You should probably replace `defaultdict(list)` with `defaultdict(set)` – abdusco Jul 22 '19 at 18:00
  • @abdusco: Why? Line numbers are naturally unique (so `set` isn't helping with deduping), and no search for specific line numbers is being performed (so there are no containment tests where `set`'s `O(1)` membership test beats `list`'s `O(n)`), so using a `set` just means using significantly more memory, and losing ordering. – ShadowRanger Jul 22 '19 at 18:06
  • @ShadowRanger But in each line we can have duplicate words, that means we end with duplicate line numbers for that row. – Andrej Kesely Jul 22 '19 at 18:08
  • @AndrejKesely: The OP's code seems to be line oriented, not word oriented, but sure, if they misrepresented the problem the `set` might help a little. – ShadowRanger Jul 22 '19 at 18:10
  • Thank you for your answer, but it's not giving the correct result for me. It prints the string that I'm searching for, not the matched string. For example, when I change the last line to 'print(line, data[line_num])' it prints the same string twice. – Robert Zunr Jul 22 '19 at 18:12
  • I think I explained my question badly. Actually I'm trying to find lines in b.txt which 'contain' lines from a.txt. – Robert Zunr Jul 22 '19 at 18:25