
I have a large text file (5 million to 10 million lines). Each line has 10 to 5,000 alphanumeric items which are separated from each other by a space or comma. Each line ends with a linebreak. The number of lines is known before runtime and the text file doesn't change during runtime.

Each time the code is run, it will be passed 50-200 search lists, each of which contains 2-10 items (words). For each of these search lists, I'd like to find x lines in the text file that contain at least one item from that list. The number x ranges from 5 to 10 and is set at runtime. The match is case-insensitive and must be exact on a word boundary (e.g., "foo" matches ",foo " but not " foobar ").
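In regex terms, the match I want behaves like a case-insensitive, word-boundary pattern, e.g. (illustrative snippet):

import re
pattern = re.compile(r'\bfoo\b', re.IGNORECASE)
pattern.search("bar,foo baz")    # matches: "foo" is bounded by a comma and a space
pattern.search("bar foobar baz") # no match: "foo" only appears inside "foobar"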

Each search list has one of three search order strategies:

  • Normal. Start at the first line and search line by line, in consecutive order, until it finds x matching lines or reaches the last line.

  • Random. Pick lines at random from the text file. Keep going until it finds x matching lines or has searched every line.

  • Bucket range. First search the highest-priority range of lines, then the next-highest-priority range, then the next, etc. For example, the search-range priority might be to first look through lines 1,000,000 to 1,499,999, then lines 0 to 999,999, then lines 1,500,000 to 2,199,999, etc. There can be between 3 and 20 buckets, each covering a range of 100,000 to 5,000,000 lines. The number of buckets and their line-number ranges are given at runtime (one possible representation is sketched just below). Within each range, search consecutively. Search until it finds x matching lines or reaches the end of the last bucket.
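For illustration, the runtime bucket specification could be represented as a priority-ordered list of (first_line, last_line) pairs (hypothetical layout, using the example ranges above):

bucketRanges = [
    (1000000, 1499999),  # highest-priority range, searched first
    (0, 999999),         # searched second
    (1500000, 2199999),  # searched third
]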

Here's what I wrote for the "normal search" [this test code checks all lines to the end of the file without stopping after x lines; in the final version, after it finds x matches for a search list, it'll stop and move on to the next search list]:

import re
textFile = "bigTextFile.txt"
searchItems = ["foo","bar","anotherfoo","anotherbar","etc"]
searchStrategy = "normal" #could be "normal", "random", "bucket"; this code is just for "normal"
results = []
with open(textFile) as inFile:
    for line in inFile:
        regex = re.compile('(' + '|'.join('\\b'+item+'\\b' for item in searchItems) +')', re.IGNORECASE)
        match = regex.search(line)  
        if match is not None:
            analysisResult = analyze_the_match(line)
            results.append(analysisResult)

This code for the "normal" search works. Of the approaches I've tried, it seems to be the fastest, but I'm new to Python and I'd guess there must be a better (faster, more efficient) way to do this.
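For reference, the final early-stopping version of the "normal" search would look roughly like this (x is the number of lines to find; same placeholders as above):

import re
x = 5  # e.g.; the real value is set at runtime
regex = re.compile('(' + '|'.join('\\b' + item + '\\b' for item in searchItems) + ')', re.IGNORECASE)  # compiled once
results = []
with open(textFile) as inFile:
    for line in inFile:
        if regex.search(line) is not None:
            results.append(analyze_the_match(line))
            if len(results) >= x:
                break  # found x lines for this search list; move on to the next list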

[Update in response to comments, to better explain the reasons for the different search strategies] The items are highly skewed. Playing with the data, I've found that about half the searches will complete before 10,000 lines, 40% could take a few million lines to find enough matches, and 10% won't find enough matches in the whole text file. The items in each line have no relation to the line they're in, but ranges of lines are similar (i.e., lines 100,000-500,000 are related, lines 1,500,000-1,750,000 are related, etc.). The code could be run many times for a given search list, and for each run the priority might be to focus on a different range of lines. If the whole text file has only x lines containing the items in a particular search list, then the result will always be those x lines. But for many search lists there are 2x, 10x, or 100,000x lines which match, and at different times I'd like to pick different lines. At certain times a particular range might be the priority, at other times random sampling is best, and sometimes just finding the first x lines from the beginning is fine; hence the "normal", "random", and "bucket" search strategies.

I'd really appreciate any ideas for "random" and "bucket", as I'm not sure how best to approach them efficiently. I've played around with linecache, itertools.islice, and readlines (per @Martin Thoma in this answer, readlines is surprisingly fast), as well as with modifying the code above, but my attempts at "random" and "bucket" are all clunky and inefficient, and I know that I don't know enough to know what could be best :).

Any suggestions? Thanks.

LunaiThi
  • What relation does the data have to line/position? If it is entirely random (or front-skewed), why *not* always start from the beginning of the file and read - in normal iterable lines, Python will buffer the system IO calls and an iterable file reader will only consume as much as needed - until the condition is met? – user2864740 Apr 20 '16 at 04:05
  • I'm not sure what you're looking for, sorry. It seems like you already have a fair working solution that meets your needs. Are you looking for help implementing random and bucket range search (in which case it seems you already are aware of an impressive arsenal of tools for the same, and are trying to pick the best tool) or on optimising your existing code (in which case, remember: premature optimisation is the root of all evil)? – Akshat Mahajan Apr 20 '16 at 04:07
  • @user2864740 Thanks. I see your point to *always* search beginning to end, but in this case, I think I have to use diff ordering strategies. The items in each line have no relation to the line they're in, but ranges of lines are similar (i.e. lines 100,000-500,000 are related, 1,500,000-1,750,000 are related, etc). So, quick reasons why need different strategies: a) each list has different characteristics and (might) need to prioritize different range than other lists, b) same list could be run many times and needs to prioritize different range and (if possible) return diff results each time – LunaiThi Apr 20 '16 at 04:28
  • @AkshatMahajan Sorry not clear. Question: how to implement random and bucket search? Normal search works, but not sure if it's good solution. Random and bucket search, I don't know how to even approach it efficiently. The range of tools I found might be "impressive", but I'm not lol and I don't know how to create an efficient solution. – LunaiThi Apr 20 '16 at 04:33
  • Shouldn't you *index* the file at the beginning and then use that index of words to do efficient lookups for each search term? – John Zwinck Apr 20 '16 at 04:53
  • Any reason why this flat file is not loaded into some columnar data structure where random reads or searches would be a lot more efficient than just looping over a text document? – OneCricketeer Apr 20 '16 at 04:56
  • Are you asking how to implement the algorithm, or you have an implementation but it's not efficient and you want help improving it? – Burhan Khalid Apr 20 '16 at 05:03
  • This isn't really about regex, so why the tag? That's however how I ended up here, and I've got a suggestion: I imagine with that amount of data that performance could be an issue, so change the RE to have the check for word boundaries outside the group. I.e. `re.compile('\\b(' + '|'.join(searchItems) + ')\\b', re.IGNORECASE)`. That'll improve the RE performance (**and** get rid of some string concatenation from the loop ;) – SamWhan Apr 20 '16 at 11:20
  • @John: index and columnar data structure? Sorry, would that mean put the whole text file into sqlite3, build out an index of items<->lines, then use that for searching? Once built, it'd definitely be more efficient. But the text file is different each time, so it'd have to be built every run. Wouldn't that end up taking more time than it saves? Is there a very efficient way to do that somehow? – LunaiThi Apr 21 '16 at 03:10
  • @cricket_007: columnar data, you mean a database? See note I made above re: John's comment. I'd think it would take longer to index it than to just go through it, but let me know what you think would be good. Thanks :) – LunaiThi Apr 21 '16 at 03:13
  • Given that you need to search each corpus 50-200 times, it will probably be much faster to scan the corpus once, index it, then search in the index. Yes, you could use SQLite, or you could write some code for it yourself. Either way should be much faster than repeated linear search. – John Zwinck Apr 21 '16 at 03:13
  • @ClasG Sorry for dragging you here, but thanks a lot for the regex suggestion. That's a great improvement! – LunaiThi Apr 21 '16 at 03:13
  • @JohnZwinck I tried several indexing strategies, in sqlite as well as my own code. So far, it seems slower. Turns out around 80% of searches can finish quickly, 15% take a few million lines and only 5% do the full search. In most cases, the benefit from indexing (at least how I've tried it) hasn't been enough to offset its cost. But it's a gamble, because in those few cases, the no-index searches are long and indexing would have been better. I can't see any way to know in advance which strategy to take for any given set of searches (that's the whole point of the search, after all). – LunaiThi Apr 29 '16 at 02:18
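(For reference, the kind of word-to-line-number index discussed in these comments can be sketched as below; the names are illustrative only, and this isn't the exact code I benchmarked.)

import re
from collections import defaultdict

def build_word_index(path):
    # Map each lowercased word to the list of line numbers it appears on.
    index = defaultdict(list)
    splitter = re.compile(r'[ ,]+')
    with open(path) as f:
        for lineNumber, line in enumerate(f):
            for word in splitter.split(line.strip()):
                if word:
                    index[word.lower()].append(lineNumber)
    return index

# Lookup: all line numbers containing at least one item from a search list.
# index = build_word_index(textFile)
# matchingLines = sorted(set().union(*(index.get(item.lower(), []) for item in searchItems)))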

1 Answer


For both the random and bucket searches, you can do a linear scan of the file and build a list of candidate results, replacing a candidate from the list if the list is full and a better candidate comes along.

For random selection, you calculate the odds that the current matching line should be in the list and use a random number to decide whether it replaces an existing entry; this is standard reservoir sampling, so every matching line ends up in the final list with equal probability.

For bucket selection, each candidate is stored together with its bucket rank and the candidate list is kept sorted by rank, so the x best-ranked candidates can be taken from the front when the scan ends.

For random selection:

import random
candidates = []
n = 0
with open(textFile) as inFile:
    for line in inFile:
        if valid_candidate(line): # use regex here
            n += 1
            if n <= x:
                candidates.append(line)
            else:
                # reservoir sampling: the nth match enters the list with
                # probability x/n, replacing a random existing entry
                i = random.randrange(n)
                if i < x:
                    candidates[i] = line
results = []
for line in candidates:
    analysisResult = analyze_the_match(line)
    results.append(analysisResult)

For bucket selection:

import bisect
candidates = []
n = 0
with open(textFile) as inFile:
    for line in inFile:
        n += 1
        if valid_candidate(line): # use regex here
            rank = get_rank(n) # use n to determine the bucket and assign a rank, lower numbers are higher rank
            bisect.insort(candidates, (rank, n, line))  # keep candidates sorted by (rank, line number)
results = []
for rank, n, line in candidates[:x]:
    analysisResult = analyze_the_match(line)
    results.append(analysisResult)
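get_rank is left as a placeholder above; a minimal sketch, assuming the buckets arrive as a priority-ordered list of (first_line, last_line) ranges and that the argument is the n counted in the scan, might be:

bucketRanges = [(1000000, 1499999), (0, 999999), (1500000, 2199999)]  # hypothetical runtime input, highest priority first

def get_rank(lineNumber):
    # Rank 0 is the highest-priority bucket; lower ranks sort earlier in candidates.
    for rank, (start, end) in enumerate(bucketRanges):
        if start <= lineNumber <= end:
            return rank
    # Assumption: lines outside every bucket get the worst rank, so they are
    # only used if the buckets don't yield enough matches.
    return len(bucketRanges)

Because the tuples are (rank, n, line), bisect.insort keeps candidates sorted first by bucket priority and then by line number, so candidates[:x] are the x highest-priority matching lines.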
Mark Ransom
  • Seems good so far. I'm testing it against indexes and other things, it takes a while for all the tests to run. I'll come back to mark the answer. Thanks. – LunaiThi Apr 29 '16 at 02:19