0

I am searching for matches in a large text file, but I find it way too slow. This is the file structure:

word1   5752
word2   96332
word3   137

I am trying to match text in the first column, and I want to extract the value in the second column. The columns are separated by \t and there are about 10 million lines. The file is searched many times with different words. What method of searching has the best time efficiency?

EDIT: The file is 129 MB and will be searched thousands of times at least.

EDIT2: The file is alphabetically sorted, and words can occur multiple times only if they have different capitalization, e.g. Word, WORD, word and WOrd would all be different entries.

querti
  • How are you searching, and how are you loading the data? For example, if you are loading the whole file into memory then that could be the reason for poor performance. Alternatively, you might do better with a different algorithm: can you search for the different words on each line before reading it again? – cdarke Feb 17 '17 at 14:51
  • Depending on how many times you search the data, you could load the entire file into memory and transform it into a dictionary, although this might consume some memory. – voidpointercast Feb 17 '17 at 14:53
  • "What method of searching has the best time efficiency?" It depends: on how much memory your machine has, the length of the words, whether `word1` can have multiple instances in the file, and other things I forget to mention. All in all, I'd go with [voidpointercast](http://stackoverflow.com/users/2242806/voidpointercast)'s suggestion (which has now been turned into an [answer](http://stackoverflow.com/a/42301043/2749397)): put everything in a dictionary and TEST. – gboffi Feb 17 '17 at 14:58
  • how large is "large"? – Kelvin Feb 17 '17 at 15:03
  • When you do your tests, remember to test against `fgrep` — `fgrep` interprets PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched. – gboffi Feb 17 '17 at 15:10
  • @gboffi Good point mentioning the possibility that words on the left could occur multiple times. – voidpointercast Feb 17 '17 at 15:18

4 Answers

2
# Build a dict mapping each word to its count (kept as a string); skip blank lines.
with open('myfile.dat', 'r') as src:
    mapping = dict(line.strip().split('\t') for line in src if line.strip())

Depending on the size of the file and the available memory, this could be a solution, especially if you have to perform this kind of lookup more than once during a run of your program.
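If the file is loaded once and then queried thousands of times, each query is an average constant-time dictionary lookup. A minimal sketch of the query side (assuming the `mapping` built above; `queries` is just a hypothetical list of words to look up):

queries = ['word2', 'word3', 'nonexistent']
for word in queries:
    value = mapping.get(word)      # None if the word is not in the file
    if value is not None:
        print(word, int(value))    # values were read in as strings
    else:
        print(word, 'not found')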

  • The file is 129 MB and will be searched thousands to tens of thousands of times, so I guess memory requirements are not that high; I am more interested in time efficiency. – querti Feb 17 '17 at 15:20
  • Obtaining a value from a dictionary key should be quite efficient. I guess they are implemented using binary trees, but I don't know for sure. – voidpointercast Feb 17 '17 at 15:22
2

If you store your data in a hash table (a Python dictionary), this operation is very quick. Your 'key' is the name, and each key has a 'value', the number. The code shown below uses the hash table for fast data retrieval:

yourDict = {'name0': number0, 'name1': number1, ..., 'nameN': numberN}
if 'checkName' in yourDict:
    # It exists!
    theNumber = yourDict['checkName']
else:
    # It doesn't exist :/
    theNumber = None

Note: if you use:

if 'checkName' in yourDict.keys():

you are (on Python 2) actually creating a list of keys and then searching through them. This operation doesn't take advantage of the hash table and is much slower. On Python 3, .keys() returns a view, so the penalty is smaller, but the plain `in yourDict` check is still the idiomatic and fastest form.
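To test the difference on your own interpreter, a small timeit sketch with a hypothetical million-entry dictionary (on Python 2 the `.keys()` version is dramatically slower because it builds a list; on Python 3 both are fast):

import timeit

setup = "d = {str(i): i for i in range(1000000)}"
# Membership test against the dict itself vs. against .keys()
print(timeit.timeit("'500000' in d", setup=setup, number=100000))
print(timeit.timeit("'500000' in d.keys()", setup=setup, number=100000))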

This is a bit on how hash table data structures work: https://www.youtube.com/watch?v=MfhjkfocRR0

This is an answer showing that a dictionary in Python acts like a hash table: Is a Python dictionary an example of a hash table?

Gil Elbaz
  • How about: theNumber = yourDict.get('checkName')? theNumber would be None if the key does not exist. Any insight regarding performance? – voidpointercast Feb 17 '17 at 15:26
  • I just looked it up and it looks like you're right! It seems a bit counter-intuitive to me, but this guy tested it on extremely large files and got a significant improvement. So the code should be: yourDict.get('checkName') https://partofthething.com/thoughts/?p=513 – Gil Elbaz Feb 17 '17 at 16:12
1

Is this for an assignment or for work/a project? I don't know how people feel about re-implementing core algorithms, but how big is your text file?

An alternative approach using Pandas for ease of use and underlying optimization:

In [61]: df = pd.read_csv(r'C:\temp\data.txt', header=None, sep='\t')

In [62]: df
Out[62]:
       0      1
0  word1   5752
1  word2  96332
2  word3    137

In [63]: df[df[0] == 'word2']
Out[63]:
       0      1
1  word2  96332

In [64]: df[df[0] == 'word2'][1]
Out[64]:
1    96332
Name: 1, dtype: int64

2 questions for you:

1) Can this be held in memory instead of being re-loaded every time? (maybe with a TTL of like an hour?)

2) Is your file sorted? I believe binary search needs sorted data first. What would the performance impact be of sorting every time you have to read the data?

Kelvin
  • 1. It can be held in memory, sure 2. It is sorted – querti Feb 17 '17 at 15:14
  • If you can keep it in memory, how you read it in becomes of less importance. I think your best performance will come from leaving it in a dataframe and just querying it. – Kelvin Feb 17 '17 at 15:19
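Following up on the comment above about leaving the data in a DataFrame and just querying it, here is a minimal sketch (assuming pandas, a tab-separated file at a hypothetical path 'data.txt', and one row per distinct word as the question's edit states). Indexing by the word column makes repeated lookups much cheaper than scanning with a boolean mask each time:

import pandas as pd

# Load once, keep in memory, and index by the word column for fast lookups.
df = pd.read_csv('data.txt', header=None, sep='\t', names=['word', 'value'])
lookup = df.set_index('word')['value']

print(lookup.loc['word2'])    # 96332 for the sample data
print(lookup.get('missing'))  # None if the word is not present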
0

I would first sort the file alphabetically and then perform a logarithmic search (https://en.wikipedia.org/wiki/Binary_search_algorithm). There is a nice example of how to do it with Python here: http://programarcadegames.com/index.php?chapter=searching&lang=en#section_16.5
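A minimal sketch of that idea using the standard-library bisect module (assuming a hypothetical file name 'myfile.dat' and that the file's sort order matches Python's string ordering; otherwise sort the list in memory first):

import bisect

# Read the sorted file once into parallel lists of words and values.
words, values = [], []
with open('myfile.dat') as src:
    for line in src:
        if line.strip():
            word, value = line.rstrip('\n').split('\t')
            words.append(word)
            values.append(int(value))

def lookup(word):
    i = bisect.bisect_left(words, word)   # O(log n) binary search
    if i < len(words) and words[i] == word:
        return values[i]
    return None

print(lookup('word2'))   # 96332 for the sample data, None if absent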

Numlet