Simplify python code for txt searching

Question

I am a beginner at Python and I need to check for the presence of a given set of strings in a huge txt file. I've written this code so far, and it runs with no problems on a light subsample of my database. The problem is that it takes more than 10 hours to search through the whole database, and I'm looking for a way to speed up the process.

The code so far reads a list of strings from a txt file I've put together (list.txt) and searches for every item in every line of the database (hugedataset.txt). My final output should be a list of the items that are present in the database (or, alternatively, a list of the items that are NOT present). I bet there is a more efficient way to do this, though...

Thank you for your support!

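present = []

# Read the strings to look for, one per line
with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]

print(list1)

# Check every title against every line of the database
with open('hugedataset.txt') as fobj_in:
    for l in fobj_in:
        for title in list1:
            if title in l:
                print(title)
                present.append(title)

# Remove duplicates (named "found" so the built-in set is not shadowed)
found = set(present)
print(found)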

Do you need any per-line information? If all you need to know is whether each item is there or not, couldn't you search the whole database as a single string rather than breaking it up line-by-line? — RichieHindle, Jun 20 '13 at 12:33
No, actually I don't need any per-line information; all I need to know is whether, and which, strings are already there. — user2447387, Jun 20 '13 at 12:35
Since you only need to know which words are present, don't keep a list, `append` to it, and then convert it to a set at the end. That means you have to keep track of a (potentially huge) list, wasting memory. Instead, keep a `set` and `add` to it. — svk, Jun 20 '13 at 12:38
You should also eliminate any extraneous console output in the middle of a tight loop. If there's an actual terminal displaying each title, you could see serious performance issues. — Travis Parks, Jun 20 '13 at 12:39
I should also point out that since you are only looking for the presence of a title, once you find it the first time, you never need to look for it again... — Travis Parks, Jun 20 '13 at 12:41
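
Putting the comments above together, a minimal sketch could look like the one below. It keeps a `set` instead of a list, prints nothing inside the loop, and stops looking for a title once it has been found. The function name `find_present_streaming` is made up for illustration, and the sketch assumes that no title spans more than one line and that blank entries in list.txt should be ignored.

```python
def find_present_streaming(titles, lines):
    """Return the set of titles that occur in at least one of the lines."""
    # Titles still to be found; empty strings are dropped because
    # '' is a substring of every line.
    remaining = set(t for t in titles if t)
    present = set()
    for line in lines:
        if not remaining:
            break  # everything has been found, no need to read further
        hits = set(t for t in remaining if t in line)
        present |= hits
        remaining -= hits  # never search again for a title already found
    return present


# Possible usage with the files from the question:
# with open('list.txt') as f:
#     list1 = [line.strip() for line in f]
# with open('hugedataset.txt') as db:
#     print(find_present_streaming(list1, db))
```

This still reads the file line by line, so it does not need the whole database in memory, and the inner loop gets cheaper as more titles are found.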

score 2 · Accepted Answer · answered Jun 20 '13 at 12:36

Since you don't need any per-line information, you can search the whole thing in one go for each string:

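# Read the whole database into a single string
with open('hugedataset.txt') as f:
    data = f.read()  # Assuming it fits in memory
present = []  # As @svk points out, you could make this a set

with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]

print(list1)

# One substring search per title over the whole database
for title in list1:
    if title in data:
        print(title)
        present.append(title)

# Named "found" so the built-in set is not shadowed
found = set(present)
print(found)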

answered Jun 20 '13 at 12:36

RichieHindle

@user2447387 Does your huge data set fit entirely in memory? – Travis Parks Jun 20 '13 at 12:54
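
If the file turns out not to fit comfortably in memory, one possible variation (not part of the answer above, just a sketch) is to memory-map it with the standard `mmap` module and let the operating system page the data in as needed. The function name `find_present_mmap` and the UTF-8 default are assumptions; note also that `mmap` raises `ValueError` on an empty file.

```python
import mmap


def find_present_mmap(titles, path, encoding="utf-8"):
    """Return the set of titles found anywhere in the file at `path`."""
    present = set()
    with open(path, "rb") as f:
        # Length 0 maps the whole file; it is searched as raw bytes.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for title in titles:
                # Skip empty titles; find() returns -1 when there is no match.
                if title and mm.find(title.encode(encoding)) != -1:
                    present.add(title)
        finally:
            mm.close()
    return present
```

As with reading the whole file, this does one scan per title, so it addresses the memory concern rather than the number of passes.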

score 1 · Answer 2 · edited May 23 '17 at 12:13

You could use a regexp to check for all the substrings in a single pass. See, for example, this answer: Check to ensure a string does not contain multiple values
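
A rough sketch of what that could look like, assuming the data is already available as one string: the titles are escaped with `re.escape` and joined into a single alternation. The function name `find_present_regex` is made up for illustration. One caveat is that `finditer` only reports non-overlapping matches, so a title that only ever appears overlapping another one can be missed by the regex pass; the sketch falls back to a plain `in` check for whatever is left.

```python
import re


def find_present_regex(titles, data):
    """Return the set of titles that occur somewhere in data."""
    titles = set(t for t in titles if t)  # drop duplicates and empty strings
    if not titles:
        return set()
    # Longest titles first, so that the longest candidate wins
    # when several titles match at the same position.
    ordered = sorted(titles, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # Single pass over the data collecting every matched title.
    found = set(m.group(0) for m in pattern.finditer(data))
    # Titles hidden by an overlapping match are checked individually.
    for t in titles - found:
        if t in data:
            found.add(t)
    return found
```

Whether this beats one `in` search per title probably depends on how many titles there are, so it is worth timing both on a subsample first.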

answered Jun 20 '13 at 12:36

Emanuele Paolini
