I am a beginner at Python and I need to check whether each of a given set of strings is present in a huge txt file. I've written the code below, and it runs with no problems on a small subsample of my database. The problem is that it takes more than 10 hours to search through the whole database, and I'm looking for a way to speed up the process.

The code reads a list of strings from a txt file I've put together (list.txt) and searches for every item in every line of the database (hugedataset.txt). My final output should be a list of the items that are present in the database (or, alternatively, a list of the items that are NOT present). I bet there is a more efficient way to do this, though...

Thank you for your support!

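present = []

# Read the strings to search for, one per line
with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]

print(list1)

# Test every title against every line of the database
with open('hugedataset.txt') as fobj_in:
    for l in fobj_in:
        for title in list1:
            if title in l:
                print(title)
                present.append(title)

# Remove duplicates (named so that the built-in set is not shadowed)
unique_present = set(present)
print(unique_present)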
user2447387
  • Do you need any per-line information? If all you need to know is whether each item is there or not, couldn't you search the whole database as a single string rather than breaking it up line-by-line? – RichieHindle Jun 20 '13 at 12:33
  • No, actually I don't need any per-line information; all I need to know is whether, and which, strings are already there. – user2447387 Jun 20 '13 at 12:35
  • Since you only need to know which words are present, don't keep a list, `append` to it, and then convert it to a set at the end. That means you have to keep track of a (potentially huge) list, wasting memory. Instead keep a `set` and `add` to it. – svk Jun 20 '13 at 12:38
  • You should also eliminate any extraneous console output in the middle of a tight loop. If there's an actual terminal displaying each title, you could see serious performance issues. – Travis Parks Jun 20 '13 at 12:39
  • I should also point out that since you are only looking for the presence of a title, once you find it the first time, you never need to look for it again... – Travis Parks Jun 20 '13 at 12:41
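
The suggestions in the comments above (collect results in a set, skip console output inside the loop, and stop testing a title once it has been found) can be combined roughly as follows. This is only a minimal sketch: `find_present` is a hypothetical helper name, and it takes any iterable of lines so that it can be tried without the real files.

```python
def find_present(titles, lines):
    """Return the set of titles that occur as a substring of at least one line."""
    remaining = set(titles)  # titles that have not been found yet
    present = set()          # titles found so far

    for line in lines:
        # Only test titles that are still missing
        found = {title for title in remaining if title in line}
        if found:
            present |= found
            remaining -= found
            if not remaining:
                break  # everything was found, no need to read further
    return present


# Small in-memory demonstration
sample_lines = ["alpha beta\n", "gamma\n", "delta alpha\n"]
print(find_present(["alpha", "gamma", "zeta"], sample_lines))
```

With the files from the question, the call would presumably look like `find_present(list1, fobj_in)`, and the items that are NOT present would then be `set(list1) - present`. The speed-up depends on how early the titles turn up: titles that never occur are still tested against every line.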

2 Answers

Since you don't need any per-line information, you can search the whole thing in one go for each string:

with open('hugedataset.txt') as f:
    data = f.read()  # Assuming it fits in memory

present = []  # As @svk points out, you could make this a set

with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]

print(list1)

# One substring search over the whole file per title
for title in list1:
    if title in data:
        print(title)
        present.append(title)

# Named so that the built-in set is not shadowed
unique_present = set(present)
print(unique_present)
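
If the file turns out not to fit in memory, one possible variation is to memory-map it and let the operating system page it in on demand. The sketch below is not part of the answer above; `find_present_mmap` is a hypothetical helper name, and the UTF-8 encoding is an assumption about the data. Like the whole-file search, it can also match text that spans a line break.

```python
import mmap


def find_present_mmap(titles, path, encoding="utf-8"):
    """Return the set of titles that occur anywhere in the file at `path`."""
    present = set()
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError for an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for title in titles:
                # Skip empty titles; find() returns -1 when there is no match
                if title and mm.find(title.encode(encoding)) != -1:
                    present.add(title)
    return present
```

A call such as `find_present_mmap(list1, 'hugedataset.txt')` would use the names from the question. Each title still costs one scan of the file, so this mainly addresses memory use rather than the number of passes.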
RichieHindle

You could use a regexp to check for all the substrings in a single pass. See, for example, this answer: Check to ensure a string does not contain multiple values
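
A minimal sketch of the single-pass idea, not taken from the linked answer: all titles are escaped and joined into one alternation, and every match is collected. `find_present_regex` is a hypothetical helper name. One caveat to keep in mind is that regex matches do not overlap, so if one title is contained in or overlaps another (for example `ab` and `bc` in the text `abc`), some titles can be missed; in that case the result is a lower bound and the missing titles would need a second check.

```python
import re


def find_present_regex(titles, text):
    """Return the set of titles matched in `text` by one combined pattern."""
    # Drop empty strings and duplicates; try longer titles first so that a
    # longer title wins over a shorter one starting at the same position
    ordered = sorted({t for t in titles if t}, key=len, reverse=True)
    if not ordered:
        return set()

    # re.escape makes characters such as '.' or '(' match literally
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    return {match.group(0) for match in pattern.finditer(text)}


print(find_present_regex(["foo", "bar", "baz"], "xx foo yy bar"))
```

Whether this is actually faster than repeated `in` checks depends on the number of titles and on the regex engine, so it is worth timing on a subsample first.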

Emanuele Paolini