
I have a text file that looks like this (close to 1,500,000 lines, with roughly 5–120 words per line):

This is a foo bar sentence.
What are you sure a foo bar? or a foo blah blah.
blah blah foo sheep have you any bar?
...

I want to search for lines that contain a phrase (at most 10,000 matching lines), let's say foo bar. So in Python, I wrote this:

import os
cmd = 'grep -m 10000 "' + frag + '" ' + deuroparl + " > grep.tmp"
os.system(cmd)
results = [line for line in open('grep.tmp')]

What is the "proper" way to do it without cheating with grep? Will it be faster than grep (see How does grep run so fast?)? Is there a faster way to do this?
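As an aside, the shell-out itself can be done without `os.system` and a temp file; a minimal sketch using `subprocess` (the phrase and filename arguments stand in for your `frag` and `deuroparl`):

```python
import subprocess

def grep_lines(phrase, filename, maxmatches=10000):
    """Run grep directly and return matching lines, with no shell or temp file."""
    # -F searches for a literal string, -m stops after maxmatches matching lines
    proc = subprocess.run(
        ["grep", "-F", "-m", str(maxmatches), phrase, filename],
        capture_output=True, text=True,
    )
    return proc.stdout.splitlines()
```

Passing the arguments as a list avoids quoting problems (and shell injection) that string concatenation invites.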

alvas
4 Answers

import re

with open('bla.txt') as input:
    for count, line in enumerate(input):
        if count > 10000:
            break
        if re.search('foo bar', line):
            print(line, end='')

I don't think it will be faster than grep, because grep is optimized to do exactly this task, while Python is a Swiss army knife.

In case you want to read from stdin, you can remove the `with` line and just use sys.stdin as input instead.
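Such a stdin-capable variant might look like this (a sketch; `grep_stream` and its defaults are my own names, not from the answer):

```python
import re

def grep_stream(stream, pattern='foo bar', maxlines=10000):
    """Scan at most maxlines lines of any file-like object, yielding matches."""
    regex = re.compile(pattern)
    for count, line in enumerate(stream):
        if count >= maxlines:
            break
        if regex.search(line):
            yield line
```

Calling `grep_stream(sys.stdin)` reproduces the loop above for piped input; any open file works too.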

Alfe

You can minimize memory usage by using a generator function:

import re

def matcher(filename, pattern, maxmatches):
    matches = 0
    pattern = re.compile(pattern)
    with open(filename) as fp:
        for line in fp:
            # search() finds the phrase anywhere in the line, like grep
            if pattern.search(line):
                yield line.strip()
                matches += 1
                if matches >= maxmatches:
                    break

for line in matcher('whatever.txt', 'foo bar', 10000):
    print(line)
martineau

To generalize slightly, lazy iterator tools build pipe-style processing streams that are memory efficient (Python 2's itertools.ifilter is the lazy built-in filter in Python 3):

def grepper(numbered_line):
    # filter passes each (lineno, line) pair from enumerate as one argument
    lineno, line = numbered_line
    return "foo bar" in line

result = filter(grepper, enumerate(open("yourfile.txt")))
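Filtering alone does not enforce the 10,000-match cap that grep -m provides; itertools.islice can add it lazily (a self-contained sketch with hypothetical names):

```python
from itertools import islice

def grepper(numbered_line):
    # filter passes each (lineno, line) pair from enumerate as one argument
    lineno, line = numbered_line
    return "foo bar" in line

def capped_matches(lines, maxmatches=10000):
    """Lazily yield at most maxmatches (lineno, line) pairs containing the phrase."""
    return islice(filter(grepper, enumerate(lines)), maxmatches)
```

Because islice and filter are both lazy, no more of the file is read than needed to produce the capped result.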
payne

If you are searching only for specific text (i.e. not a regex, as it appears from your title), then:

with open("fileName") as fileHandle:
    result = [line.strip() for line in fileHandle if "yourWord" in line]
             # Or use a generator expression instead
print(result)
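To also stop after the first 10,000 matches while keeping the memory benefit, the comprehension can become a generator with a counter (a sketch; `contains_word` is my own name, not from the answer):

```python
def contains_word(filename, word, maxmatches=10000):
    """Yield lines containing word, stopping after maxmatches matches."""
    found = 0
    with open(filename) as fh:
        for line in fh:
            if word in line:
                # rstrip('\n') keeps leading whitespace, unlike strip()
                yield line.rstrip('\n')
                found += 1
                if found >= maxmatches:
                    break
```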
UltraInstinct
  • What about the 10000 lines restriction? What about lines starting with spaces? What about saving memory by doing it sequentially instead of keeping everything in memory? (Consider lots of matches.) What about formatting the resulting `list` like the output of `grep` would have been? – Alfe Nov 22 '13 at 09:46
  • 1. I suggested a generator which is very memory efficient. Just read 10000 entries from the generator. 2. There won't be any problems if lines start with spaces. 3. Formatting? OP can handle it. I merely suggested a *Pythonic way* if his task is to only match whole words. – UltraInstinct Nov 22 '13 at 09:52
  • 1
    Okay, I should have been more precise. Your code does not provide any obvious means of breaking after 10000 lines, although this was an explicit requirement. Also, reading 10000 values from the generator would not meet that requirement; the 10000 belongs to the input, not the output. Your `strip` also removes spaces (etc.) at the beginning of lines (and in this differs from the `grep` output). Concerning formatting I agree. A Pythonic way was asked for, this is more Pythonic than a stream. – Alfe Nov 22 '13 at 10:04