I have a text corpus of 11 files, each with about 190000 lines. I have 10 strings, one or more of which may appear in each line of the corpus.

When I encounter any of the 10 strings in a line, I need to record which string(s) appeared in that line. The brute-force way of looping through the regexes for every line and marking each match is taking a long time. Is there a more efficient way of doing this?

I found a post (Match a line with multiple regex using Python) that only gives a True or False result. How do I record which regexes matched in the line?

any(regex.match(line) for regex in [regex1, regex2, regex3])

Edit: adding example

regex = ['quick','brown','fox']
line1 = "quick brown fox jumps on the lazy dog" # should record all of quick, brown and fox
line2 = "quick dog and brown rabbit ran together" # should record quick and brown
line3 = "fox was quick an rabit was slow" # should record quick and fox

Looping through the regexes and recording each one that matches is one solution, but given the scale (11 * 190000 * 10 checks), my script has been running for a while. I need to repeat this quite often in my work, so I am looking for a more efficient way.

okkhoy
  • What are the regex that you're trying to match? You can probably combine them into 1 regex pretty easily ... – mgilson Oct 23 '12 at 12:33
  • I think you need to provide a more detailed explanation of what you're actually trying to do. I don't understand "record that string which appears in the line separately" - what exactly do you mean by "record"? Do you want to record the match, the regex that matched, the line where the regex matched, the position in the line where the regex matched? What if there are matches on more than one line? Does that matter? Etc. – Tim Pietzcker Oct 23 '12 at 12:50
  • @TimPietzcker hope the additional information in the edit helps explain my problem? – okkhoy Oct 23 '12 at 12:53
  • Not sure yet - so you want your result as a list like `[["quick", "brown", "fox"], ["quick", "brown"], ["fox", "quick"]]`? What if a line doesn't match at all? Do you want the match or the regex in this list (here they are identical but what about a regex like `qu\w*ck`)? – Tim Pietzcker Oct 23 '12 at 12:58
  • @TimPietzcker the output you have suggested is right. i need the regex in the list not the match. if there is no match i will record '' (null string) sorry for the confusion created! – okkhoy Oct 23 '12 at 13:09
  • your question seems like a simpler version of [How to match a string against a set of wildcard strings efficiently?](http://stackoverflow.com/q/12904860/4279) – jfs Oct 23 '12 at 13:23
  • OK; if you want to record the regex, I take it that the order in which those regexes matched in a given line does not matter? – Tim Pietzcker Oct 23 '12 at 13:45
  • @TimPietzcker no. order does not matter. – okkhoy Oct 23 '12 at 17:14

2 Answers

The approach below applies if you want the matched text. If you instead need the regular expression that triggered each match, the combined pattern below does not report that by itself, and the simplest fallback is to loop.

Based on the link you have provided:

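import re
regexes = 'quick', 'brown', 'fox'
# Wrap each pattern in a non-capturing group and join them into one alternation
combinedRegex = re.compile('|'.join('(?:{0})'.format(x) for x in regexes))

lines = 'The quick brown fox jumps over the lazy dog', 'Lorem ipsum dolor sit amet', 'The lazy dog jumps over the fox'

for line in lines:
    # One pass per line; findall returns every non-overlapping match
    print(combinedRegex.findall(line))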

outputs:

['quick', 'brown', 'fox']
[]
['fox']

The point here is that you do not loop over the regexes but combine them into one. The difference from the looping approach is that re.findall does not find overlapping matches. For instance, if your regexes were 'bro' and 'own', the output for the lines above would be:

['bro']
[]
[]

whereas the looping approach would result in:

['bro', 'own']
[]
[]
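
If overlapping matches matter, one possible workaround is to wrap the combined pattern in a lookahead, so that each match is zero-width and the scan can report a hit at every starting position. The following is only a sketch under assumptions: the helper name find_overlapping is made up for illustration, and the patterns are assumed to contain no capturing groups of their own.

```python
import re

def find_overlapping(patterns, line):
    # Hypothetical helper: (?=(...)) matches without consuming any text,
    # so a later pattern can still match inside an earlier match.
    alternation = '|'.join('(?:{0})'.format(p) for p in patterns)
    combined = re.compile('(?=({0}))'.format(alternation))
    # findall returns the contents of the single capturing group
    return combined.findall(line)

print(find_overlapping(['bro', 'own'], 'The quick brown fox jumps over the lazy dog'))
```

This still reports at most one pattern per starting position, so two patterns that begin at the same character (such as 'quick' and 'quick dog') are not both found; for that case the loop remains the safer choice.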
cooltea

If you're just trying to match literal strings, it's probably easier to just do:

strings = 'foo','bar','baz','qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))

and then you can test the whole thing at once:

match = regex.match(line)

Of course, you can get the string which matched from the resulting MatchObject:

if match:
    matching_string = match.group(0)

In action:

import re
strings = 'foo','bar','baz','qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))

lines = 'foo is a word I know', 'baz is a  word I know', 'buz is unfamiliar to me'

for line in lines:
    match = regex.match(line)  # match() only succeeds at the start of the line
    if match:
        print(match.group(0))

It appears that you're really looking to search the string for your regex. In that case you'll need re.search (or a variant such as re.findall) rather than re.match, because re.match only matches at the start of the string. As long as none of your regular expressions overlap, you can use the solution above with re.findall:

matches = regex.findall(line)
for word in matches:
    print ("found {word} in line".format(word=word))
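
For the case raised in the comments, where the patterns are real regexes (such as `qu\w*ck`) and the pattern itself rather than the matched text should be recorded, one option is to give each pattern its own named group and read `lastgroup` from every match. This is a minimal sketch, not a definitive implementation: the group names p0, p1, ... and the helper patterns_found are invented for illustration, and the same non-overlap caveat applies.

```python
import re

patterns = [r'qu\w*ck', 'brown', 'fox']
# Hypothetical group names p0, p1, ... map each named group back to its pattern
names = ['p{0}'.format(i) for i in range(len(patterns))]
by_name = dict(zip(names, patterns))
combined = re.compile('|'.join('(?P<{0}>{1})'.format(n, p)
                               for n, p in zip(names, patterns)))

def patterns_found(line):
    # Collect the source pattern behind every match; order is not kept
    found = set()
    for m in combined.finditer(line):
        found.add(by_name[m.lastgroup])
    return found

for line in ('quick brown fox jumps on the lazy dog',
             'quick dog and brown rabbit ran together',
             'Lorem ipsum dolor sit amet'):
    print(sorted(patterns_found(line)))
```

A set is used because the asker stated that the order of the matches does not matter; an empty set corresponds to a line with no match.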

mgilson
  • reading comprehension... he needs a True/False result for each individual regex. – l4mpi Oct 23 '12 at 12:36
  • @l4mpi -- I don't think so -- the question states "How do I record **the** (singular, emphasis mine) matching regex from the line" ... You can figure out which regex matched from the corresponding MatchObject (e.g. `match.group(0)`) – mgilson Oct 23 '12 at 12:38
  • from OP: "I need to record that string which appears in the line separately." – l4mpi Oct 23 '12 at 12:39
  • for example, my regexes are ['quick', 'brown', 'fox'] and i have a line "the jumping brown dog scared the quick fox away" here i need to be able to record the words quick,brown and fox since all three are present in the line. a simple compile and match will not help me here. right? – okkhoy Oct 23 '12 at 12:44
  • @mgilson this doesn't handle multiple words in the same line, e.g. "foo bar baz" prints "foo" - not sure if OP needs this though (edit: he does^^). – l4mpi Oct 23 '12 at 12:47
  • @okkhoy -- A simple `compile` and `match` won't help, but you might be able to use a `compile` and a `findall` which will work as long as your regex can't overlap -- `'quick','brown','fox'` is OK. `'quick','brown','fox','quick dog'` won't work. – mgilson Oct 23 '12 at 12:49