Find regex occurrences of a set of patterns in correct order with Python

Question

I am parsing a series of text files for some patterns, because I want to extract the matches to another file.

Put another way, I would like to "remove" everything except the matches from the file.

For example, if I have pattern1, pattern2, pattern3 as matching patterns, I'd like the following input:

bla bla
pattern1
pattern2
bla bla bla
pattern1
pattern3
bla bla bla
pattern1

To give the following output:

pattern1
pattern2
pattern1
pattern3
pattern1

I can use re.findall and successfully get the list of matches for any pattern, but I cannot think of a way to KEEP THE ORDER considering the matches of each pattern are mixed inside the file.

Thanks for reading.

Wrote a copy-and-go solution based on @Richard's solution. – Inbar Rose Aug 01 '12 at 14:54

score 5 · Accepted Answer · answered Aug 01 '12 at 14:35

Combine it all into a single pattern. With your example code, use the pattern:

^pattern[0-9]+

If it's actually more complex, then try

^(aaaaa|bbbbb|ccccc|ddddd)
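
As a minimal sketch of the idea, here is the alternation applied to the sample input from the question. The variable names are only illustrative, and it assumes the matches start at the beginning of a line, so `re.MULTILINE` is passed to let `^` match at every line start rather than only at the start of the string:

```python
import re

# Sample input copied from the question
text = """bla bla
pattern1
pattern2
bla bla bla
pattern1
pattern3
bla bla bla
pattern1
"""

# One alternation covering all patterns; re.MULTILINE makes ^ match
# at the start of each line instead of only at the start of the string
combined = re.compile(r"^(pattern1|pattern2|pattern3)", re.MULTILINE)

# findall scans the text once, left to right, so the matches of the
# different alternatives come back in the order they appear in the file
matches = combined.findall(text)
print(matches)
# ['pattern1', 'pattern2', 'pattern1', 'pattern3', 'pattern1']
```

Because there is a single scan over the text, no merging or sorting of per-pattern result lists is needed.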

Richard


I don't think this works for OP, he has multiple matches in his regex that he looks for, "pattern1, pattern2, etc." are examples... see my answer. – Inbar Rose Aug 01 '12 at 14:39
I'll accept this answer, that's what I wanted to do, just didn't know how or didn't remember how. The multiple patterns using `|` (OR) is the key to get in order, cause it says "give me any match of the following patterns", and the result will then come already in order. – heltonbiker Aug 01 '12 at 14:42
Oh, I see what you did there now. Yes, the second "more complex" regex would work, but OP should still use `file.writelines()`, iterating over the list that `re.findall()` returns. – Inbar Rose Aug 01 '12 at 14:42
@InbarRose yeah, the full code needs to write stuff, my doubt was just about the pattern matching. Thanks everybody! – heltonbiker Aug 01 '12 at 14:43
just for the record, my unholy pattern is: `'.+?|.+?'` ;o) – heltonbiker Aug 01 '12 at 14:45
@heltonbiker Have you not seen http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454? Do not try to parse HTML with regular expressions. – Abe Karplus Aug 01 '12 at 15:57
@AbeKarplus that's why it is "unholy". I know all of this, but for one-off site scraping without the need for REAL parsing, regexes have helped me a lot, even with the known limitations. That is, do not "count too much" on regexes to parse HTML (will I burn in hell?... ;o) – heltonbiker Aug 01 '12 at 21:13
(that is, regex in this case is not used to "parse html", but to extract some very well behaved parts of a set of html files) – heltonbiker Aug 01 '12 at 21:14

score 2 · Answer 2 · answered Aug 01 '12 at 14:51

Here is an answer in "copy this and go" format.

import re

# Add more patterns here whenever you want
list_of_regex = [r"aaaa", r"bbbb", r"cccc"]

# Combine the patterns into one alternation, anchored at the start of a line
pattern_string = r"^(" + "|".join(list_of_regex) + r")"

# Placeholder paths: replace these with your own files
FILE_TO_READ = "input.txt"
FILE_TO_WRITE = "output.txt"

# Read the whole input file into a single string
with open(FILE_TO_READ) as fr:
    search_string = fr.read()

# re.MULTILINE makes ^ match at the start of every line, not only the first
matches = re.findall(pattern_string, search_string, re.MULTILINE)

# Write the matches in the order they were found, one per line
# (writelines does not add line endings by itself)
with open(FILE_TO_WRITE, "w") as fw:
    fw.writelines(match + "\n" for match in matches)

I suspect `re.compile` could be of some help here, but I'd have to look more thoroughly. Thanks anyway! — heltonbiker, Aug 01 '12 at 21:11
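
For what it's worth, a rough sketch of how `re.compile` could fit in: compile the combined pattern once and reuse it line by line, so the whole file never has to sit in memory. The helper name `extract_matches` and the sample lines are made up for illustration, and it assumes each match starts at the beginning of a line (which is what `match()` checks):

```python
import re

def extract_matches(lines, patterns):
    """Yield the match found at the start of each line, in input order."""
    # Wrap each pattern in a non-capturing group so that alternations
    # inside one pattern cannot leak into its neighbours
    combined = re.compile("|".join("(?:%s)" % p for p in patterns))
    for line in lines:
        # match() only succeeds at the start of the string, which plays
        # the role of the ^ anchor when working one line at a time
        found = combined.match(line)
        if found:
            yield found.group(0)

# Hypothetical sample data mirroring the question's input
sample_lines = [
    "bla bla\n", "pattern1\n", "pattern2\n", "bla bla bla\n",
    "pattern1\n", "pattern3\n", "bla bla bla\n", "pattern1\n",
]
result = list(extract_matches(sample_lines, [r"pattern1", r"pattern2", r"pattern3"]))
print(result)
# ['pattern1', 'pattern2', 'pattern1', 'pattern3', 'pattern1']
```

An open file object can be passed as `lines` directly, since iterating over a file yields one line at a time.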
