1

I am parsing a series of text files for some patterns, because I want to extract the matches to another file.

A way to say it is that I would like to "remove" everything except the matches from the file.

For example, if I have pattern1, pattern2, pattern3 as matching patterns, I'd like the following input:

bla bla
pattern1
pattern2
bla bla bla
pattern1
pattern3
bla bla bla
pattern1

To give the following output:

pattern1
pattern2
pattern1
pattern3
pattern1

I can use re.findall and successfully get the list of matches for any pattern, but I cannot think of a way to KEEP THE ORDER considering the matches of each pattern are mixed inside the file.

Thanks for reading.

heltonbiker

2 Answers

5

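Combine it all into a single pattern. With your example input, use the following pattern (pass re.MULTILINE so that ^ matches at the start of every line, not only at the start of the file):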

^pattern[0-9]+

If it's actually more complex, then try:

^(aaaaa|bbbbb|ccccc|ddddd)
Richard
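As a minimal sketch of the single-pattern idea, assuming the sample input from the question and the literal names pattern1 to pattern3 (the variable names `text`, `combined` and `matches` are only illustrative), the alternation can be tried like this:

```python
import re

# Sample input copied from the question
text = """bla bla
pattern1
pattern2
bla bla bla
pattern1
pattern3
bla bla bla
pattern1
"""

# One alternation covering every pattern of interest.
# re.MULTILINE lets ^ match at the start of each line.
combined = re.compile(r"^(pattern1|pattern2|pattern3)", re.MULTILINE)

# findall scans the text once, left to right, so the matches
# come back in the order in which they appear in the file
matches = combined.findall(text)
print("\n".join(matches))
```

Because the text is scanned only once, the order is preserved without any extra bookkeeping, which is what running a separate `re.findall` per pattern cannot give you.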
  • 1
    I don't think this works for the OP; he is looking for multiple patterns in his regex, and "pattern1, pattern2, etc." are only examples... see my answer. – Inbar Rose Aug 01 '12 at 14:39
  • I'll accept this answer, that's what I wanted to do, I just didn't know how or didn't remember how. Combining multiple patterns with `|` (OR) is the key to getting the matches in order, because it says "give me any match of the following patterns", and the results then come back already in order. – heltonbiker Aug 01 '12 at 14:42
  • Oh, I see what you did there now. Yes, the second "more complex" regex would work, but the OP should still use `file.writelines()`, iterating over the list that `re.findall()` returns. – Inbar Rose Aug 01 '12 at 14:42
  • @InbarRose yeah, the full code needs to write stuff, my doubt was just about the pattern matching. Thanks everybody! – heltonbiker Aug 01 '12 at 14:43
  • 1
    just for the record, my unholy pattern is: `'.+?|.+?'` ;o)
    – heltonbiker Aug 01 '12 at 14:45
  • 1
    @heltonbiker Have you not seen http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454? Do not try to parse HTML with regular expressions. – Abe Karplus Aug 01 '12 at 15:57
  • @AbeKarplus that's why it is "unholy". I know all of this, but for one-off site scraping without the need for REAL parsing, regexes have helped me a lot, even with the known limitations. That is, do not "count too much" on regexes to parse HTML (will I burn in hell?... ;o) – heltonbiker Aug 01 '12 at 21:13
  • (that is, regex in this case is not used to "parse html", but to extract some very well behaved parts of a set of html files) – heltonbiker Aug 01 '12 at 21:14
2

Here is an answer in "copy this and go" format.

import re

# placeholder paths (assumed names); point these at your own files
FILE_TO_READ = "input.txt"
FILE_TO_WRITE = "output.txt"

# lets you add more patterns whenever you want
list_of_regex = [r"aaaa", r"bbbb", r"cccc"]

# combine the patterns into one alternation, anchored at the start of a line
pattern_string = r"^(" + "|".join(list_of_regex) + r")"

# read the whole input file into a single string
with open(FILE_TO_READ) as fr:
    search_string = fr.read()

# re.MULTILINE makes ^ match at the start of every line,
# not only at the very start of the file
matches = re.findall(pattern_string, search_string, re.MULTILINE)

# write the matches in file order, one per line
# (writelines does not add newlines by itself)
with open(FILE_TO_WRITE, "w") as fw:
    fw.writelines(match + "\n" for match in matches)
Inbar Rose
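One caveat with `re.findall` is that it returns the contents of capturing groups rather than the whole match, so sub-patterns that contain their own parentheses change the result. A possible variant, sketched here with a hypothetical helper named `extract_in_order` and made-up sample data, uses `re.finditer` and takes the full match instead:

```python
import re

def extract_in_order(text, patterns):
    """Return every match of any pattern, in the order it occurs in text."""
    # Wrap each alternative in a non-capturing group so that
    # the alternatives cannot bleed into one another
    combined = re.compile("|".join(f"(?:{p})" for p in patterns), re.MULTILINE)
    # group(0) is the whole match, even when a pattern has groups of its own
    return [m.group(0) for m in combined.finditer(text)]

# Made-up sample data for illustration only
sample = "bla\naaaa\ncccc\nbla bla\nbbbb\naaaa\n"
result = extract_in_order(sample, [r"^aaaa", r"^b(b+)", r"^cccc"])
print(result)
```

To process a series of files, the same function can be called on the contents of each file in turn, writing `"\n".join(result)` to the corresponding output file.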