I have a webpage that I want to scrape using regex. The page may contain up to 3 text blocks that I care about.

If all three text blocks exist, then it should return a match, otherwise return no match. The text can be in any order on the page.

I tried this, but it doesn't satisfy the "any order" requirement:

import re

re_text = r'(Text block 1)((.|\n)*)(Text block 2)((.|\n)*)(Text block 3)'
re_compiled = re.compile(re_text)

Should I use backreferences here? Or is there another solution?

Nate Barbettini
Peter
  • The actual solution is [to not use regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). You should really use an XML parser . . . – ernie Oct 25 '12 at 00:32
  • You could just have 3 separate regexes, and three flags like `block1_found = False`. Search separately for each of them, then check if all the flags are true. Keep it simple. – Marius Oct 25 '12 at 00:33
  • You can use the `in` operator to check whether each text block is in the text. – pogo Oct 25 '12 at 00:35
  • @Pogo: yes, assuming the text blocks are constant text. – nneonneo Oct 25 '12 at 00:35
  • @ernie Not unless the XML parser can process broken XML, since webpage source is not guaranteed to be valid XML. – kgr Oct 25 '12 at 00:54
  • @ernie, hilarious, thanks. Unfortunately the HTML tags aren't going to be the same (nor will the text blocks, but I'll use a JSON config for those). – Peter Oct 25 '12 at 02:45

2 Answers

3

How about just looking for them individually?

import re

re_texts = [re.compile('textblock1'), re.compile('textblock2'), re.compile('textblock3')]

# text is the page source as a string; each pattern is searched independently
if all(r.search(text) for r in re_texts):
    print('all matches found')
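
If a single compiled pattern is preferred over separate searches, one lookahead per block should give the same any-order behavior. This is a minimal sketch, assuming the blocks are literal text (hence `re.escape`); `build_all_of_pattern` and the sample strings are hypothetical names used only for illustration:

```python
import re

def build_all_of_pattern(blocks):
    """Compile one regex that matches only if every block occurs somewhere."""
    # Each (?=.*?block) lookahead scans ahead from the start without consuming
    # input, so the blocks may appear in any order.
    # re.DOTALL lets '.' match newlines, replacing the (.|\n)* workaround.
    lookaheads = ''.join('(?=.*?' + re.escape(b) + ')' for b in blocks)
    return re.compile(lookaheads, re.DOTALL)

pattern = build_all_of_pattern(['Text block 1', 'Text block 2', 'Text block 3'])

# Blocks out of order and spread over several lines.
page = 'x Text block 3\n y Text block 1\n z Text block 2'
found_all = pattern.match(page) is not None

# One block is absent, so one lookahead fails and there is no match.
missing_one = pattern.match('Text block 1 and Text block 2 only') is not None
```

If the blocks are themselves regex fragments rather than constant text, `re.escape` would have to be dropped.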
nneonneo
-1
>>> ('a' and 'b' and 'c') in 'xyz'
False
>>> ('a' and 'b' and 'c') in 'ayz'
True
>>> ('a' and 'b' and 'c') in 'abc'
True
pogo
  • This is bad. It evaluates to `'a' in 'abc'` – lunixbochs Oct 25 '12 at 00:42
  • With `and` it would work, but one has to know the exact form of `a`, `b` and `c` which might not be the case. – kgr Oct 25 '12 at 00:44
  • @kgr: OP said that the text blocks are constant text – pogo Oct 25 '12 at 00:46
  • @pogo I'm not saying your answer is wrong; in fact I like it better than the one provided by @nneonneo because it's shorter and perhaps more efficient. I just wanted to make it clear that it won't work in all cases, but in this case it might just do the job indeed :) – kgr Oct 25 '12 at 00:55
  • ...this doesn't work. `'a' and 'b' and 'c'` evaluates to `'c'`, since it's the last element of the chain. Also, `('a' and 'b' and 'c') in 'ayz'` gives me `False` on my Python, so I think you must've made the output up... – nneonneo Oct 25 '12 at 03:16
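
The point made in the comments can be checked directly, along with the membership test that the `in` suggestion was presumably aiming for. This is a small sketch, assuming the text blocks are constant strings; `contains_all` is a hypothetical helper name:

```python
# A non-empty string is truthy, so `and` returns its last operand.
chained = ('a' and 'b' and 'c')   # evaluates to 'c'

def contains_all(text, blocks):
    """Return True only if every block is a substring of text."""
    return all(block in text for block in blocks)

blocks = ['a', 'b', 'c']
# Check the three sample strings from the answer above.
results = {s: contains_all(s, blocks) for s in ('xyz', 'ayz', 'abc')}
# results == {'xyz': False, 'ayz': False, 'abc': True}
```

Each block is tested separately here, so order does not matter and no regex is needed.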