Apparent inconsistency with RegEx pattern - Python 3

Question

I'm trying to extract some data from an HTML document with the re module in Python 3. I downloaded the source HTML of this URL: http://diablo2.diablowiki.net/Rune_list and saved it as rune_list.html.

What I want is in the div block with id="mw-content-text", so I wrote this code:

import re

file=open('rune_list.html','r')
data=file.read()
file.close()

pat=re.compile(r'<div id="mw-content-text"[\s\S]*</div>')
found=re.search(pat,data)

And... nothing is found. I know the regex may not be very good: as I understand it, the greedy * could swallow other </div> tags, so the matched string would be a huge chunk of divs.

But why doesn't it find anything? I tried the exact same pattern on a file I wrote myself, a long string that begins with "<div id="mw-..." and ends with "</div>", with some random tables in it to mimic what I want to find. In that case a match is found. The regex, although not well written, should work on the original too. I know that these lines are present in the document.

So I tried simpler searches on the original document. First I searched for mw-content-text, without double quotes, and a match is found. Then I tried "mw-content-text", with double quotes, and nothing is found. The bigger pattern fails because this smaller one already fails.

It's confusing: if I search for <div id="mw-... manually in the source page (opened via "view page source" in the browser), the element is there. Besides, I have already done some regex searches on other HTML documents with similar code, and they work (kinda). I know (and have used a bit) other solutions to this problem (e.g. BeautifulSoup), but I want to try with regex as an exercise.

What am I missing?
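
One way to narrow this down is to look at the raw characters around the id in the saved file, since the bare text is found but the double-quoted text is not. The sketch below is only a diagnostic, not a diagnosis: `show_context` is a hypothetical helper, and `sample` is a made-up stand-in that assumes the saved file uses single quotes around the attribute. The real file may differ in some other way.

```python
import re

def show_context(text, needle="mw-content-text", width=20):
    """Return repr() snippets of the text surrounding each occurrence of needle."""
    snippets = []
    for m in re.finditer(re.escape(needle), text):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        # repr() makes quote characters, entities and newlines visible
        snippets.append(repr(text[start:end]))
    return snippets

# Hypothetical sample (an assumption): single quotes instead of double quotes
sample = "<body>\n<div id='mw-content-text' lang=\"en\">\n<p>runes</p>\n</div>"

for snippet in show_context(sample):
    print(snippet)
```

To run it on the real document, pass the `data` string read from rune_list.html instead of `sample`. The printed snippets show exactly which characters surround the id.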

Not found. The idea of using `[\s\S]` is to match any character, newlines included. As far as I know, `.` means "every character except the newline" — Russell Teapot, Mar 27 '16 at 20:05
You realise how large that div is? Also your regex works for me. — Padraic Cunningham, Mar 27 '16 at 20:08
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Aaron Christiansen, Mar 27 '16 at 20:15
Yes, I'm not an expert but I saw some huge divs...I don't get it. I tried to execute the script normally and with pdb, but it doesn't work. The files are in the proper places (otherwise `file=open()` should raise an error)... — Russell Teapot, Mar 27 '16 at 20:17
`Parsing HTML with regex summons tainted souls into the realm of the living`. OK, now I'm legit scared. — Russell Teapot, Mar 27 '16 at 20:20
What do you mean by "nothing found"? Did it return `None` or an apparently empty `MatchObject`? If the latter, the problem is that you need a match group within the regex to return something. Perhaps `r'(
)'` (note the wrapping parens make a capture group). As an aside, `[\s\S]*` is more commonly spelled `.*`. — tdelaney, Mar 27 '16 at 20:32
tdelaney, it returns `None`. Doesn't what you suggested mean "every character except a newline, 0 or more"? I want to catch `\n` too. `[\s\S]*` should do that: "every char that is whitespace plus every char that is not whitespace (so, everything), 0 or more" — Russell Teapot, Mar 27 '16 at 20:40
Using regex for html scraping is a bad idea, because it's slow and because of your problem. Have a look at http://www.crummy.com/software/BeautifulSoup/bs4/doc/ or http://lxml.de/ — Jesse Bakker, Mar 27 '16 at 20:41
You can use regex for HTML searches. This is never a bad idea. If you set your regex wisely, it is perfect for this job. Most probably `//g` is the regexp you are looking for. Having said that, to get a DOM element with a known id, the best method is just to write its id in JavaScript. This won't work for ids with hyphens in the middle (underscores are OK), so if the id is hyphenated you can still access the element by `window["mw-content-text"]` — Redu, Mar 27 '16 at 21:17
@JesseBakker: It is wrong to believe that scraping HTML with a regex is slow. Direct string approaches (with regex or common string functions) are far faster than DOM approaches (in particular BS4, which is the slowest way), for the simple reason that they do not have to parse the whole document and build the DOM tree. The main problem with direct string approaches is that they are error-prone, since HTML syntax can be very flexible and full of traps. — Casimir et Hippolyte, Mar 27 '16 at 22:49
I tried the regex on the source of that page and it finds it, but it matches up to the very last `</div>` in the source (78k) because of the greediness of `[\s\S]*`. It finds the previous one (38k) if it's non-greedy. I don't see anything wrong with the regex, so that's not the issue. However, I would use this one `
` — , Mar 27 '16 at 23:46
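
The points raised in the comments (`.` versus `[\s\S]`, and greedy versus non-greedy matching) can be checked on a small example. This is a minimal sketch: the `html` string is a made-up miniature page, not the real rune list, and the variable names are only illustrative.

```python
import re

# Hypothetical miniature page: the target div followed by an unrelated one
html = (
    '<div id="mw-content-text">\n'
    '<table><tr><td>El</td></tr></table>\n'
    '</div>\n'
    '<div id="footer">footer</div>'
)

# "." does not match "\n" by default, so the pattern cannot span lines
dot = re.search(r'<div id="mw-content-text".*</div>', html)

# With re.DOTALL, "." matches newlines too and behaves like [\s\S]
dot_all = re.search(r'<div id="mw-content-text".*</div>', html, re.DOTALL)

# Greedy: runs to the LAST </div> in the text
greedy = re.search(r'<div id="mw-content-text"[\s\S]*</div>', html)

# Non-greedy: stops at the FIRST </div> after the opening tag
lazy = re.search(r'<div id="mw-content-text"[\s\S]*?</div>', html)

print(dot)                         # no match object
print(len(greedy.group()), len(lazy.group()))
```

Note that the non-greedy form stops at the first `</div>` it meets, so if the target div contains nested divs it will cut the match short. That is one reason the comments recommend an HTML parser for anything beyond an exercise.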

0 Answers