I'm trying to extract some data from an HTML document with the re module in Python 3.
I downloaded the source HTML of this URL: http://diablo2.diablowiki.net/Rune_list and renamed the file as rune_list.html.
What I want is in the div block with id="mw-content-text", so I wrote this code:
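import re

# Read the saved page into a single string
with open('rune_list.html', 'r') as file:
    data = file.read()

# Match from the opening div with this id up to the last closing </div>
pat = re.compile(r'<div id="mw-content-text"[\s\S]*</div>')
found = re.search(pat, data)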
And... nothing is found. I know the regex is probably not very good because, as I understand it, the greedy * could swallow other </div> tags along the way, making the matched string a huge chunk of divs.
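To illustrate the greedy-match concern, here is a minimal sketch on a made-up string. The `sample` markup is hypothetical and not taken from the real page; the lazy variant with `*?` is only shown for comparison and would still stop too early if the block contained nested divs.

```python
import re

# Hypothetical sample markup, not taken from the real page.
sample = (
    '<div id="mw-content-text"><table><tr><td>El</td></tr></table></div>'
    '<div id="footer">footer text</div>'
)

# Greedy: [\s\S]* runs to the end and backtracks to the LAST </div>.
greedy = re.compile(r'<div id="mw-content-text"[\s\S]*</div>')
# Lazy: [\s\S]*? stops at the FIRST </div> it can reach.
lazy = re.compile(r'<div id="mw-content-text"[\s\S]*?</div>')

greedy_match = greedy.search(sample).group(0)
lazy_match = lazy.search(sample).group(0)

print(greedy_match.endswith('footer text</div>'))  # True: footer div swallowed
print(lazy_match.endswith('</table></div>'))       # True: stops after the table
```

Either way, both patterns do match something on this string, so greediness alone would not explain getting no match at all.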
But why doesn't it find anything?
I tried the exact same pattern on a file I wrote myself: a long string that begins with "<div id="mw-..." and ends with "</div>", with some random tables in it, to mimic what I want to find. In this case a matching string is found. The regex, although not well written, should work on the original too. I know that these lines are present in the document.
So I tried simpler searches on the original document: first I searched for mw-content-text, without double quotes, and a match is found. Then I tried "mw-content-text", with double quotes, and nothing is found. So it doesn't find the bigger pattern because it doesn't find this one.
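Since the bare id is found but the quoted one is not, one way to narrow this down might be to look at the exact characters surrounding each occurrence in the saved file. This is only a diagnostic sketch: `show_context` is a hypothetical helper name, and the single-quoted `sample` string is a made-up input used to show the output format, not a claim about what the real file contains.

```python
import re

def show_context(text, needle="mw-content-text", width=12):
    """Return repr() of the characters around each occurrence of needle."""
    contexts = []
    for m in re.finditer(re.escape(needle), text):
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        # repr() makes quote characters and escapes visible
        contexts.append(repr(text[start:end]))
    return contexts

# Hypothetical input, only to demonstrate the helper.
sample = "<div id='mw-content-text' lang='en'>"
for line in show_context(sample):
    print(line)
```

On the real file this would be called as `show_context(data)`, with `data` being the string read from rune_list.html.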
It's confusing: if I search for <div id="mw-... manually in the source page (opened via "view page source" in the browser), the element is there. Besides, I have already done some regex searches on other HTML documents with similar code, and it works (kinda). I know (and have used a bit) other solutions to this problem (e.g. BeautifulSoup), but I want to try with regex as an exercise.
What am I missing?