Trying to pull data from a poorly formatted HTML website

Question

I've recently been trying to pull information from a website, and while I've been mostly successful, it's been a bit of a struggle.

I'm currently using regex to find some of the information (here, the names I want to look at):

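import re
from urllib.request import urlopen  # Python 3; on Python 2 this was urllib.urlopen

crewid = 5002373  # example crew id
webAddress = 'http://meridian.puzzlepirates.com/yoweb/crew/info.wm?crewid=' + str(crewid)
htmlFile = urlopen(webAddress)
htmlText = htmlFile.read().decode('utf-8', errors='replace')  # bytes -> str for the regex

# Capture whatever sits between 'classic&target=' and the closing '">'
regex = 'classic&target=(.+?)">'
pattern = re.compile(regex)
checkMatch = pattern.findall(htmlText)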

Like so. That works fine when there is a consistent indicator on the specific line I want. However, I now have a case where my indicator isn't on that line.

 <td width="28" height="28"><a href="/ratings/top_5_0.html"><img 
  src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
  alt="Gunning"></a></td>
<td align="left">
  <font size="-1">
      <i><b>Exalted</b></i>/<b>Master</b>
  </font>

I'm specifically looking to pull the second-to-last line. That line may or may not be bolded or italicised, and it won't always contain the same words, so my indicator more or less has to be "Gunning", since that is the stat I care about. Unfortunately, "Gunning" isn't always on the same line number from page to page, so I can't just read a fixed line to find it. Any suggestions would be great!

EDIT

I've switched to trying to learn and use Beautiful Soup (thanks for pointing me in that direction).

I wasn't as clear as I meant to be at first, so let me try to clarify.

Specifically, I'm trying to pull the ranks from a page like this:

 <td width="28" height="28"><a href="/ratings/top_5_0.html"><img 
  src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
  alt="Gunning"></a></td>
<td align="left">
  <font size="-1">
      <i><b>Exalted</b></i>/<b>Master</b>
  </font>

The HTML for the section I'm looking for is shown above, and it isn't always formatted the same way (e.g. it could be non-bolded, bolded, or bolded and italicized). So I'm not really sure what method I could use to reliably pull a specific stat from that information.

I tried isolating it via font size as well, but the number of results isn't consistent, and thus I can't isolate the specific stat I want.

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — rohithpr, Jul 03 '16 at 15:12

score 2 · Accepted Answer · edited May 23 '17 at 10:28

The markup is definitely not easy to deal with, but you should not be approaching it with regular expressions. Don't use a tool just because it is familiar to you or you are good with it. Use the tool that is most suitable for the particular case.

In this case, you need an HTML parser, like BeautifulSoup.

Assuming you want to extract the names (the names in bold in the main crew table):

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "http://meridian.puzzlepirates.com/yoweb/crew/info.wm?crewid=5002373"
>>> 
>>> response = requests.get(url)
>>> 
>>> soup = BeautifulSoup(response.content, "html.parser")
>>> table = soup.find('table', width='330')  # relying on width, yeah, does not look reliable
>>> for b in table.find_all('b'):
...     print(b.get_text(strip=True))
... 
Captain
Senior Officer
Fleet Officer
Officer
Pirate
Cabin Person
Jobbing Pirate
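
For the clarified question (the rank shown beside the "Gunning" icon), one option is to anchor on the icon's alt text rather than on line position or formatting. The following is only a sketch. It assumes the icon cell and the rank cell are sibling td elements in the same row, as the snippet in the question suggests. The helper name rank_for_stat is hypothetical, and the wrapping table/tr tags and the closing /td in the sample were added for illustration since the question's snippet is cut off.

```python
from bs4 import BeautifulSoup

# Sample markup based on the snippet in the question. The <table><tr> wrapper
# and the final </td> are assumptions added so the fragment is complete.
SAMPLE_HTML = """
<table><tr>
 <td width="28" height="28"><a href="/ratings/top_5_0.html"><img
  src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
  alt="Gunning"></a></td>
<td align="left">
  <font size="-1">
      <i><b>Exalted</b></i>/<b>Master</b>
  </font>
</td>
</tr></table>
"""


def rank_for_stat(soup, stat_name):
    """Return the text of the cell next to the icon whose alt is stat_name, or None."""
    # Find the stat icon by its alt text, e.g. alt="Gunning".
    img = soup.find("img", alt=stat_name)
    if img is None:
        return None
    # Walk up to the <td> that holds the icon.
    icon_cell = img.find_parent("td")
    if icon_cell is None:
        return None
    # The rank is assumed to live in the next <td> of the same row.
    rank_cell = icon_cell.find_next_sibling("td")
    if rank_cell is None:
        return None
    # get_text ignores <b>/<i> tags, so the formatting does not matter.
    return rank_cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
gunning_rank = rank_for_stat(soup, "Gunning")
print(gunning_rank)
```

On the sample markup this prints Exalted/Master. Because get_text drops the tags, the same call should work whether the rank is plain, bolded, or bolded and italicized, provided the two cells really are siblings on the live page.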

Thanks for the help, not 100% what I was looking for specifically but Beautiful Soup definitely seems the much more powerful tool. I edited the main question with some clarifications. — Brennan Bibic, Jul 05 '16 at 16:11
Ah I found the solution. Turns out I could search by font size and count backwards as the end of the generated lists was the same each time. — Brennan Bibic, Jul 05 '16 at 19:42
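
The approach described in the last comment (search by font size, then count backwards from the end of the result list) might look roughly like the sketch below. The sample markup, the helper name stat_from_end, and the offset value are all made up for illustration; the comment does not say which position from the end held the stat, so the real offset would have to be checked against the actual page.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a page with several font size="-1" entries.
SAMPLE_HTML = """
<font size="-1">Able</font>
<font size="-1"><b>Master</b></font>
<font size="-1"><i><b>Exalted</b></i>/<b>Master</b></font>
<font size="-1">Novice</font>
"""

# Hypothetical position of the wanted stat, counted from the end of the list.
OFFSET_FROM_END = 2


def stat_from_end(soup, offset_from_end):
    """Return the text of the Nth-from-last font size="-1" tag, or None."""
    fonts = soup.find_all("font", size="-1")
    # Guard against pages that have fewer matches than expected.
    if offset_from_end < 1 or offset_from_end > len(fonts):
        return None
    return fonts[-offset_from_end].get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stat = stat_from_end(soup, OFFSET_FROM_END)
print(stat)
```

Counting from the end only works while the tail of the list stays the same from page to page, which is what the comment reports.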
