I've recently been trying to pull information from a website, and while I have been mostly successful, it's been a bit of a struggle.
I'm currently using regex to find some of the information (here, the names that I want to look at):
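import re
import urllib

# crewid is assumed to be defined earlier (the numeric crew ID)
webAddress = 'http://meridian.puzzlepirates.com/yoweb/crew/info.wm?crewid=' + str(crewid)
htmlFile = urllib.urlopen(webAddress)  # Python 2 call; Python 3 uses urllib.request.urlopen
htmlText = htmlFile.read()
# Capture whatever follows 'classic&target=' up to the closing '">'
regex = r'classic&target=(.+?)">'
pattern = re.compile(regex)
checkMatch = pattern.findall(htmlText)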
Like so. That works fine when there is a consistent indicator on the same line as the value I want. However, I now have a case where the indicator isn't on that line:
<td width="28" height="28"><a href="/ratings/top_5_0.html"><img
src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
alt="Gunning"></a></td>
<td align="left">
<font size="-1">
<i><b>Exalted</b></i>/<b>Master</b>
</font>
Specifically, I'm looking to pull the second-to-last line. However, that line may not be bolded or italicised, and it may not contain the same words, so my indicator more or less has to be "Gunning", since that is the specific stat I care about. Unfortunately, it isn't even on the same line number from page to page, so I can't just look at a fixed line to find it. Any suggestions would be great!
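To make the goal concrete, here is a minimal sketch of the kind of multi-line match I am after. It assumes Python 3, that the rank always sits in the first <font> block after the alt="Gunning" image, and that html_text stands in for the fetched page. The function name rank_after is made up for illustration.

```python
import re

# Sample HTML copied from the snippet above; a stand-in for the fetched page.
html_text = '''<td width="28" height="28"><a href="/ratings/top_5_0.html"><img
src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
alt="Gunning"></a></td>
<td align="left">
<font size="-1">
<i><b>Exalted</b></i>/<b>Master</b>
</font>'''


def rank_after(stat_name, text):
    """Return the text of the first <font> block after alt="stat_name", tags stripped."""
    # re.DOTALL lets '.' match newlines, so the match can span several lines.
    pattern = re.compile(
        r'alt="' + re.escape(stat_name) + r'".*?<font[^>]*>(.*?)</font>',
        re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        return None
    # Drop any <b>/<i> tags so bold or italic formatting does not matter.
    return re.sub(r'<[^>]+>', '', match.group(1)).strip()


print(rank_after("Gunning", html_text))
```

This keys on "Gunning" rather than on a line number, but it is still regex over HTML, so it would likely break if the page layout changed.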
EDIT
I've started trying to learn and use Beautiful Soup (thanks for pointing me in that direction).
I wasn't as clear as I meant to be at first, so let me try to clarify.
Specifically, I'm trying to pull the ranks from a page like this:
<td width="28" height="28"><a href="/ratings/top_5_0.html"><img
src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
alt="Gunning"></a></td>
<td align="left">
<font size="-1">
<i><b>Exalted</b></i>/<b>Master</b>
</font>
The HTML above is the section I am specifically looking for, and it isn't always formatted the same way (e.g. it could be non-bolded, bolded, or bolded and italicized), so I'm not really sure what method I could use to reliably pull a specific stat from it.
I tried isolating it via font size as well, but the number of results isn't consistent, and thus I can't isolate the specific stat I want.
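What I think I need is to anchor on the image with alt="Gunning" and then read the text of the next <font> block, ignoring whatever bold or italic tags wrap it. Below is a rough sketch of that idea using only the standard-library html.parser module (Python 3), not Beautiful Soup, though the same approach should carry over. The names StatRankParser and stat_rank are made up for illustration, and the sample string stands in for the real page.

```python
from html.parser import HTMLParser


class StatRankParser(HTMLParser):
    """Collect the text inside the first <font> that follows <img alt=stat_name>."""

    def __init__(self, stat_name):
        super().__init__()
        self.stat_name = stat_name
        self.seen_stat = False  # True once the matching <img> has been seen
        self.in_font = False    # True while inside the <font> that follows it
        self.done = False       # True once that <font> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "img" and dict(attrs).get("alt") == self.stat_name:
            self.seen_stat = True
        elif tag == "font" and self.seen_stat:
            self.in_font = True

    def handle_endtag(self, tag):
        if tag == "font" and self.in_font:
            self.in_font = False
            self.done = True

    def handle_data(self, data):
        # Text inside <b>/<i> arrives here too, so formatting is ignored.
        if self.in_font:
            self.parts.append(data)


def stat_rank(html_text, stat_name):
    """Return the rank text for stat_name, or None if it is not found."""
    parser = StatRankParser(stat_name)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts).strip() or None


sample = '''<td width="28" height="28"><a href="/ratings/top_5_0.html"><img
src="/yoweb/images/stat-5.png" width="28" height="28" border="0"
alt="Gunning"></a></td>
<td align="left">
<font size="-1">
<i><b>Exalted</b></i>/<b>Master</b>
</font>'''

print(stat_rank(sample, "Gunning"))
```

Because it collects plain text rather than matching on <b> or <i>, the same code should cope with the non-bolded, bolded, and bolded-and-italicized variants, assuming the rank is always the first <font> after the stat image.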