I am trying to read an HTML page and extract some information from it. In one of the lines, the information I need is inside an image's alt attribute, like so:
<img src='logo.jpg' alt='info i need'>
The problem is that, when parsing this, BeautifulSoup surrounds the contents of alt with double quotes instead of using the single quotes already present. As a result, the parsed tag looks something like this:
<img alt="\'info" i="" need="" src="\'logo.jpg\'"/>
Currently, my code to read the attribute is this:
name = row.find("td", {"class": "logo"}).find("img")["alt"]
This should return "info i need", but it currently returns "\'info". What am I doing wrong? Is there a setting I need to change so that BeautifulSoup parses this correctly?
Edit: my code looks something like this (I tried the standard HTML parser too, but it made no difference):
import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():
    url = 'https://myhtml.html'
    with urllib.request.urlopen(url) as page:
        text = str(page.read())
        html = BeautifulSoup(page.read(), "lxml")
        table = html.find("table", {"id": "info_table"})
        rows = table.find_all("tr")
        for row in rows:
            if row.find("th") is not None:
                continue
            info = row.find("td", {"class": "logo"}).find("img")["alt"]
            print(info)

if __name__ == '__main__':
    main()
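One thing that may be worth checking: the backslashes in the output look like what `str()` produces when it is applied to a `bytes` object, since that returns the `b'...'` representation with the inner single quotes escaped, rather than the decoded text. The sketch below tries to reproduce that with the standard library only. It is not the code above: `AltCollector` and `collect_alts` are hypothetical helpers, the sample bytes are a hand-written copy of one row from the page, `html.parser` stands in for BeautifulSoup, and UTF-8 is an assumed encoding.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Hypothetical helper: collects the alt attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.alts.append(dict(attrs).get("alt"))


def collect_alts(markup):
    """Parse a string of markup and return the alt values that were found."""
    parser = AltCollector()
    parser.feed(markup)
    parser.close()
    return parser.alts


# Hand-written stand-in for the bytes returned by page.read().
raw = b'<td class="logo"><img src=\'Logo.jpg\' alt=\'info i need\'><br></td>'

as_repr = str(raw)             # the bytes *representation*: b'...' with \' escapes
as_text = raw.decode("utf-8")  # the actual markup (UTF-8 is an assumption)

print(as_repr)
print(collect_alts(as_repr))   # alt is no longer the full 'info i need'
print(collect_alts(as_text))
```

If this matches what is happening, passing the raw bytes or a decoded string to the parser, instead of the result of `str()`, would be the thing to try. Note also that a second `page.read()` on the same response returns no further data once the first call has consumed the body.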
And the HTML:
<div class="table_container">
<table class="info_table" id="info_table">
<tr>
<th class="logo">Important infos</th>
<th class="useless">Other infos</th>
</tr>
<tr >
<td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>
<tr >
<td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
<td class="useless">
<nobr>useless info</nobr>
</td>
</tr>