I am trying to read an HTML page and extract some information from it. In one of the lines, the information I need is inside an image's alt attribute, like so:

<img src='logo.jpg' alt='info i need'>

The problem is that, when parsing this, BeautifulSoup surrounds the contents of alt with double quotes instead of using the single quotes already present. As a result, I get something like this:

<img alt="\'info" i="" need="" src="\'logo.jpg\'"/>

Currently, my code is this:

name = row.find("td", {"class": "logo"}).find("img")["alt"]

This should return "info i need" but currently returns "\'info". What am I doing wrong? Is there a setting I need to change so that BeautifulSoup parses this correctly?

Edit: my code looks something like this (I also tried the standard HTML parser, but it made no difference):

import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():     
    url = 'https://myhtml.html'
    with urllib.request.urlopen(url) as page:
        text = str(page.read())
        html = BeautifulSoup(page.read(), "lxml")

        table = html.find("table", {"id": "info_table"})
        rows = table.find_all("tr")

        for row in rows:
            if row.find("th") is not None:
                continue
            info = row.find("td", {"class": "logo"}).find("img")["alt"]
            print(info) 


if __name__ == '__main__':
    main()

and the html:

<div class="table_container">
<table class="info_table" id="info_table">
<tr>
   <th class="logo">Important infos</th>
   <th class="useless">Other infos</th>
</tr>
<tr >
   <td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
   <td class="useless">
      <nobr>useless info</nobr>
   </td>
</tr>
<tr >
   <td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
   <td class="useless">
      <nobr>useless info</nobr>
   </td>
</tr>
Jaejatae
  • It would be better to elaborate on why double and single quote matters. – mad_ Oct 16 '18 at 17:08
  • Possible duplicate of [BeautifulSoup: Extract img alt data](https://stackoverflow.com/questions/11696745/beautifulsoup-extract-img-alt-data) – tif Oct 16 '18 at 17:09
  • @tif, I tried the answer on your link, but the result was the same. – Jaejatae Oct 19 '18 at 02:45
  • I would suggest using html5lib over the lxml parser with BeautifulSoup; in my experience it gives better results and fewer complications than the other parsers. Note, though, that the BeautifulSoup documentation describes it as the slowest of the three available parsers. – SanthoshSolomon Oct 19 '18 at 07:09

1 Answer

Sorry, I am unable to add a comment.

I have tested your case and for me the output seems correct.

HTML:

<html>
    <body>
        <td class="logo">
            <img src='logo.jpg' alt='info i need'>
        </td>
    </body>
</html>

Python:

from bs4 import BeautifulSoup

with open("myhtml.html", "r") as html:
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.find("td", {"class": "logo"}).find("img")["alt"]
    print(name)

Returns:

info i need

I think your problem is an encoding problem that occurs when writing the file back to HTML.

Please provide the full code and further information:

  • the HTML
  • your Python code

Update:

I've tested your code, and it does not work at all :/ After reworking it, I was able to get the required output.

import urllib.request
from bs4 import BeautifulSoup

def main():
    url = 'https://code.mytesturl.net'
    with urllib.request.urlopen(url) as page:
        # Pass the response object straight to BeautifulSoup;
        # do not wrap page.read() in str().
        soup = BeautifulSoup(page, "html.parser")
        name = soup.find("td", {"class": "logo"}).find("img")["alt"]
        print(name)


if __name__ == '__main__':
    main()

Possible problems:
Maybe your parser should be html.parser.
Which Python version / bs4 version are you using?
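
For reference, here is a minimal sketch of the same approach applied to the row loop from the question. It assumes the markup is the table posted in the question, embedded here as a bytes literal so the example runs without a network call; SAMPLE_HTML and extract_alts are names made up for this illustration, not part of bs4.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (hypothetical stand-in for the
# bytes that urllib.request.urlopen(url).read() would return).
SAMPLE_HTML = b"""
<table class="info_table" id="info_table">
<tr>
   <th class="logo">Important infos</th>
   <th class="useless">Other infos</th>
</tr>
<tr>
   <td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
   <td class="useless"><nobr>useless info</nobr></td>
</tr>
<tr>
   <td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
   <td class="useless"><nobr>useless info</nobr></td>
</tr>
</table>
"""


def extract_alts(markup):
    """Return the alt text of each logo image in the info table."""
    # The raw bytes go to BeautifulSoup as-is, with no str() around them.
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", {"id": "info_table"})
    alts = []
    for row in table.find_all("tr"):
        if row.find("th") is not None:
            continue  # skip the header row
        img = row.find("td", {"class": "logo"}).find("img")
        alts.append(img["alt"])
    return alts


if __name__ == "__main__":
    for alt in extract_alts(SAMPLE_HTML):
        print(alt)
```

With a live page, the same function could be given the response object or the result of a single page.read() call instead of SAMPLE_HTML.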

Zim
Fabian
  • Edited the question to add the information you need – Jaejatae Oct 18 '18 at 15:46
  • Thank you @Fabian, it turns out my problem was how I was using urllib, your example showed me that. I was using str(page.read()), when I should be using BeautifulSoup(page, "html.parser") directly. Thanks again for your help :) – Jaejatae Oct 19 '18 at 14:31
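
For anyone landing here with the same symptom, the following sketch tries to show why str(page.read()) produces the "\'info" result. Calling str() on a bytes object returns its repr (b'...' with the inner single quotes escaped as \'), not the decoded text, so the parser never sees a properly quoted attribute. The example uses only the standard library HTMLParser to suggest that the problem sits upstream of BeautifulSoup; AltCollector, collect_alts and the raw literal are hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the alt attribute of every <img> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.alts.append(dict(attrs).get("alt"))


def collect_alts(markup):
    """Parse a str of markup and return the alt values found."""
    parser = AltCollector()
    parser.feed(markup)
    parser.close()
    return parser.alts


# Hypothetical stand-in for the bytes returned by page.read().
raw = b"""<td class="logo"><img src='logo.jpg' alt='info i need'></td>"""

as_repr = str(raw)             # repr of the bytes object, with \' escapes
as_text = raw.decode("utf-8")  # the actual markup as text

print(as_repr)                 # shows the b'...' wrapper and escaped quotes
print(collect_alts(as_repr))   # alt value is mangled
print(collect_alts(as_text))   # alt value is intact
```

Decoding the bytes, or handing the bytes (or the response object) directly to BeautifulSoup as in the answer, avoids the issue. Note also that read() consumes the response, so calling page.read() a second time returns empty bytes.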