2

I just discovered Beautiful Soup, which seems very powerful. I'm wondering if there is an easy way to extract the "alt" attribute of images together with the text. A simple example would be

from bs4 import BeautifulSoup

html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())

This would result in

Among the different sections of the orchestra you will find:

A in the strings

A in the brass

A in the woodwinds

But I would like to have the alt field inside the text extraction, which would give

Among the different sections of the orchestra you will find:

A violin in the strings

A trumpet in the brass

A clarinet and saxophone in the woodwinds

Thanks

Portland
  • take a look at: http://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup (possible duplicate of this question) – JacobIRR Apr 24 '17 at 03:47

3 Answers

2

Please consider this approach.

from bs4 import BeautifulSoup

html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
ptag = soup.find_all('p')   # get all tags of type <p>

for tag in ptag:
    instrument = tag.find('img')    # search for <img>
    if instrument:  # if we found an <img> tag...
        # ...create a new string with the content of 'alt' inserted into 'tag.text';
        # the index 2 assumes the text starts with "A ", as in this example
        temp = tag.text[:2] + instrument['alt'] + tag.text[2:]
        print(temp) # print
    else:   # if we haven't found an <img> tag we just print 'tag.text'
        print(tag.text)

The output is

Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds

The strategy is:

  1. Find all <p> tags
  2. Search for an <img> tag in these <p> tags
  3. If we find an <img> tag, insert the content of its alt attribute into tag.text and print the result
  4. If we don't find an <img> tag, just print tag.text
dtell
  • Thanks a lot @datell. It works fine. One more question. If I had two images in the same paragraph, like

    Among the different sections of the orchestra you will find:

    A violin in the strings. A trumpet in the brass

    A clarinet and saxophone in the woodwinds

    , then it wouldn't extract the second one. Any idea how to handle 2 or more "img" tags in the same paragraph?
    – Portland Apr 24 '17 at 20:08
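One possible way to handle several images per paragraph, offered as a minimal sketch rather than as the answerer's method: replace every <img> tag in the tree with its alt text, then read the paragraph text as usual. The two-image markup and the function name paragraphs_with_alt are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the second paragraph holds two <img> tags.
html_doc = """
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings. A <img src="07fg03-trumpet.jpg" alt="trumpet" /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""

def paragraphs_with_alt(html):
    """Return the text of each <p>, with every <img> replaced by its alt text."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all returns a list, so modifying the tree inside the loop is safe.
    for img in soup.find_all('img'):
        # Swap the tag for its alt text in place (empty string if alt is missing).
        img.replace_with(img.get('alt', ''))
    return [p.get_text() for p in soup.find_all('p')]

for line in paragraphs_with_alt(html_doc):
    print(line)
```

Because the substitution happens in the parse tree, it does not depend on a fixed character position and works for any number of images in a paragraph.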
1
a = soup.findAll('img')

for every in a:
    print(every['alt'])

This will do the job.

The first line finds all the <img> tags using .findAll (named find_all in current Beautiful Soup).

or for the text

for eachline in a:
    print(eachline.text)  # <img> tags contain no text, so each printed line is empty

A simple for loop goes through each of the results. Alternatively, index them manually: soup.findAll('img')[0], then soup.findAll('img')[1], and so on.
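One caveat about every['alt'] in the first loop: indexing a tag raises KeyError when the attribute is absent. A small sketch using Tag.get with a default instead; the sample markup and the name alt_texts are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second image has no alt attribute.
sample = '<p><img src="a.jpg" alt="violin"/><img src="b.jpg"/></p>'

def alt_texts(html):
    """Collect the alt text of every <img>, using '' when alt is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    # .get() returns the default instead of raising KeyError when alt is absent.
    return [img.get('alt', '') for img in soup.find_all('img')]

print(alt_texts(sample))
```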

innicoder
  • Thanks, but your code returns violin trumpet clarinet and saxophone. This was not my question, I would like these inside the text "at the right place", as per my original post. – Portland Apr 24 '17 at 13:46
0

If you want a general solution, you can use the function get_all_text() defined below as an alternative to the standard get_text():

from bs4.element import Tag, NavigableString

def get_all_text(element, separator=u"", strip=False):
    """
    Get all child strings, including images alt text, concatenated using the given separator.
    """
    strings = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            string = str(descendant.string)
        elif isinstance(descendant, Tag) and descendant.name == 'img':
            string = descendant.attrs.get('alt', '')
        else:
            continue
        if strip:
            string = string.strip()
        if string != '':
            strings.append(string)
    return separator.join(strings)

With this solution, you can also define a custom separator and choose whether to strip the strings, as with the standard get_text(). Because it walks all descendants in document order, it also handles several images in one paragraph and images nested inside other tags.

In your example, it would look like this:

from bs4 import BeautifulSoup

html_doc ="""
<body>
<p>Among the different sections of the orchestra you will find:</p>
<p>A <img src="07fg03-violin.jpg" alt="violin" /> in the strings</p>
<p>A <img src="07fg03-trumpet.jpg" alt="trumpet"  /> in the brass</p>
<p>A <img src="07fg03-woodwinds.jpg" alt="clarinet and saxophone"/> in the woodwinds</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(get_all_text(soup))

Output:


Among the different sections of the orchestra you will find:
A violin in the strings
A trumpet in the brass
A clarinet and saxophone in the woodwinds
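
The separator and strip parameters can be illustrated on a paragraph with two images. This is only a sketch: the function is repeated from this answer so the block runs on its own, and the two-image markup and the "|" separator are assumptions chosen to make the individual pieces visible.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def get_all_text(element, separator="", strip=False):
    """Same function as above, repeated so this block runs standalone."""
    strings = []
    for descendant in element.descendants:
        if isinstance(descendant, NavigableString):
            string = str(descendant.string)
        elif isinstance(descendant, Tag) and descendant.name == 'img':
            string = descendant.attrs.get('alt', '')
        else:
            continue
        if strip:
            string = string.strip()
        if string != '':
            strings.append(string)
    return separator.join(strings)

# Hypothetical input with two <img> tags in one paragraph.
html_doc = ('<p>A <img src="v.jpg" alt="violin"/> in the strings. '
            'A <img src="t.jpg" alt="trumpet"/> in the brass</p>')
soup = BeautifulSoup(html_doc, 'html.parser')

# Default: pieces are joined as-is, so the original spacing is preserved.
print(get_all_text(soup))
# Stripped pieces joined with a visible separator show where each piece starts.
print(get_all_text(soup, separator="|", strip=True))
```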