
I have an HTML string in which the img tag has two id attributes:

<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" /> 

import bs4

soup = bs4.BeautifulSoup(webpage, "html.parser")
images = soup.find_all('img')  # find_all is the current name for findAll
for image in images:
    print(image)

With html.parser, the printed tag keeps only the second id, comp-jefxldtzbalatamediacontentimage.

Replacing

soup = bs4.BeautifulSoup(webpage,"html.parser")

with

soup = bs4.BeautifulSoup(webpage,"lxml")

returns only the first id, webfast-uhyubv.

However, I want to get both ids, in the order they appear in the input line.

Kalle Richter
anurag

1 Answer


BeautifulSoup stores the attributes of a tag in a dictionary. Since a dictionary cannot have duplicate keys, one id attribute overwrites the other. You can check the dictionary of attributes using tag.attrs.

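>>> from bs4 import BeautifulSoup
>>> tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'
>>> soup = BeautifulSoup(tag, 'html.parser')
>>> soup.img.attrs
{'id': 'comp-jefxldtzbalatamediacontentimage', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}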

>>> soup = BeautifulSoup(tag, 'lxml')
>>> soup.img.attrs
{'id': 'webfast-uhyubv', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}

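As you can see, the id value that survives depends on the parser: html.parser keeps the last duplicate, while lxml keeps the first. Each parser resolves the duplicate attribute in its own way before BeautifulSoup builds the dictionary.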
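The overwrite itself is easy to reproduce without any parser. The following is only a minimal sketch, assuming the attributes arrive as a list of (name, value) pairs in source order; the names pairs, last_wins and first_wins are made up for this illustration and are not part of BeautifulSoup.

```python
# Attribute pairs in the order they appear in the tag (illustrative data).
pairs = [
    ("id", "webfast-uhyubv"),
    ("alt", ""),
    ("data-type", "image"),
    ("id", "comp-jefxldtzbalatamediacontentimage"),
]

# Building a dict directly: a later duplicate key overwrites the earlier one.
last_wins = dict(pairs)

# Keeping the first occurrence instead: keys already present are left alone.
first_wins = {}
for name, value in pairs:
    first_wins.setdefault(name, value)

print(last_wins["id"])   # comp-jefxldtzbalatamediacontentimage
print(first_wins["id"])  # webfast-uhyubv
```

The first result matches what html.parser produced above and the second matches lxml. Either way, a dictionary has room for only one id.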

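With the default settings, BeautifulSoup cannot give you both id values, because the duplicate is discarded while the tag is parsed. (Newer releases, 4.9.1 and later, accept an on_duplicate_attribute argument with html.parser that changes this behaviour.) You can also extract the ids with a regular expression. But use it carefully and as a last resort!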

>>> import re
>>> tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'
>>> ids = re.findall(r'(?<![\w-])id="(.*?)"', tag)  # the lookbehind skips attributes such as data-id
>>> ids
['webfast-uhyubv', 'comp-jefxldtzbalatamediacontentimage']
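If a regular expression feels too fragile, another option is Python 3's standard-library html.parser.HTMLParser, which passes attributes to handle_starttag as a list of (name, value) tuples, so duplicates are not collapsed. This is only a sketch, not something from the original answer: the class name IdCollector and its ids attribute are hypothetical, and it collects ids from img tags only.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute of every img tag, in source order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, so a repeated
        # attribute shows up once per occurrence.
        if tag == "img":
            self.ids.extend(value for name, value in attrs if name == "id")


tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'

collector = IdCollector()
collector.feed(tag)
collector.close()
print(collector.ids)
# ['webfast-uhyubv', 'comp-jefxldtzbalatamediacontentimage']
```

Self-closing tags such as the one above are routed through handle_startendtag, whose default implementation calls handle_starttag, so no extra handler is needed here.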
Keyur Potdar
Thanks for the detailed response. Given the HTML variants that exist, I'll be using the regex approach for now. – anurag May 19 '18 at 01:33