findAll() in BeautifulSoup skips over multiple ids

Question

I have a string with multiple ids in the image tag:

<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" /> 

import bs4

soup = bs4.BeautifulSoup(webpage, "html.parser")
images = soup.findAll('img')
for image in images:
    print(image)

The printed tag contains only the second id, comp-jefxldtzbalatamediacontentimage.

Replacing

soup = bs4.BeautifulSoup(webpage,"html.parser")

with

soup = bs4.BeautifulSoup(webpage,"lxml")

returns only the first id, webfast-uhyubv.

However, I want to get both ids, in the order in which they appear in the input line.

This code only fetches the first id and not the second one. – Rachit kapadia, May 18 '18 at 05:51
@Rachit it depends on the parser. – Keyur Potdar, May 18 '18 at 16:09

score 1 · Answer 1 · answered May 18 '18 at 07:56

BeautifulSoup stores the attributes of a tag in a dictionary. Since a dictionary cannot have duplicate keys, one id attribute overwrites the other. You can check the dictionary of attributes using tag.attrs.

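>>> from bs4 import BeautifulSoup
>>> tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'
>>> soup = BeautifulSoup(tag, 'html.parser')
>>> soup.img.attrs
{'id': 'comp-jefxldtzbalatamediacontentimage', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}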

>>> soup = BeautifulSoup(tag, 'lxml')
>>> soup.img.attrs
{'id': 'webfast-uhyubv', 'alt': '', 'data-type': 'image', 'src': 'http://webfast.co/images/webfast-logo.png'}

As you can see, we get a different value for id depending on the parser. This happens because the parsers handle duplicate attributes differently: html.parser keeps the last value, while lxml keeps the first.
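To see why a dictionary can only hold one of the two values, here is a minimal sketch in plain Python. It is not how either parser is implemented; the names pairs, last_wins and first_wins are made up for illustration. It only shows the two overwrite policies that would produce the results above.

```python
# Attribute pairs in source order, as a parser would encounter them.
pairs = [
    ("id", "webfast-uhyubv"),
    ("alt", ""),
    ("data-type", "image"),
    ("id", "comp-jefxldtzbalatamediacontentimage"),
]

# "Last wins": each duplicate key overwrites the earlier value.
last_wins = dict(pairs)

# "First wins": a key is only stored if it has not been seen yet.
first_wins = {}
for name, value in pairs:
    first_wins.setdefault(name, value)

print(last_wins["id"])   # comp-jefxldtzbalatamediacontentimage
print(first_wins["id"])  # webfast-uhyubv
```

Either way, the dictionary ends up with a single id entry, so the other value is gone by the time you look at tag.attrs.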

There is no way to get both id values from the attribute dictionary that BeautifulSoup builds. You can get them with a regex, but use it carefully and as a last resort!

>>> import re
>>> tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'
>>> ids = re.findall('id="(.*?)"', tag)
>>> ids
['webfast-uhyubv', 'comp-jefxldtzbalatamediacontentimage']
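If you would rather not use a regex, one possible alternative (not part of BeautifulSoup) is the standard library's html.parser.HTMLParser. It passes attributes to handle_starttag as a list of (name, value) pairs rather than a dictionary, so duplicates are kept in source order. The class name IdCollector below is hypothetical, and this is only a sketch for img tags:

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id value on img tags, duplicates included, in source order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so repeated names are preserved.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            self.ids.extend(value for name, value in attrs if name == "id")


tag = '<img id="webfast-uhyubv" alt="" data-type="image" id="comp-jefxldtzbalatamediacontentimage" src="http://webfast.co/images/webfast-logo.png" />'

collector = IdCollector()
collector.feed(tag)
collector.close()
print(collector.ids)
# ['webfast-uhyubv', 'comp-jefxldtzbalatamediacontentimage']
```

Unlike the regex, this only looks at real attributes named id, so something like data-id="..." or text that happens to contain id="..." would not be picked up.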

Thanks for the detailed response. Given the HTML variants that exist, I'll be using the regex approach for now. — anurag, May 19 '18 at 01:33
