Solution with lxml (but without bs!):
from lxml import etree
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
print(root.attrib)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc'}
But there's no text attribute.
You can extract it by using the text property:
print(root.text)
>>> wiki
In conclusion:
from lxml import etree
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
dict_ = {}
dict_.update(root.attrib)
dict_.update({'text': root.text})
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
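If you need this more than once, the steps above can be wrapped in a small function. This is only a sketch: the name tag_to_dict is made up for illustration, and the fallback to xml.etree.ElementTree is an assumption for setups without lxml (it exposes the same calls used here). It uses itertext() rather than text, because text is None when the element is empty or starts with a child tag.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: if lxml is not installed, the standard library parser
    # offers the same fromstring/attrib/itertext calls used here.
    import xml.etree.ElementTree as etree


def tag_to_dict(markup):
    """Return the root element's attributes plus its text content (hypothetical helper)."""
    root = etree.fromstring(markup)
    result = dict(root.attrib)
    # itertext() also collects text inside child elements, whereas
    # root.text stops at the first child and may be None.
    result['text'] = ''.join(root.itertext())
    return result


print(tag_to_dict('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'))
```

Like the code above, this expects well-formed markup with a single root element.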
EDIT
-------parsing [X]HTML with regex is discouraged!-------
Solution with regex:
import re
pattern_text = r"[>](\w+)[<]"
pattern_href = r'href="(\w\S+)"'
pattern_rel = r'rel="([A-Za-z ]+)"'
xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
dict_ = {
'href': re.search(pattern_href, xml).group(1),
'rel': re.search(pattern_rel, xml).group(1),
'text': re.search(pattern_text, xml).group(1)
}
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
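It will work only if the input is a string that fits these patterns: double-quoted href and rel attributes and a single-word link text.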
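A quick check of how fragile this is, reusing pattern_text from the regex solution. The multi-word link is a made-up input, not something from the question.

```python
import re

pattern_text = r"[>](\w+)[<]"  # same pattern as in the regex solution

single_word = '<a href="https://wikipedia.org/">wiki</a>'
multi_word = '<a href="https://wikipedia.org/">wiki page</a>'  # hypothetical input

match_single = re.search(pattern_text, single_word)
match_multi = re.search(pattern_text, multi_word)

print(match_single.group(1))  # the single-word text is captured
print(match_multi)            # no match: \w+ cannot cross the space
```

With the second input, calling .group(1) would raise AttributeError because re.search returns None, which is one reason to prefer the lxml version.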