
My input (the variable is a string):

<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>

My expected output:

{
'href': 'https://wikipedia.org/',
'rel': 'nofollow ugc',
'text': 'wiki',
}

How can I do this in Python without using the BeautifulSoup library?
Please show how to do it with the lxml library.

Sardar
    use `lxml` instead of `beautifulsoup` – furas Aug 16 '22 at 12:51
    you could try to use `regex` but it can be very complex task in some situations so better use `beautifulsoup`, `lxml` or similar modules. – furas Aug 16 '22 at 12:53
  • @Curiouskoala That's right, thanks for helping me get to the answer. – Sardar Aug 16 '22 at 12:53
    @StevenRumbalski I am sorry for your opinion. I tried for almost 2 days but did not get the result and then I asked a question. – Sardar Aug 16 '22 at 13:30
  • @Sardar So where is the code that you tried? You need to include that in your question so that people can understand where you're going wrong. – Steven Rumbalski Aug 16 '22 at 14:20
  • @StevenRumbalski My code could have caused the question text to become cluttered. To ask a clear and short question, it is better if my code is not in the question. This clear and short question made me get the answer quickly. – Sardar Aug 16 '22 at 14:39

3 Answers


Solution with lxml (but without bs!):

from lxml import etree

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
print(root.attrib)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc'}

But there is no text attribute there. You can extract the element's text via the `text` property:

print(root.text)
>>> wiki

Putting it all together:

from lxml import etree

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
dict_ = {}
dict_.update(root.attrib)
dict_.update({'text': root.text})
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
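
Note that `etree.fromstring` expects well-formed XML; for messier real-world HTML (unquoted attributes, unclosed tags), the `lxml.html` module is more forgiving. A minimal sketch of the same idea, assuming the input is still a single `<a>` fragment:

from lxml import html

fragment = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
# for a fragment with a single element, html.fromstring returns that element
a = html.fromstring(fragment)
result = dict(a.attrib)       # copy the attributes into a plain dict
result['text'] = a.text
print(result)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}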

EDIT

-------parsing [X]HTML with regex is strongly discouraged!-------

Solution with regex:

import re
pattern_text = r"[>](\w+)[<]"
pattern_href = r'href="(\w\S+)"'
pattern_rel = r'rel="([A-Za-z ]+)"'  # [A-Za-z], not [A-z], which also matches the punctuation between Z and a

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
dict_ = {
    'href': re.search(pattern_href, xml).group(1),
    'rel': re.search(pattern_rel, xml).group(1),
    'text': re.search(pattern_text, xml).group(1)
}
print(dict_)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}

This only works because the input is a plain string in exactly this shape; it will break on more general HTML.
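
A slightly more general regex sketch (still only suitable for simple, trusted strings, not real HTML) is to collect every `name="value"` pair and then grab the text between `>` and `<`:

import re

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
# findall with two groups returns (name, value) tuples for every attribute
dict_ = dict(re.findall(r'(\w+)="([^"]*)"', xml))
text = re.search(r'>([^<]*)<', xml)
dict_['text'] = text.group(1) if text else ''
print(dict_)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}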

vovakirdan
    Thanks for the reply. There is no way to get text with the help of lxml library? – Sardar Aug 16 '22 at 13:08
    @Sardar you can read official page [parsing html with lxml](https://lxml.de/parsing.html). I took all info from that. – vovakirdan Aug 16 '22 at 13:12
    @Sardar, author of this response, and other people reading this: You **do not** parse HTML with regex, ever: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ Of course, marking this answer as accepted only adds insult to the injury. – Barry the Platipus Aug 16 '22 at 13:16
  • @platipus_on_fire Thank you for your important point. I use the lxml library and I do not parse HTML with regex. – Sardar Aug 16 '22 at 13:22
    You're right! But someone suggested that it could be parsed by re... It is just a try for a **very** special situation. Anyway I wrote a solution by lxml lib. – vovakirdan Aug 16 '22 at 13:23

This is how you do it with lxml:

from lxml import etree

html = '''<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'''
root = etree.fromstring(html)
attrib_dict = dict(root.attrib)  # copy, so the element's own attributes are not modified
attrib_dict['text'] = root.text
print(attrib_dict)

Result:

{'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
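
If the string might contain several anchors, one possible extension (a sketch, with a made-up second link) is to parse it as a document and build one dict per `<a>`:

from lxml import html

markup = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a><a href="https://example.com/">example</a>'
# document_fromstring wraps the fragment in a full html/body tree
doc = html.document_fromstring(markup)
results = []
for a in doc.findall('.//a'):
    item = dict(a.attrib)
    item['text'] = a.text
    results.append(item)
print(results)

[{'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}, {'href': 'https://example.com/', 'text': 'example'}]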
Barry the Platipus

While using BeautifulSoup, you could use `.attrs` to get a dict of a tag's attributes:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>', 'html.parser')
soup.a.attrs

--> {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc']}

To also get the text:

...
data = soup.a.attrs
data.update({'text':soup.a.text})
print(data)

--> {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc'], 'text': 'wiki'}
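
If the result should match the question's expected dict exactly, with `rel` as a single string, one sketch is to join the list back together:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>', 'html.parser')
data = dict(soup.a.attrs)
# bs4 treats rel as a multi-valued attribute, so join it back into one string
if isinstance(data.get('rel'), list):
    data['rel'] = ' '.join(data['rel'])
data['text'] = soup.a.text
print(data)

--> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}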
HedgeHog
    Thanks for the reply. You can tell with the help of lxml library? – Sardar Aug 16 '22 at 13:04
    This is a solution with BeautifulSoup, which the question's author ruled out :) Via bs it would be easy to unpack any lxml constructions by using some magic lambda functions. – vovakirdan Aug 16 '22 at 13:10
    @vovakirdan: You are right, too fast while reading - so just see it as an additional example, in case somebody comes across it. – HedgeHog Aug 16 '22 at 13:19