
My input (the variable is a string):

<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>

My expected output:

{
'href': 'https://wikipedia.org/',
'rel': 'nofollow ugc',
'text': 'wiki',
}

How can I do this in Python without using the BeautifulSoup library?
Please show how to do it with the lxml library.

Sardar
    use `lxml` instead of `beautifulsoup` – furas Aug 16 '22 at 12:51
    you could try to use `regex` but it can be very complex task in some situations so better use `beautifulsoup`, `lxml` or similar modules. – furas Aug 16 '22 at 12:53
  • @Curiouskoala That's right, thanks for helping me get to the answer. – Sardar Aug 16 '22 at 12:53
    @StevenRumbalski I am sorry for your opinion. I tried for almost 2 days but did not get the result and then I asked a question. – Sardar Aug 16 '22 at 13:30
  • @Sardar So where is the code that you tried? You need to include that in your question so that people can understand where you're going wrong. – Steven Rumbalski Aug 16 '22 at 14:20
  • @StevenRumbalski My code could have caused the question text to become cluttered. To ask a clear and short question, it is better if my code is not in the question. This clear and short question made me get the answer quickly. – Sardar Aug 16 '22 at 14:39

3 Answers


Solution with lxml (but without bs!):

from lxml import etree

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
print(root.attrib)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc'}

But there is no text attribute there. You can extract the element's text via the `text` property:

print(root.text)
>>> wiki

Putting it all together:

from lxml import etree

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
root = etree.fromstring(xml)
dict_ = {}
dict_.update(root.attrib)
dict_.update({'text': root.text})
print(dict_)
>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
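
Note that `etree.fromstring` expects well-formed XML; for messier real-world HTML (unquoted attributes, unclosed tags), the `lxml.html` module is more forgiving. A minimal sketch of the same idea, assuming the input is still a single `<a>` fragment:

from lxml import html

fragment = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
# for a fragment with a single element, html.fromstring returns that element
a = html.fromstring(fragment)
result = dict(a.attrib)       # copy the attributes into a plain dict
result['text'] = a.text
print(result)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}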

EDIT

-------parsing [X]HTML with regex is strongly discouraged!-------

Solution with regex:

import re
pattern_text = r"[>](\w+)[<]"
pattern_href = r'href="(\w\S+)"'
pattern_rel = r'rel="([A-Za-z ]+)"'  # [A-Za-z], not [A-z], which also matches the punctuation between Z and a

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
dict_ = {
    'href': re.search(pattern_href, xml).group(1),
    'rel': re.search(pattern_rel, xml).group(1),
    'text': re.search(pattern_text, xml).group(1)
}
print(dict_)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}

This only works because the input is a plain string in exactly this shape; it will break on more general HTML.
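
A slightly more general regex sketch (still only suitable for simple, trusted strings, not real HTML) is to collect every `name="value"` pair and then grab the text between `>` and `<`:

import re

xml = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'
# findall with two groups returns (name, value) tuples for every attribute
dict_ = dict(re.findall(r'(\w+)="([^"]*)"', xml))
text = re.search(r'>([^<]*)<', xml)
dict_['text'] = text.group(1) if text else ''
print(dict_)

>>> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}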

vovakirdan
    Thanks for the reply. There is no way to get text with the help of lxml library? – Sardar Aug 16 '22 at 13:08
    @Sardar you can read official page [parsing html with lxml](https://lxml.de/parsing.html). I took all info from that. – vovakirdan Aug 16 '22 at 13:12
    @Sardar, author of this response, and other people reading this: You **do not** parse HTML with regex, ever: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ Of course, marking this answer as accepted only adds insult to the injury. – Barry the Platipus Aug 16 '22 at 13:16
  • @platipus_on_fire Thank you for your important point. I use the lxml library and I do not parse HTML with regex. – Sardar Aug 16 '22 at 13:22
    You're right! But someone suggested that it could be parsed by re... It is just a try for a **very** special situation. Anyway I wrote a solution by lxml lib. – vovakirdan Aug 16 '22 at 13:23

This is how you do it with lxml:

from lxml import etree

html = '''<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>'''
root = etree.fromstring(html)
attrib_dict = dict(root.attrib)  # copy, so the element's own attributes are not modified
attrib_dict['text'] = root.text
print(attrib_dict)

Result:

{'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}
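
If the string might contain several anchors, one possible extension (a sketch, with a made-up second link) is to parse it as a document and build one dict per `<a>`:

from lxml import html

markup = '<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a><a href="https://example.com/">example</a>'
# document_fromstring wraps the fragment in a full html/body tree
doc = html.document_fromstring(markup)
results = []
for a in doc.findall('.//a'):
    item = dict(a.attrib)
    item['text'] = a.text
    results.append(item)
print(results)

[{'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}, {'href': 'https://example.com/', 'text': 'example'}]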
Barry the Platipus

While using BeautifulSoup, you could use `.attrs` to get a dict of a tag's attributes:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>', 'html.parser')
soup.a.attrs

--> {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc']}

To also get the text:

...
data = soup.a.attrs
data.update({'text':soup.a.text})
print(data)

--> {'href': 'https://wikipedia.org/', 'rel': ['nofollow', 'ugc'], 'text': 'wiki'}
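
If the result should match the question's expected dict exactly, with `rel` as a single string, one sketch is to join the list back together:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://wikipedia.org/" rel="nofollow ugc">wiki</a>', 'html.parser')
data = dict(soup.a.attrs)
# bs4 treats rel as a multi-valued attribute, so join it back into one string
if isinstance(data.get('rel'), list):
    data['rel'] = ' '.join(data['rel'])
data['text'] = soup.a.text
print(data)

--> {'href': 'https://wikipedia.org/', 'rel': 'nofollow ugc', 'text': 'wiki'}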
HedgeHog
    Thanks for the reply. You can tell with the help of lxml library? – Sardar Aug 16 '22 at 13:04
    This is a solution with BeautifulSoup, which the question's author ruled out :) Via bs it would be easy to unpack any lxml constructions by using some magic lambda functions. – vovakirdan Aug 16 '22 at 13:10
    @vovakirdan: You are right, too fast while reading - so just see it as an additional example, in case somebody comes across it. – HedgeHog Aug 16 '22 at 13:19