I want to get a list of all the distinct tag names in an HTML document (a list of strings, with no repetition). I tried calling soup.find_all() with no arguments, but this gave me the entire document instead.

Is there a way of doing it?

firelitte

1 Answer

soup.find_all() returns a list of every element in the document, which you can iterate over. So you can do the following:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""  # an html sample
soup = BeautifulSoup(html_doc, 'html.parser')

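# find_all() with no arguments returns every tag nested inside <html>
document = soup.html.find_all()

# seed the list with 'html', since find_all() only returns its descendants
el = ['html']
for n in document:
    # keep only the first occurrence of each tag name
    if n.name not in el:
        el.append(n.name)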

print(el)


The output of the code snippet would be:

['html', 'head', 'title', 'body', 'p', 'b', 'a']
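
If the order matters but the manual `in` check feels clunky, one alternative is `dict.fromkeys`. This is only a minimal sketch, and it assumes Python 3.7+, where dicts preserve insertion order. Calling `find_all(True)` on the soup itself, rather than on `soup.html`, also picks up the `html` tag, so it does not have to be added by hand. The shortened sample markup and the name `tag_names` are just for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, for illustration only.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all(True) matches every tag in the document, including <html> itself.
# dict.fromkeys keeps the first occurrence of each key, in insertion order
# (guaranteed from Python 3.7 onwards).
tag_names = list(dict.fromkeys(tag.name for tag in soup.find_all(True)))

print(tag_names)
```

This avoids the repeated list lookup while still returning the names in document order.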


Edit

As @PM 2Ring pointed out, if you don't care about the order in which the elements are added (and, as he says, it is probably not an issue here), you can use a set instead. Sets are built in, so there is nothing to import. Set comprehensions are available in Python 3.x and in Python 2.7; on anything older, check whether they are supported.

from bs4 import BeautifulSoup

...

el = {x.name for x in document} # use a set comprehension to generate it easily
el.add("html")  # only if you need to
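
Put together, the set version could look like the sketch below. The markup is a shortened sample, and `sorted` is used here only as one way to get a list with a predictable order back out of the unordered set; the name `tag_list` is illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, for illustration only.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a></p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Every tag nested inside <html> (the <html> tag itself is not included).
document = soup.html.find_all()

# A set drops duplicates automatically, but does not remember order.
el = {x.name for x in document}
el.add("html")  # only if you need the root tag as well

# Convert back to a list; sorting alphabetically gives a stable result.
tag_list = sorted(el)

print(tag_list)
```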
  • Simple and clear. Good move. I guess it also works with the header tags, doesn't it? – wonderwhy Jul 08 '16 at 17:08
  • btw, you are missing the html tag in there. – wonderwhy Jul 08 '16 at 17:08
  • @wonderwhy - edited. Now it is included as well. Yes, it also iterates over the `<head>` tag. The output for that would be: `['html', 'head', 'title', 'body', 'p', 'b', 'a']` –  Jul 08 '16 at 17:11
  • Using a set instead of a list for `el` is _much_ more efficient since you don't need to bother doing the `in` test. Of course, a set doesn't preserve order, but that's probably not an issue here. And if the OP really needs a list it's easy enough to convert the set to a list at the end. – PM 2Ring Jul 08 '16 at 17:42
  • any idea why my code is throwing out an attribute error? i tried your code and have posted my attempt in the edit. – firelitte Jul 08 '16 at 17:45
  • @PM2Ring added an improvement. –  Jul 08 '16 at 17:53
  • @firelitte first of all put the URL inside string tags. –  Jul 08 '16 at 17:53
  • I don't use BeautifulSoup, but it looks like you have a typo in that set comp: I think it should be `{x.name for x in document}` – PM 2Ring Jul 08 '16 at 18:35