
Before uploading to my server, I want to check whether I accidentally defined an id two or more times in one of my HTML files:

<!doctype html>

<html lang="en">
<head>
  <meta charset="utf-8">

  <title>The HTML5 Herald</title>
  <meta name="description" content="The HTML5 Herald">
  <meta name="author" content="SitePoint">

  <link rel="stylesheet" href="css/styles.css?v=1.0">

</head>

<body>
  <div id="test"></div>
  <div id="test"></div>
</body>
</html>

The idea is to print an error message if there are duplicates:

"ERROR: The id="test" is not unique."
Sr. Schneider

2 Answers


You can do this by using find_all to gather all elements that have an id attribute, and then collections.Counter to count the ids and pick out those that occur more than once.

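import collections

import bs4

# `html` is assumed to hold the document text, e.g. read from one of your files.
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
soup = bs4.BeautifulSoup(html, 'html.parser')
# Collect the id value of every element that carries an id attribute.
ids = [tag.attrs['id'] for tag in soup.find_all(attrs={'id': True})]
id_counts = collections.Counter(ids)
dups = [key for key, count in id_counts.items() if count > 1]

for d in dups:
    print('ERROR: The id="{}" is not unique.'.format(d))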


>>> ERROR: The id="test" is not unique.
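If installing bs4 is not an option, the same check can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the approach above; the names IdCollector, find_duplicate_ids and sample are hypothetical and only serve the illustration. As a small extra, it records the line on which each id appears, which may help when tracking down the duplicate.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Hypothetical helper: records every id attribute with its line number."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (id_value, line_number) pairs

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line, offset) of the tag currently being handled.
        line, _offset = self.getpos()
        for name, value in attrs:
            if name == "id" and value is not None:
                self.seen.append((value, line))


def find_duplicate_ids(html):
    """Return {id: [line numbers]} for every id that occurs more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(value for value, _line in collector.seen)
    dups = {}
    for value, line in collector.seen:
        if counts[value] > 1:
            dups.setdefault(value, []).append(line)
    return dups


# Made-up sample document, assumed for this sketch only.
sample = (
    '<body>\n'
    '  <div id="test"></div>\n'
    '  <div id="test"></div>\n'
    '  <p id="ok"></p>\n'
    '</body>'
)

for dup_id, lines in find_duplicate_ids(sample).items():
    print('ERROR: The id="{}" is not unique (lines {}).'.format(dup_id, lines))
```

To check a real file, read its contents into a string and pass that string to find_duplicate_ids.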
Wondercricket

You could use a regex to find all ids in the HTML and then search for duplicates.

For example:

import re

html_page = """
<!doctype html>

<html lang="en">
<head>
  <meta charset="utf-8">

  <title>The HTML5 Herald</title>
  <div id="test1"></div>
  <meta name="description" content="The HTML5 Herald">
  <meta name="author" content="SitePoint">

  <link rel="stylesheet" href="css/styles.css?v=1.0">

</head>

<body>
  <div id="test2"></div>
  <div id="test2"></div>
</body>
<div id="test3"></div>
</html>
"""
# Match id="..." attributes that are preceded by whitespace.
# Note: \w+ only matches letters, digits and underscores inside double quotes.
ids_match = re.findall(r'(?<=\s)id="\w+"', html_page)

print(ids_match)  # -> ['id="test1"', 'id="test2"', 'id="test2"', 'id="test3"']
print(len(ids_match))  # -> 4
print(len(set(ids_match)))  # -> 3

# The following prints True if there are duplicates in ids_match.
print(len(ids_match) != len(set(ids_match)))  # -> True
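The check above only says whether some duplicate exists. To print the message asked for in the question, the regex can capture just the id value and the values can be counted. This is a minimal sketch that keeps the same regex assumptions as above (double-quoted ids made of word characters); the sample string is made up for the illustration.

```python
import re
from collections import Counter

# Made-up sample, assumed for this sketch only.
sample = '<div id="a"></div>\n<div id="b"></div>\n<div id="a"></div>'

# The capture group returns only the value between the quotes.
id_values = re.findall(r'(?<=\s)id="(\w+)"', sample)

# Keep the ids that were seen more than once.
duplicates = [value for value, count in Counter(id_values).items() if count > 1]

for value in duplicates:
    print('ERROR: The id="{}" is not unique.'.format(value))
```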
Francesco Pegoraro
  • I do not recommend this. – erip Jan 05 '21 at 18:27
  • And [here is why it's not recommended](https://stackoverflow.com/q/1732348/1040092) – Wondercricket Jan 05 '21 at 18:36
  • Yes @Wondercricket, your implementation is definitely better; you got my upvote. And the thread you posted is a funny one! Anyway, it seems unlikely that OP is going to have problems using regex on HTML, or that the problem at the link is going to come up... – Francesco Pegoraro Jan 06 '21 at 00:03
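To make the concern from the comments concrete, here is a small sketch of two cases where the regex from the second answer and a real HTML parser disagree: an id inside an HTML comment, and an id containing a hyphen. The sample string and the IdCollector class are hypothetical, and the standard library's html.parser stands in for bs4 so the snippet has no dependencies.

```python
import re
from html.parser import HTMLParser

# Made-up sample: one commented-out element and a hyphenated id used twice.
sample = (
    '<!-- <div id="test"></div> -->\n'
    '<div id="test"></div>\n'
    '<div id="main-nav"></div>\n'
    '<div id="main-nav"></div>'
)

# The regex sees the commented-out id and cannot match the hyphenated one.
regex_ids = re.findall(r'(?<=\s)id="(\w+)"', sample)


class IdCollector(HTMLParser):
    """Hypothetical helper: collects id values from real start tags only."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        self.ids.extend(
            value for name, value in attrs if name == "id" and value is not None
        )


collector = IdCollector()
collector.feed(sample)
collector.close()
parser_ids = collector.ids

print(regex_ids)   # ['test', 'test']
print(parser_ids)  # ['test', 'main-nav', 'main-nav']
```

On this input the regex reports a duplicate that does not exist ("test") and misses the one that does ("main-nav"). Whether such cases occur in OP's files is a separate question.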