
I have an HTML file and I want to loop through its content, strip all the attributes from the tags, and display only the tags. For example:

<div class="content"></div>
<div id="content"></div>
<p> test</p>
<h1>tt</h1>

the output should be:

<div></div>
<div></div>
<p> </p>
<h1></h1>

At the moment I can display all tags with all the attributes, but I only want to display the tags without the attributes.

import re

# Read the whole file; the context manager closes it afterwards
with open('myfile.html') as file:
    readtext = file.read()

# Find every tag, attributes included
tags = re.findall(r'<[^>]+>', readtext)
for data in tags:
    print(data)
user11766958

1 Answer


I think the easiest way to do this is to parse the HTML, e.g. with BeautifulSoup. Here is an answer that shows how to solve your problem using that: https://stackoverflow.com/a/9045719/5251061

Also, take a look at this gist: https://gist.github.com/revotu/21d52bd20a073546983985ba3bf55deb

Basically, after parsing your file you can do something like this:

from bs4 import BeautifulSoup

# remove all attributes from every tag in the parsed document
def _remove_all_attrs(soup):
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {}
    return soup
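
To tie this back to the question, here is a minimal end-to-end sketch. It assumes the built-in "html.parser" backend is good enough for your file; the helper name strip_attrs and the sample string are made up for illustration and do not come from the linked answer or gist. It returns both the cleaned markup (text kept) and a list of bare tags (text dropped), since the question asks for only the tags.

```python
from bs4 import BeautifulSoup


def strip_attrs(html):
    """Return (cleaned_html, bare_tags) for an HTML string.

    Hypothetical helper for illustration only.
    """
    # "html.parser" is the stdlib-backed parser; lxml or html5lib
    # could be swapped in if they are installed.
    soup = BeautifulSoup(html, "html.parser")

    # find_all(True) matches every tag in the document
    for tag in soup.find_all(True):
        tag.attrs = {}

    # Build an empty open/close pair per tag, ignoring text content
    bare_tags = [f"<{tag.name}></{tag.name}>" for tag in soup.find_all(True)]
    return str(soup), bare_tags


# Sample input modelled on the question (assumed, not read from a file)
sample = (
    '<div class="content"></div>'
    '<div id="content"></div>'
    '<p> test</p>'
    '<h1>tt</h1>'
)

cleaned, tags = strip_attrs(sample)
print(cleaned)
for t in tags:
    print(t)
```

For a real file, you would read it first (for example with open('myfile.html') as in the question) and pass the resulting string to strip_attrs. Note that this prints one pair per tag in document order, so nested tags come out flattened rather than nested.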
mc51