Using Python beautifulsoup to select everything except a specific tag

Question

I've got over 1000 HTML files with different formatting, elements, and contents. I need to go through each one recursively and select all elements except the <h1> element.

Here is a sample file (note that this is the smallest and simplest of the files; the rest are substantially larger and more complex, with many different elements that do not conform to any single template, other than beginning with the <h1> element):

<h1>CXR Introduction</h1>
<h2>Basic Principles</h2>

<ul>
<li>Note differences in density.</li>
<li>Identify the site of the pathology by noting silhouettes.</li>
<li>If you can’t see lung vessels, then the pathology must be within the lung.</li>
<li>Loss of the ability to see lung vessels is supplanted by the ability to see air-bronchograms.</li>
</ul>

<p><a href="./A-CXR-TERMINOLOGY-2301158c-efe4-456e-9e0b-5747c5f3e1ce.md">A. CXR-TERMINOLOGY</a></p>
<p><a href="./B-SOME-RADIOLOGICAL-PATHOLOGY-2610a46c-44ca-4f81-a496-9ea3b911cb4e.md">B. SOME RADIOLOGICAL PATHOLOGY</a></p>
<p><a href="./C-Approach-to-common-clinical-scenarios-0e8f5c90-b14b-48d4-8484-0b0f8ca4464c.md">C. Approach to common clinical scenarios</a></p>

I wrote this code using beautifulsoup:

with open("file.htm") as ip:
    #HTML parsing done using the "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    selection = soup.select("h1 > ")
print(selection)

I was hoping that this would select everything below the <h1> element, but it does not. Using soup.select("h1") selects only the <h1> line itself and not everything below it. What should I do?

score 2 · Accepted Answer · answered Nov 17 '18 at 07:23

Use .extract() to remove the selected tag; whatever is left in soup is everything else:

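from bs4 import BeautifulSoup

with open("file.htm") as ip:
    # Parse the HTML with the built-in "html.parser".
    soup = BeautifulSoup(ip, "html.parser")
    # extract() detaches the first <h1> from the tree and returns it.
    soup.h1.extract()

# soup now holds everything except that <h1>.
print(soup)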
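To see what .extract() does without needing a file on disk, here is a minimal sketch on an inline string. The helper name split_off_h1 and the shortened sample markup are my own, used only for illustration; the guard for a missing <h1> is an assumption, since the question says every file starts with one.

```python
from bs4 import BeautifulSoup


def split_off_h1(html):
    """Return (removed_h1_or_None, remaining_html) for an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    # extract() detaches the tag from the tree and returns it;
    # skip the call if the document has no <h1> at all.
    removed = h1.extract() if h1 is not None else None
    return removed, str(soup)


sample = (
    "<h1>CXR Introduction</h1>"
    "<h2>Basic Principles</h2>"
    "<p>Note differences in density.</p>"
)
removed, rest = split_off_h1(sample)
print(removed)  # the detached <h1> tag
print(rest)     # everything that was not the <h1>
```

Because .extract() hands back the removed tag, you can still read the heading text (for example as a title) after stripping it from the body.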

ewwink


score 0 · Answer 2 · answered Nov 17 '18 at 07:22

Have you considered removing the <h1>...</h1> element using .decompose() and then just getting all the rest?
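A rough sketch of that idea, applied recursively to a folder of files as the question describes. The function names without_h1 and process_tree, the "*.htm" glob pattern, and the UTF-8 encoding are all assumptions on my part; adjust them to match your files. Note that this version removes every <h1> in a file, not only the first.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def without_h1(html):
    """Return the HTML string with every <h1> element removed."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("h1"):
        # decompose() removes the tag from the tree and destroys it;
        # unlike extract(), it does not return the removed tag.
        tag.decompose()
    return str(soup)


def process_tree(root):
    """Map each .htm file under root (searched recursively) to its HTML minus <h1>."""
    results = {}
    for path in sorted(Path(root).rglob("*.htm")):
        results[path] = without_h1(path.read_text(encoding="utf-8"))
    return results
```

Calling process_tree("some_folder") (a placeholder path) would give you a dict keyed by file path, which you can then write back out or process further.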

Khalt
