I've got over 1000 html files which have different formatting, elements and contents. I need to recursively go through each and select all elements except the <h1>
element.
Here is a sample file (note that this is the smallest and simplest of the files, the remainder are substantially larger and more complex with many different elements which do not conform to any single template, other than the beginning with the <h1>
element):
<h1>CXR Introduction</h1>
<h2>Basic Principles</h2>
<ul>
<li>Note differences in density.</li>
<li>Identify the site of the pathology by noting silhouettes.</li>
<li>If you can’t see lung vessels, then the pathology must be within the lung.</li>
<li>Loss of the ability to see lung vessels is supplanted by the ability to see air-bronchograms.</li>
</ul>
<p><a href="./A-CXR-TERMINOLOGY-2301158c-efe4-456e-9e0b-5747c5f3e1ce.md">A. CXR-TERMINOLOGY</a></p>
<p><a href="./B-SOME-RADIOLOGICAL-PATHOLOGY-2610a46c-44ca-4f81-a496-9ea3b911cb4e.md">B. SOME RADIOLOGICAL PATHOLOGY</a></p>
<p><a href="./C-Approach-to-common-clinical-scenarios-0e8f5c90-b14b-48d4-8484-0b0f8ca4464c.md">C. Approach to common clinical scenarios</a></p>
I wrote this code using beautifulsoup:
with open("file.htm") as ip:
#HTML parsing done using the "html.parser".
soup = BeautifulSoup(ip, "html.parser")
selection = soup.select("h1 > ")
print(selection)
I was hoping that this will select everything below the <h1>
element, however it does not. Using soup.select("h1")
only selects one line and doesn't select everything below it. What do I do?