I have thousands of HTML files, and I want to extract the section "ITEM 1A. RISK FACTORS" from each of them. None of the files have ids or other identifying attributes, and most of them use a different format: some have the text in "div" tags, others have it in "p", "table", etc.
Given a specific format, I am able to extract a section of text. For example, here I was able to extract the text from the section ITEM 1A. RISK FACTORS using this piece of code:
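# soup is the BeautifulSoup object for one parsed file
should_print = False
for item in soup.find_all("div"):
    # only look at top-level divs, not divs nested inside another div
    if (item.name == "div" and item.parent.name != "div"):
        if "ITEM" in item.text and "1A" in item.text and "RISK" in item.text and "FACTORS" in item.text:
            # section heading found: start writing from this div onwards
            should_print = True
        elif "ITEM" in item.text and "1B" in item.text:
            # next section reached: stop
            break
        if should_print:
            with open(r"RF.html", "a") as f:
                f.write(str(item))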
I can write code to cater to all the formats, but how will I identify which code to run on which file? For example, if I run the code above on a file that contains the text in "p" tags, it gives me rubbish text.
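One possible way to decide which extractor to run is to first detect which tag encloses the section heading in each file. Below is a minimal sketch of that idea, not a tested solution for real filings. It uses only the standard library's `html.parser` so it runs without extra installs; the same idea should carry over to BeautifulSoup. The names `HEADING`, `BLOCK_TAGS`, `HeadingTagFinder` and `detect_format` are hypothetical, and the heading pattern is an assumption about how the heading is written. It also assumes the heading text sits in a single text node and that the first match is the real heading, which may not hold if a table of contents mentions it earlier.

```python
import re
from html.parser import HTMLParser

# Assumed heading pattern; real files may differ in spacing or punctuation.
HEADING = re.compile(r"ITEM\s*1A\.?\s*RISK\s+FACTORS", re.IGNORECASE)
# Container tags that a format-specific extractor exists for (assumption).
BLOCK_TAGS = {"div", "p", "table"}
# Tags that never get a closing tag, so they are not tracked.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class HeadingTagFinder(HTMLParser):
    """Record the container tag that encloses each match of HEADING."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.found = []  # enclosing container tag for every heading match

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if HEADING.search(data):
            # Walk outwards to the innermost tag we have an extractor for.
            for tag in reversed(self.stack):
                if tag in BLOCK_TAGS:
                    self.found.append(tag)
                    break


def detect_format(html):
    """Return "div", "p" or "table" for the first heading match, else None."""
    finder = HeadingTagFinder()
    finder.feed(html)
    finder.close()
    return finder.found[0] if finder.found else None
```

The returned tag name could then be used as a key into a dictionary of extractor functions, one per format, with `None` marking files that need manual inspection.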