I want to extract text from 1000s of html files with different formats

Question

I have thousands of HTML files, and I want to extract the section "ITEM 1A. RISK FACTORS" from each of them. None of the files have ids or other identifying attributes, and most of them use a different layout: some keep the text in "div" tags, others in "p", "table", etc.

Given a specific format, I am able to extract a section of text. For example, here I was able to extract the text of the section ITEM 1A. RISK FACTORS using this piece of code.

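from bs4 import BeautifulSoup

# `soup` is assumed to be the already parsed filing,
# e.g. soup = BeautifulSoup(html, "html.parser")
should_print = False

for item in soup.find_all("div"):
    # Only look at top-level divs (skip divs nested inside another div)
    if item.parent.name != "div":
        text = item.text
        if all(word in text for word in ("ITEM", "1A", "RISK", "FACTORS")):
            should_print = True  # start of the section
        elif "ITEM" in text and "1B" in text:
            break  # the next section has been reached
        if should_print:
            with open(r"RF.html", "a") as f:
                f.write(str(item))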

I can write code to cater to each format, but how do I identify which code to run on which file? For example, if I run the code above on a file that keeps the text in "p" tags, it gives me rubbish text.

Here and here are some more examples of html files.

This is intriguing; I've never seen a question like this. Definitely interesting. I'm sad I can't help you out, though, as I don't know much JS. I have a similar question, so I hope someone gives a good answer. — Mister SirCode, Jun 17 '19 at 19:52
The question is too generic. Do you know which sections you want to extract? Do you want just the `ITEM 1A. RISK FACTORS` sections? — LMC, Jun 17 '19 at 19:56
@LuisMuñoz yes, only the section ITEM 1A. RISK FACTORS. I'll edit my question. :) — Rishab Gupta, Jun 17 '19 at 20:02
You can try to meta-analyze the texts: if a file has lots of text between `p` tags, use those; otherwise try `div`s ... — Patrick Artner, Jun 17 '19 at 20:03
My suggestion: See [this question](https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) to extract the data and *then* go regex on it — salvatore, Jun 17 '19 at 20:11
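
The idea from the last comment (extract the plain text first, then apply a regex) could look roughly like the sketch below. It uses only the standard library, so it does not care whether the text sits in "div", "p" or "table" tags. The names `TextExtractor` and `extract_risk_factors` are hypothetical, and the start and end patterns are assumptions about how the headings are spelled; real filings may need looser patterns.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, regardless of which tag contains it."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


# Assumed heading spellings; \s+ also matches non-breaking spaces and newlines.
START = re.compile(r"ITEM\s+1A\.?\s+RISK\s+FACTORS", re.IGNORECASE)
END = re.compile(r"ITEM\s+1B\b", re.IGNORECASE)


def extract_risk_factors(html_text):
    """Return the text from the 1A heading up to the 1B heading, or None."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = "\n".join(parser.chunks)
    start = START.search(text)
    if start is None:
        return None
    end = END.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()


# Two made-up layouts for the same section
div_style = ("<div><b>ITEM 1A. RISK FACTORS</b></div>"
             "<div>Risk text.</div><div>ITEM 1B. UNRESOLVED</div>")
p_style = ("<p>Item&nbsp;1A. Risk Factors</p>"
           "<table><tr><td>Risk text.</td></tr></table><p>Item 1B.</p>")

print(extract_risk_factors(div_style))
print(extract_risk_factors(p_style))
```

One caveat: if a file has a table of contents that repeats the heading, the first match may be the table-of-contents entry rather than the section itself, so the matching rule would need adjusting for such files.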

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

You only need to change your if condition. At the moment it just flips should_print from False to True, while item in the loop still refers to each element returned by soup.find_all("div").

Change the if condition to:

if "ITEM" in item.text and "1A" in item.text and "RISK" in item.text and "FACTORS" in item.text:
    print(item.find('b').text)

Output:

ITEM 1A. RISK FACTORS.

Inside the if statement:

print(item.text) will show all the text of the matching div

print(item) will show the full HTML source of each div that contains the strings ITEM, 1A, RISK and FACTORS
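
To make the difference between the three calls concrete, here is a minimal sketch on a made-up one-div snippet (the `SAMPLE` markup is an assumption, not taken from the real filings). Note that `item.find('b')` returns None when the div has no "b" tag, so real files would need a guard.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Made-up markup standing in for one div of a filing
SAMPLE = "<div><b>ITEM 1A. RISK FACTORS.</b><p>Risk text.</p></div>"

soup = BeautifulSoup(SAMPLE, "html.parser")
item = soup.find("div")

heading = item.find("b").text  # only the text of the first <b> inside the div
all_text = item.text           # all text in the div, tags stripped
markup = str(item)             # the full HTML source of the div

print(heading)
print(all_text)
print(markup)
```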

LMC · Answer 2 · 2019-06-17T20:30:03.880

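A good option is to look for the section title using XPath, which could provide a generic solution. Below is an example using xmllint in bash; in Python, a library with full XPath 1.0 support such as lxml should do the same job (xml.etree.ElementTree only supports a limited XPath subset, without axes such as descendant-or-self or functions such as contains()).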

xmllint -html -recover -xpath '//div[descendant-or-self::*[.="ITEM 1A. RISK FACTORS."]]/descendant-or-self::text()' 2>/dev/null 10k.htm

XPath explained:

//div[descendant-or-self::... Select every div that is, or has a descendant that is, a node matching the expression (explained below).
descendant-or-self::*[.="ITEM 1A. RISK FACTORS."] Match any element whose text is exactly the expected title.
descendant-or-self::text() Get the text nodes of all elements contained in the selected div.

XPath to detect the title using contains(...), which also matches when the title is only part of the text:

'//div[descendant-or-self::text()[contains(.,"ITEM 1A. RISK FACTORS")]]/descendant-or-self::text()'
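
For reference, the same contains(...) expression can be run from Python. This is a minimal sketch that assumes the third-party lxml package is installed; the function name `extract_section` and the sample markup are hypothetical. The title is passed as an XPath variable (`$title`) so it does not need to be quoted inside the expression.

```python
from lxml import html  # third-party: pip install lxml

# Made-up markup standing in for a filing
SAMPLE = (
    "<html><body>"
    "<div><p><b>ITEM 1A. RISK FACTORS.</b></p>"
    "<p>Our business is subject to risks.</p></div>"
    "<div><p><b>ITEM 1B. UNRESOLVED STAFF COMMENTS.</b></p><p>None.</p></div>"
    "</body></html>"
)


def extract_section(html_text, title="ITEM 1A. RISK FACTORS"):
    """Return the text nodes of every div that contains the title."""
    tree = html.fromstring(html_text)
    parts = tree.xpath(
        "//div[descendant-or-self::text()[contains(., $title)]]"
        "/descendant-or-self::text()",
        title=title,
    )
    # Drop whitespace-only text nodes
    return [part.strip() for part in parts if part.strip()]


print(extract_section(SAMPLE))
```

Because the expression starts at "div", it only helps for files where the title sits inside a div; for other layouts the element name would have to change (for example to "*"), which may select much more text than intended.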
