3

I have thousands of HTML files, and I want to extract the section "ITEM 1A. RISK FACTORS" from each of them. None of the files have ids or other identifying attributes, and most of them use a different format: some have the text in "div" tags, others in "p", "table", etc.

Given a specific format, I am able to extract a section of text. For example, here I was able to extract the text of the section ITEM 1A. RISK FACTORS using this piece of code.

should_print = False

for item in soup.find_all("div"):
    if (item.name == "div" and item.parent.name != "div"):
        if "ITEM" in item.text and "1A" in item.text and "RISK" in item.text and "FACTORS" in item.text:
            should_print = True
        elif "ITEM" in item.text and "1B" in item.text:
            break
        if should_print:
            with open(r"RF.html", "a") as f:
                f.write(str(item))

I can write code to cater to each format, but how will I identify which code to run on which file? For example, if I run the code above on a file that contains the text in "p" tags, it would give me rubbish text.

Here and here are some more examples of html files.

Rishab Gupta
  • This is intriguing; I've never seen a question like this. Definitely interesting. I'm sad I can't help you out, though, as I don't know much JS. I have a similar question, so I hope someone gives a good answer. – Mister SirCode Jun 17 '19 at 19:52
  • The question is too generic. Do you know which sections you want to extract? Do you want just the `ITEM 1A. RISK FACTORS` sections? – LMC Jun 17 '19 at 19:56
  • @LuisMuñoz yes, only the section ITEM 1A. RISK FACTORS. I'll edit my question. :) – Rishab Gupta Jun 17 '19 at 20:02
  • 1
    You can try to meta-analyze the texts - if they have lots of text in between P's use that else try DIVs ... – Patrick Artner Jun 17 '19 at 20:03
  • Please add a couple more links to examples. – LMC Jun 17 '19 at 20:05
  • 1
    My suggestion: See [this question](https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) to extract the data and *then* go regex on it – salvatore Jun 17 '19 at 20:11
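
The last suggestion above (extract the text first, then apply a regex) can be sketched as follows. This is only a minimal sketch, not code from the thread: it uses the standard library's html.parser instead of BeautifulSoup to stay dependency-free, and the names TextExtractor, extract_risk_factors, START and END are hypothetical. The two patterns are assumptions about how the headings are written; real filings vary in spacing and punctuation.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, regardless of which tag (div, p, td) holds it."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script> or <style>.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


# Assumed heading patterns; adjust them to the filings at hand.
START = re.compile(r"ITEM\s+1A\.?\s*RISK\s+FACTORS", re.IGNORECASE)
END = re.compile(r"ITEM\s+1B\b", re.IGNORECASE)


def extract_risk_factors(html_text):
    """Return the text from the first 1A heading up to the next 1B heading, or None."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = "\n".join(parser.chunks)

    start = START.search(text)
    if start is None:
        return None
    end = END.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.start():stop].strip()
```

Because the tags are discarded before matching, the same function applies whether the section sits in "div", "p" or "table" markup. One caveat: it takes the first match, so a table of contents that lists "ITEM 1A. RISK FACTORS" would be picked up before the real section and would need extra handling.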

2 Answers

0

You only need to change your if condition. Your code just flips should_print from False to True, but item in the loop still refers to each element returned by soup.find_all("div").

Change the if condition to:

if "ITEM" in item.text and "1A" in item.text and "RISK" in item.text and "FACTORS" in item.text:
    print(item.find('b').text)

Output:

ITEM 1A. RISK FACTORS.

Inside the if statement:

print(item.text) will show all of the text in the matching element.

print(item) will show the full source of every element that contains the strings ITEM, 1A, RISK, and FACTORS.

Omer Tekbiyik
0

A good option would be to look for the section title using XPath, which could provide a generic solution. Below is an example using xmllint in bash; in Python, a library with full XPath 1.0 support such as lxml should do the job (xml.etree.ElementTree implements only a limited XPath subset).

xmllint -html -recover -xpath '//div[descendant-or-self::*[.="ITEM 1A. RISK FACTORS."]]/descendant-or-self::text()' 2>/dev/null 10k.htm

Xpath explained:

  • //div[descendant-or-self::... Get every div that matches, or has a descendant matching, the expression explained below.
  • descendant-or-self::*[.="ITEM 1A. RISK FACTORS."] Find any node whose text is exactly the expected title.
  • descendant-or-self::text() Get the text of all contained elements.

Xpath to detect title using contains(...)

'//div[descendant-or-self::text()[contains(.,"ITEM 1A. RISK FACTORS")]]/descendant-or-self::text()'
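
For reference, the contains(...) expression above could be run from Python roughly like this. This is a sketch that assumes the third-party lxml package is installed; the helper name section_text is hypothetical and not part of any library.

```python
from lxml import html  # third-party package (assumed installed): pip install lxml

# The same XPath as above: every div that contains the title text,
# then all text nodes inside those divs.
XPATH = (
    '//div[descendant-or-self::text()[contains(.,"ITEM 1A. RISK FACTORS")]]'
    '/descendant-or-self::text()'
)


def section_text(html_source):
    """Return the text of all divs that contain the section title."""
    tree = html.fromstring(html_source)
    parts = tree.xpath(XPATH)
    # Drop whitespace-only text nodes and join the rest with single spaces.
    return " ".join(part.strip() for part in parts if part.strip())
```

Like the xmllint command, this returns only the text of divs that contain the title, so a section that continues in sibling elements would need a broader expression.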
LMC