Soup not locating proper div tag when searched by text

Question

This is a condensed version of the actual HTML, which has many more tags.

html = '''

<div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">

    <font style="font-family:inherit;font-size:10pt;">
        Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
    </font>

    <font style="font-family:Wingdings;font-size:10pt;">
        ¨
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        No
    </font>

    <font style="font-family:Wingdings;font-size:10pt;">
        x
    </font>

</div>

<div style="line-height:120%;padding-top:12px;text-align:left;font-size:10pt;">

    <font style="font-family:inherit;font-size:10pt;">
        There were
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        33,012,179
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        shares of common stock, $.01 par value per share, outstanding at
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        July&nbsp;26, 2017
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        .
    </font>

</div>

'''

I am attempting to locate a tag based on its text. The text is matched by a regex, and all of it is located within a single div tag.

month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'

word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'


pattern = word_pattern + '.*' + month_pattern

The above regex is slightly complicated, but it works when I test it on its own against the text within the div.

With the soup code below, I expect to get back a soup object whose parent is the first div tag; however, I am getting an empty list.

soup = bs(html, 'html.parser')

elem =  soup(text=re.compile(pattern, flags = re.IGNORECASE|re.DOTALL))
print(elem)

results in

[]

I suspect the problem is that the div's text is nested further inside <font> tags. However, if I execute div.text, all of the text is printed out, so I'm not sure why I am not getting any hits.

'''There were
    

        33,012,179
    

        shares of common stock, $.01 par value per share, outstanding at
    

        July 26, 2017
    

        .

        '''

Once again, the regex itself is not the problem: using the re module directly, I have:

print(re.search(pattern,text, flags = re.IGNORECASE|re.DOTALL))

with result:

<_sre.SRE_Match object; span=(0, 142), match='There were\n    \n\n        33,012,179\n    \n\n >

Expected Result:

I am expecting elem to be a non-empty list, so that if I run elem.parent, as shown in the accepted answer here,

Using BeautifulSoup to find a HTML tag that contains certain text

I will be able to extract the first div tag with its inner html as follows:

  <div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">
    
        <font style="font-family:inherit;font-size:10pt;">
            Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
        </font>
    
        <font style="font-family:Wingdings;font-size:10pt;">
            ¨
        </font>
    
        <font style="font-family:inherit;font-size:10pt;">
            No
        </font>
    
        <font style="font-family:Wingdings;font-size:10pt;">
            x
        </font>
    
    </div>

However, I am getting back an empty list, so there is nothing to call elem.parent on when I iterate.

Thank you.

Here is the full code for easy c&p:

#testing_html


from bs4 import BeautifulSoup as bs
import re
import os


html = '''

<div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">

    <font style="font-family:inherit;font-size:10pt;">
        Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
    </font>

    <font style="font-family:Wingdings;font-size:10pt;">
        ¨
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        No
    </font>

    <font style="font-family:Wingdings;font-size:10pt;">
        x
    </font>

</div>

<div style="line-height:120%;padding-top:12px;text-align:left;font-size:10pt;">

    <font style="font-family:inherit;font-size:10pt;">
        There were
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        33,012,179
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        shares of common stock, $.01 par value per share, outstanding at
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        July&nbsp;26, 2017
    </font>

    <font style="font-family:inherit;font-size:10pt;">
        .
    </font>

</div>

'''

text = '''There were
    

        33,012,179
    

        shares of common stock, $.01 par value per share, outstanding at
    

        July 26, 2017
    

        .

        '''


month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'

word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'


pattern = word_pattern + '.*' + month_pattern

soup = bs(html, 'html.parser')

elem =  soup(text=re.compile(pattern, flags = re.IGNORECASE|re.DOTALL))
print(elem)

print(re.search(pattern,text, flags = re.IGNORECASE|re.DOTALL))

It seems you are trying to parse an EDGAR filing, but otherwise your question is unclear. Given the sample html in the question, what exactly is your expected output? — Jack Fleeting, Oct 01 '20 at 17:23
Hi Jack. `elem` should *not* be an empty list. Instead `soup` should be able to capture the tag with text that is matched by the regex. So it's this line: `elem = soup(text=re.compile(pattern, flags = re.IGNORECASE|re.DOTALL))` print(elem) — MasayoMusic, Oct 01 '20 at 17:39
I'm afraid it doesn't answer the question. Given your sample html, if you did `print(elem)`, what would you expect the output to be? — Jack Fleeting, Oct 01 '20 at 18:46
@JackFleeting I've updated my OP (check the expected result portion). Basically it should return a non-null value. I am not exactly sure what text it will return, but it should be non-null so I can use `elem.parent` to get back any tags which fit the regex criteria. — MasayoMusic, Oct 01 '20 at 19:38
Let's try it differently - you have two `<div>` elements in your sample html. Are you trying to get the second one based on its text and then find its parent? If that's the case, part of the problem is that there is no parent for that element in your sample html. — Jack Fleeting, Oct 12 '20 at 19:57
No, the code locates the "text" as soup object and then calling `parent` gives you access to the `div` element containing the text. Basically, I don't know which element will contain the text, so I use soup to search by "text" or regex pattern in my case. Once it gets a hit, I can call `parent` method to get access to the parent tag of the text, which is `div`. I have a stackoverflow link within my OP, that shows this usage, as that's what I used to figure out how to search html by text to locate proper tag. — MasayoMusic, Oct 13 '20 at 20:33

score 2 · Accepted Answer · answered Oct 13 '20 at 22:27


I think I understand the problem now...

One issue is that your final regex, pattern = word_pattern + '.*' + month_pattern, cannot find the target text because that text is spread across several <font> nodes, so no single node contains the full pattern. In this case, the text that matters is split between two nodes. Both of these nodes share a common grandparent, the <div> in question, which you can reach by calling parent twice.
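
To make that concrete, here is a minimal sketch using a shortened copy of the sample html and a simplified stand-in for the combined pattern (both are assumptions made for brevity, not the original data). It uses `string=`, which is the newer spelling of `text=` in recent bs4 releases. It suggests that the filter is applied to one string node at a time, while the div's concatenated text is what actually contains the whole phrase:

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the second <div> from the question (assumption: trimmed for brevity).
html = """
<div>
    <font>There were</font>
    <font>33,012,179</font>
    <font>shares of common stock, outstanding at</font>
    <font>July&nbsp;26, 2017</font>
</div>
"""

# Simplified stand-in for word_pattern + '.*' + month_pattern (an assumption).
pattern = re.compile(r"common stock.*July\s?\d{1,2},\s?\d{4}", re.IGNORECASE | re.DOTALL)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# Every string node is tested on its own, so none of them contains the full phrase.
node_hits = [s for s in div.find_all(string=True) if pattern.search(s)]

# This is the same kind of search as in the question, and it comes back empty too.
direct_hits = soup.find_all(string=pattern)

# The concatenated text of the <div> does contain the full phrase.
div_hit = pattern.search(div.get_text()) is not None

print(node_hits, direct_hits, div_hit)
```

The words and the date live in different `<font>` nodes, which is why the first two searches find nothing even though the regex is fine.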

This can be resolved with something along these lines:

elem_m =  soup(text=re.compile(month_pattern))
elem_w =  soup(text=re.compile(word_pattern))

if elem_m[0].parent.parent==elem_w[0].parent.parent:
    print((elem_m[0].parent.parent).text.strip())
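
Another option, sketched here under the assumption that the target text always sits inside a `<div>`, is to pass a function to `find_all` and test the combined pattern against each div's full text rather than against single string nodes. The helper name `div_text_matches` is made up for this illustration, and the html is a trimmed copy of the sample:

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample html from the question (assumption: shortened for brevity).
html = """
<div><font>Indicate by checkmark whether the registrant is a shell company. Yes</font></div>
<div>
    <font>There were</font>
    <font>33,012,179</font>
    <font>shares of common stock, $.01 par value per share, outstanding at</font>
    <font>July&nbsp;26, 2017</font>
    <font>.</font>
</div>
"""

month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'
word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'
regex = re.compile(word_pattern + '.*' + month_pattern, flags=re.IGNORECASE | re.DOTALL)

soup = BeautifulSoup(html, "html.parser")


def div_text_matches(tag):
    # Hypothetical helper: accept only <div> tags whose combined text matches the regex.
    return tag.name == "div" and regex.search(tag.get_text()) is not None


hits = soup.find_all(div_text_matches)

# Collapse runs of whitespace (including the non-breaking space) for readable output.
cleaned = [" ".join(tag.get_text().split()) for tag in hits]
print(cleaned)
```

This returns the tag itself, so no `.parent` hops are needed, but it only helps if you know which tag name to look in.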

More fundamentally, if you search around you'll see that using regex on HTML/XML is widely discouraged. To avoid it, I would do something like this:

key_words = ['common','shares','stock,',"outstanding"]
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] 

for s in soup.select('*'):   
    words = all(word in s.text  for word in key_words)
    month = any(month in s.text  for month in months)
    if words == True and month == True:
        print(s.text.strip())

The output, in both cases, is:

There were
   

   33,012,179
   

   shares of common stock, $.01 par value per share, outstanding at
   

   July 26, 2017
   
        .
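
One caveat with the keyword loop: on a complete document, every ancestor of the target tag (`html`, `body`, and so on) contains the same words, so it will match too. A rough sketch of one way to narrow this down is to keep only the innermost matching tags. The wrapper tags in this html and the helper name `matches` are assumptions for illustration, and the nested scan is quadratic, so it may be slow on a large filing:

```python
from bs4 import BeautifulSoup

# Sample html wrapped in <html>/<body> to mimic a full document (an assumption).
html = """
<html><body>
<div><font>Indicate by checkmark whether the registrant is a shell company. Yes</font></div>
<div>
    <font>There were</font>
    <font>33,012,179</font>
    <font>shares of common stock, $.01 par value per share, outstanding at</font>
    <font>July&nbsp;26, 2017</font>
</div>
</body></html>
"""

key_words = ["common", "shares", "stock", "outstanding"]
months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def matches(tag):
    # Hypothetical helper: all key words and at least one month name appear in the tag's text.
    text = tag.get_text()
    return all(word in text for word in key_words) and any(m in text for m in months)


soup = BeautifulSoup(html, "html.parser")
candidates = [tag for tag in soup.find_all(True) if matches(tag)]

# Compare by identity: bs4 tags with identical markup compare equal, which could mislead here.
candidate_ids = {id(tag) for tag in candidates}

# Keep a candidate only if none of its descendant tags is itself a candidate.
innermost = [
    tag for tag in candidates
    if not any(id(child) in candidate_ids for child in tag.find_all(True))
]

print([tag.name for tag in candidates])
print([tag.name for tag in innermost])
```

Here the ancestors are dropped and only the `<div>` that directly wraps the `<font>` pieces is kept.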

Good luck parsing EDGAR filings; not the most fun activity I can think of...

answered Oct 13 '20 at 22:27

Jack Fleeting

24,385
6
23
45

Yes, the text being spread among multiple tags seems to have been the problem. I was under the impression that Soup would loop over parent tags as well as child tags, since `soup.find(div.text)` was returning the entire text. I will need to test out your logic and see if there are any edge cases. Thank you so much for your patience and help. – MasayoMusic Oct 14 '20 at 02:29

Can _if words == True and month == True_ not just be _if words and month:_ – QHarr Oct 16 '20 at 02:29
@QHarr - Brilliant! It works like that as well! For some reason, it never occurred to me... – Jack Fleeting Oct 16 '20 at 10:22
Okay. Just got around to checking it. The problem now seems to be the opposite: I am getting 100+ hits per SEC filing. It's including tags like `html`, `body` and `type`, `filename`, `description` etc., which can be removed. But I am also getting lots of `p` tags for one filing, with the true tag being somewhere in the middle of all of the `p` tags, making it hard to pick the right one. Let me see if I can link you to a filing. – MasayoMusic Oct 17 '20 at 22:56
Here is one of the filings: https://www.sec.gov/ix?doc=/Archives/edgar/data/6201/000000620120000089/q2202010-q063020.htm I've downloaded it using requests. If I run your code on it, I get too many hits. – MasayoMusic Oct 17 '20 at 23:01
@MasayoMusic You are dealing with EDGAR filings - no two filings are formatted the same, even if filed by the same issuer, even if filed in the same fiscal quarter. Filing edgarization seems almost random, which is why scraping them is so frustrating. You will have to modify the script for almost each and every filing. Unfortunately, there's no way around that, AFAIK. – Jack Fleeting Oct 18 '20 at 01:37
Thanks. I think I will probably need to stack multiple soup extraction methods to make sure I catch the edge cases. – MasayoMusic Oct 19 '20 at 01:08
