This is a condensed version of the actual HTML, which has many more tags.
html = '''
<div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;">
Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
</font>
<font style="font-family:Wingdings;font-size:10pt;">
¨
</font>
<font style="font-family:inherit;font-size:10pt;">
No
</font>
<font style="font-family:Wingdings;font-size:10pt;">
x
</font>
</div>
<div style="line-height:120%;padding-top:12px;text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;">
There were
</font>
<font style="font-family:inherit;font-size:10pt;">
33,012,179
</font>
<font style="font-family:inherit;font-size:10pt;">
shares of common stock, $.01 par value per share, outstanding at
</font>
<font style="font-family:inherit;font-size:10pt;">
July 26, 2017
</font>
<font style="font-family:inherit;font-size:10pt;">
.
</font>
</div>
'''
I am attempting to locate a tag based on its text. The text is matched by a regex and is located entirely within a div tag.
month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'
word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'
pattern = word_pattern + '.*' + month_pattern
The above regex is slightly complicated, but it works when I test it stand-alone on the text within the div.
With the soup code below, I expect to get back a soup object whose parent is the first div tag; however, I am getting an empty list.
soup = bs(html, 'html.parser')
elem = soup(text=re.compile(pattern, flags = re.IGNORECASE|re.DOTALL))
print(elem)
results in
[]
I suspect the problem is that the div's text is nested further inside <font> tags. However, if I execute div.text, all of the text is printed out, so I'm not sure why I am not getting any hits:
'''There were
33,012,179
shares of common stock, $.01 par value per share, outstanding at
July 26, 2017
.
'''
Once again, the regex itself is not the problem, because with the re module alone I have:
print(re.search(pattern,text, flags = re.IGNORECASE|re.DOTALL))
with result:
<_sre.SRE_Match object; span=(0, 142), match='There were\n \n\n 33,012,179\n \n\n >
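One way to check the suspicion about the nested <font> tags is to run the same pattern against each text node separately and then against the div's combined text. The sketch below is only a diagnostic under some assumptions: the HTML is a stripped-down copy of the second div (style attributes removed), and the names node_hits and whole_hit are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Stripped-down copy of the second div (style attributes removed for brevity).
html = '''
<div>
<font>There were</font>
<font>33,012,179</font>
<font>shares of common stock, $.01 par value per share, outstanding at</font>
<font>July 26, 2017</font>
<font>.</font>
</div>
'''

# Patterns copied from the question.
month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'
word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'
regex = re.compile(word_pattern + '.*' + month_pattern,
                   flags=re.IGNORECASE | re.DOTALL)

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# Test every individual text node under the div on its own.
# (node_hits is a hypothetical name used only in this sketch.)
node_hits = [regex.search(s) is not None for s in div.strings]

# Test the concatenated text of the div, which is what div.text returns.
whole_hit = regex.search(div.get_text()) is not None

print(any(node_hits))  # does any single text node match?
print(whole_hit)       # does the combined text match?
```

If no single node matches while the combined text does, that would be consistent with the text filter being applied to one string at a time rather than to the div's joined text.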
Expected Result:
I am expecting elem to be a non-empty list, so that if I run elem.parent, as shown in the accepted answer to
Using BeautifulSoup to find a HTML tag that contains certain text
I will be able to extract the first div tag with its inner HTML, as follows:
<div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;">
Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
</font>
<font style="font-family:Wingdings;font-size:10pt;">
¨
</font>
<font style="font-family:inherit;font-size:10pt;">
No
</font>
<font style="font-family:Wingdings;font-size:10pt;">
x
</font>
</div>
However, I am getting back an empty list, so elem.parent returns nothing when I iterate.
Thank you.
Here is the full code for easy copy and paste:
#testing_html
from bs4 import BeautifulSoup as bs
import re
import os
html = '''
<div style="line-height:120%;padding-top:12px;text-align:left;text-indent:24px;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;">
Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes
</font>
<font style="font-family:Wingdings;font-size:10pt;">
¨
</font>
<font style="font-family:inherit;font-size:10pt;">
No
</font>
<font style="font-family:Wingdings;font-size:10pt;">
x
</font>
</div>
<div style="line-height:120%;padding-top:12px;text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;">
There were
</font>
<font style="font-family:inherit;font-size:10pt;">
33,012,179
</font>
<font style="font-family:inherit;font-size:10pt;">
shares of common stock, $.01 par value per share, outstanding at
</font>
<font style="font-family:inherit;font-size:10pt;">
July 26, 2017
</font>
<font style="font-family:inherit;font-size:10pt;">
.
</font>
</div>
'''
text = '''There were
33,012,179
shares of common stock, $.01 par value per share, outstanding at
July 26, 2017
.
'''
month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'
word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'
pattern = word_pattern + '.*' + month_pattern
soup = bs(html, 'html.parser')
elem = soup(text=re.compile(pattern, flags = re.IGNORECASE|re.DOTALL))
print(elem)
print(re.search(pattern,text, flags = re.IGNORECASE|re.DOTALL))
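For comparison, here is a minimal sketch of an alternative lookup that matches the regex against each div's combined text instead of against individual strings. It relies on find_all accepting a function as a filter. The helper name find_divs_by_text is hypothetical, the HTML is a shortened copy of the sample above, and the final step (taking the preceding div) is only an assumption about what the expected output is meant to be.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML (style attributes removed).
html = '''
<div>
<font>Indicate by checkmark whether the registrant is a shell company (as defined in Rule 12b-2 of the Exchange Act). Yes</font>
<font>No</font>
</div>
<div>
<font>There were</font>
<font>33,012,179</font>
<font>shares of common stock, $.01 par value per share, outstanding at</font>
<font>July 26, 2017</font>
<font>.</font>
</div>
'''

# Patterns copied from the question.
month_pattern = r'((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\s?(\d{1,2}\D?)\s?(19[7-9]\d|20\d{2}|\d{2}))'
word_pattern = r'(?=.*common)(?=.*outstanding[.,]?)(?=.*shares[.,]?)(?=.*stock[.,]?)'
regex = re.compile(word_pattern + '.*' + month_pattern,
                   flags=re.IGNORECASE | re.DOTALL)


def find_divs_by_text(soup, regex):
    """Hypothetical helper: return <div> tags whose combined text matches regex."""
    return soup.find_all(
        lambda tag: tag.name == 'div' and regex.search(tag.get_text()) is not None
    )


soup = BeautifulSoup(html, 'html.parser')
matches = find_divs_by_text(soup, regex)

# The div that actually contains the share-count sentence.
target = matches[0]

# Assumption: if the div wanted is the one just before the match,
# find_previous_sibling is one way to reach it.
previous = target.find_previous_sibling('div')

print(target.get_text(strip=True))
print(previous.get_text(strip=True))
```

Note that with nested divs this filter would also return every enclosing div, since an outer div's combined text includes the inner one's; the sample here has no nesting, so only one div is returned.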