I need help extracting text with certain features from HTML and putting it into a list. Specifically, I want ALL words that are bold and that have quotes around them, e.g.
"Word"
In HTML, that looks like this:
This is actually a very complicated sentence ("<strong>CS</strong>"), I hope you understand it.
I wish to extract the word 'CS' and put it into a list ['CS'].
This is what I have at the moment. Note that I'm converting a Word document into HTML and extracting the text from the resulting HTML:
import re
import mammoth

with open(r'file path.docx', 'rb') as file:
    # mammoth returns a result object; .value holds the generated HTML string
    html = mammoth.convert_to_html(file).value
result = re.findall('"<strong>(.*?)</strong>"', html)
However, this doesn't return all of the matches I expect.
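One possibility I can't rule out is that some quotes in the Word document are curly (“ ”) or come through as the `&quot;` entity, which the pattern above would not match. Below is a minimal sketch of a more tolerant pattern under that assumption. The name `extract_quoted_bold` is made up for illustration, and whether mammoth actually emits these quote forms is an assumption, not something I have confirmed.

```python
import re

# Assumption: quotes may be straight ("), curly (“ ”), or the &quot; entity.
OPEN_QUOTE = '(?:"|“|&quot;)'
CLOSE_QUOTE = '(?:"|”|&quot;)'

# [^<]+ keeps the match inside a single <strong> element, so it cannot
# run across several tags the way a lazy .*? can.
QUOTED_BOLD = re.compile(
    OPEN_QUOTE + r'\s*<strong>([^<]+)</strong>\s*' + CLOSE_QUOTE
)

def extract_quoted_bold(html):
    """Return the text of every <strong> element wrapped in quotes."""
    return [word.strip() for word in QUOTED_BOLD.findall(html)]

sample = 'complicated sentence ("<strong>CS</strong>"), I hope you understand it.'
print(extract_quoted_bold(sample))
```

This only covers quotes that sit directly outside the `<strong>` tags; quotes placed inside the bold run would need a different pattern.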
Thanks for your help! I know there is a package called BeautifulSoup; if you could show me how it would work in this case, that would be great as well.
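For reference, this is roughly what I imagine a BeautifulSoup version might look like: a rough sketch, not tested against my real document. It assumes the quotes are plain text nodes directly before and after each `<strong>` element; the function name `quoted_bold_words` and the set of quote characters are my own guesses.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumption: straight and curly double quotes are the only ones that matter.
QUOTE_CHARS = ('"', '“', '”')

def quoted_bold_words(html):
    """Collect the text of <strong> tags that have a quote on both sides."""
    soup = BeautifulSoup(html, "html.parser")
    words = []
    for tag in soup.find_all("strong"):
        before, after = tag.previous_sibling, tag.next_sibling
        # Both neighbours must be text nodes, not other tags or None.
        if not (isinstance(before, NavigableString)
                and isinstance(after, NavigableString)):
            continue
        if (before.rstrip().endswith(QUOTE_CHARS)
                and after.lstrip().startswith(QUOTE_CHARS)):
            words.append(tag.get_text(strip=True))
    return words

sample = '<p>A very complicated sentence ("<strong>CS</strong>"), I hope.</p>'
print(quoted_bold_words(sample))
```

Since the parser decodes entities such as `&quot;`, this approach should not need a special case for them, but it would miss a word whose bold run is split across several adjacent `<strong>` tags.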