
I need help extracting text with certain features from HTML and putting it into a list. Specifically, I want ALL words that are bold and have quotes around them, e.g.

"Word"

In HTML that looks like:

This is actually a very complicated sentence ("<strong>CS</strong>"), I hope you understand it.

I wish to extract the word 'CS' and put it into a list ['CS'].

This is what I have at the moment. Note that I'm converting a Word document into HTML and extracting text from the resulting HTML:

import re
import mammoth

with open(r'file path.docx', 'rb') as file:
    html = mammoth.convert_to_html(file).value
    result = re.findall(r'&quot;<strong>(.*?)</strong>&quot;', html)

However, this doesn't yield all the results that I want.

Thanks for your help! I know there is a package called BeautifulSoup; if you could explain how it would work in this case, that would be great as well.

Dazz W
  • Can you provide an example of a string where it doesn't find the desired pattern (since @CC7052 answer looks good)? – DarrylG Jan 27 '20 at 14:42
  • For instance, in this sentence none of the words are identified: I have been instructed by Business Advisory Services LLP (“BAS”), before the High Court of Justice – (the “Court”). – Dazz W Jan 27 '20 at 14:51
  • The duplicate question does not answer this question. The user is already using a regex to find a pattern. The real problem is that the quote style in the user's pattern differs from the quote style in the string. The string uses the curly quotes in `quote_list = ['“', '”']`, which are Unicode code points 8220 and 8221 (U+201C and U+201D), while the pattern uses the straight quote `"`, which is code point 34 (U+0022). This can be fixed by replacing the quotes from quote_list in the string with `"`. – DarrylG Jan 27 '20 at 15:23
  • [Example of previous comment](https://stackoverflow.com/questions/33782513/find-and-replace-both-quotation-styles-in-python-unicoded-string) – DarrylG Jan 27 '20 at 15:29
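
The quote mismatch described in the comments can be handled either by normalizing the string or by letting the pattern accept both quote styles. Below is a minimal sketch of the second approach using only the standard library. It assumes the quote characters sit directly next to the `<strong>` tags in the converted HTML, as in the example from the question. The names `quoted_bold_terms`, `OPEN_QUOTES`, `CLOSE_QUOTES` and `sample` are made up for illustration, and the sample string is hand-written, not real mammoth output.

```python
import html
import re

# Quote characters to accept on each side: straight " (U+0022)
# plus the curly quotes U+201C (opening) and U+201D (closing).
OPEN_QUOTES = '"\u201c'
CLOSE_QUOTES = '"\u201d'

PATTERN = re.compile(
    rf'[{OPEN_QUOTES}]<strong>(.*?)</strong>[{CLOSE_QUOTES}]',
    re.DOTALL,  # let a bold run span a line break
)


def quoted_bold_terms(html_text):
    """Return the text of every <strong> element wrapped in straight or curly quotes."""
    # html.unescape turns entities such as &quot; or &#8220; into the
    # literal characters, so one pattern covers escaped and unescaped quotes.
    normalized = html.unescape(html_text)
    return PATTERN.findall(normalized)


# Hand-written sample mixing curly quotes, &quot; entities and straight quotes.
sample = (
    'instructed by Business Advisory Services LLP (\u201c<strong>BAS</strong>\u201d), '
    'before the High Court (the &quot;<strong>Court</strong>&quot;) '
    'and ("<strong>CS</strong>")'
)

print(quoted_bold_terms(sample))  # ['BAS', 'Court', 'CS']
```

One caveat with this sketch: `html.unescape` also decodes entities such as `&lt;` and `&amp;`, so the captured text comes back unescaped, and any escaped markup in the document would be treated as real tags. If that matters, adding `&quot;` as an explicit alternative in the pattern avoids the unescape step.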
