How to get readable text from a Gmail message if it contains HTML?

Question

I'm using the Gmail API and want to extract all human-readable text from messages, some of which have the MIME type text/html. Is there a "right" way to do this? I tried BeautifulSoup4 (I use Python), but when I filter by tag some text goes missing, and when I don't, some of the output isn't human-readable. I started from this example and tried to fine-tune it: link

Do you know how to parse this properly, or whether the Gmail API has a feature for it?

score 1 · Answer 1 · answered Mar 13 '20 at 15:16

I'm not entirely sure, but for now code like the example in the link (see the question) works for me with small modifications. If it breaks, I'll post an update here.

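from bs4 import BeautifulSoup
from bs4.element import Comment


def tag_visible(element):
    # Skip text nodes whose parent tag is never rendered as visible text.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]', 'yatag']:  # 'a'
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True selects every text node (replaces the older findAll(text=True) spelling).
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)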

answered Mar 13 '20 at 15:16

ProtsenkoAI


Of course, even though this works for now, I'd like a more stable solution to this task – ProtsenkoAI Mar 13 '20 at 15:18
Gmail doesn't parse the HTML; it returns the snippet as a plain string. If `BeautifulSoup` already does this for you, I don't think it can be made any easier. – Jescanellas Mar 16 '20 at 09:34
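
Since the API hands back MIME parts rather than parsed text, one option is to look for a text/plain part first and only fall back to HTML parsing when there is none. Below is a minimal sketch, assuming the message was fetched in full format so that its payload carries `mimeType`, `body.data` (base64url-encoded) and nested `parts`. The helper names `decode_part` and `collect_bodies` and the sample payload are hypothetical, and UTF-8 is assumed as the charset.

```python
import base64


def decode_part(data):
    """Decode a base64url string such as the one in payload['body']['data']."""
    # Restore any missing '=' padding before decoding.
    padded = data + "=" * (-len(data) % 4)
    # Assumption: the part is UTF-8; undecodable bytes are replaced, not raised.
    return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")


def collect_bodies(payload, found=None):
    """Walk the MIME tree and collect decoded text keyed by MIME type."""
    if found is None:
        found = {}
    mime_type = payload.get("mimeType", "")
    data = payload.get("body", {}).get("data")
    if data and mime_type in ("text/plain", "text/html"):
        # Keep only the first part seen for each type.
        found.setdefault(mime_type, decode_part(data))
    for part in payload.get("parts", []):
        collect_bodies(part, found)
    return found


# Hypothetical payload, shaped like the 'payload' field of a full-format message.
sample_payload = {
    "mimeType": "multipart/alternative",
    "parts": [
        {"mimeType": "text/plain",
         "body": {"data": base64.urlsafe_b64encode(b"Hello, world").decode()}},
        {"mimeType": "text/html",
         "body": {"data": base64.urlsafe_b64encode(b"<p>Hello, <b>world</b></p>").decode()}},
    ],
}

bodies = collect_bodies(sample_payload)
# Prefer the plain-text alternative; parse bodies["text/html"] only if this is None.
readable = bodies.get("text/plain")
```

Messages that carry only a text/html part would still need an HTML-to-text step such as the BeautifulSoup approach in this answer.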

score 0 · Answer 2 · edited Jul 25 '20 at 08:42

Since you're using Python, there is a package called html2text that extracts text from HTML. Afterwards you'll need some basic regex to strip characters such as \n, \t and \r, which remain in the extracted text.
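
A rough sketch of that idea, assuming html2text is installed from PyPI. The function names `normalize_whitespace` and `html_to_readable_text` are made up for illustration, and the cleanup rules (collapse tabs and repeated spaces, keep at most one blank line) are just one reasonable choice rather than anything the package prescribes.

```python
import re


def normalize_whitespace(text):
    """Tidy the \\n, \\t and \\r left over in extracted text."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = re.sub(r"[ \t]+", " ", text)       # tabs / repeated spaces -> one space
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)    # at most one blank line in a row
    return text.strip()


def html_to_readable_text(html):
    """Convert HTML to text with html2text, then tidy the whitespace."""
    import html2text  # third-party package: pip install html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = True    # keep link text, drop the URLs
    converter.ignore_images = True   # skip image placeholders
    converter.body_width = 0         # do not hard-wrap long lines
    return normalize_whitespace(converter.handle(html))
```

Note that html2text emits Markdown-style text, so emphasis and headings may come through with markers such as `**` or `#` that need further stripping if fully plain text is required.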

edited Jul 25 '20 at 08:42

marc_s


answered Mar 19 '20 at 09:25

Vishnuvardhan N


Thank you for your answer, I'll check this one and write later – ProtsenkoAI Mar 26 '20 at 06:20
