1

I'm using the Gmail API with Python and want to extract all human-readable text from messages, some of which are in MIME text/html format. Is there a "right" way to do this? I tried BeautifulSoup4, but when I filter by tag some text goes missing, and when I don't, some of the output is not human-readable. I started from this example and tried to fine-tune it: link

Do you know how to parse this properly, or whether some Gmail API feature can help?

2 Answers

1

I'm not entirely sure, but for now code like the one in the link (see the question) works for me with small modifications. If it breaks, I'll post an update here.

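from bs4 import BeautifulSoup
from bs4.element import Comment


def tag_visible(element):
    # Drop text nodes whose parent tag is not rendered as page text.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]', 'yatag']:  # 'a'
        return False
    # Drop HTML comments.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # string=True matches every text node (older code used findAll(text=True)).
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    # Strip each node and skip whitespace-only ones to avoid runs of spaces.
    stripped = (t.strip() for t in visible_texts)
    return " ".join(s for s in stripped if s)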
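On the Gmail API side, it may help to check whether the message also carries a text/plain part before parsing HTML at all. The following is only a sketch: `find_body` is a hypothetical helper name, the dictionary shape (`mimeType`, `body.data`, `parts`) follows the message payload returned by the API, and UTF-8 is assumed for the decoded bytes, which will not hold for every message.

```python
import base64


def find_body(payload, mime_type):
    """Return the decoded body of the first part with the given MIME type, or None.

    Hypothetical helper; `payload` is assumed to be the `payload` dict of a
    Gmail API message fetched in full format.
    """
    if payload.get("mimeType") == mime_type:
        data = payload.get("body", {}).get("data")
        if data:
            # Body data is URL-safe base64; restore padding in case it was stripped.
            padded = data + "=" * (-len(data) % 4)
            # Assumption: the part is UTF-8 encoded.
            return base64.urlsafe_b64decode(padded).decode("utf-8", errors="replace")
    # Multipart messages nest their parts, so search recursively.
    for part in payload.get("parts", []):
        found = find_body(part, mime_type)
        if found is not None:
            return found
    return None


def _encode(text):
    # Build URL-safe base64 test data in the same form as body.data.
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


# Hand-made sample shaped like a multipart/alternative message payload.
sample = {
    "mimeType": "multipart/alternative",
    "body": {"size": 0},
    "parts": [
        {"mimeType": "text/plain", "body": {"data": _encode("Hello")}},
        {"mimeType": "text/html", "body": {"data": _encode("<p>Hello</p>")}},
    ],
}

# Prefer the plain-text part; fall back to HTML only when it is missing.
plain = find_body(sample, "text/plain")
html = find_body(sample, "text/html")
```

When `plain` is not None it can be used directly; otherwise `html` can be passed to `text_from_html` above.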
0

Since you're using Python, there is a package called html2text that extracts text from HTML. The result will still contain characters such as \n, \t, and \r, so you need some basic regex afterwards to strip them out.
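A minimal sketch of that cleanup step, using only the standard `re` module. `normalize_whitespace` is a hypothetical name, and the input is assumed to be text already extracted from HTML (for example the string returned by `html2text.html2text(html)`). Whether to keep single line breaks or flatten everything to spaces is a judgment call; this version keeps one newline between lines.

```python
import re


def normalize_whitespace(text):
    """Collapse tabs, carriage returns, and repeated blank lines (hypothetical helper)."""
    # Treat \r\n and bare \r as ordinary newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove the space left on either side of a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse consecutive newlines into one.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


cleaned = normalize_whitespace("Hello\t\tworld\r\n\r\n\n  Bye \n")
# cleaned == "Hello world\nBye"
```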
