
I have a simple email in Gmail that looks like this:

Hi all

@alice - please prepare XXX for tomorrow
@bob - please prepare YYY for tomorrow

best,
Z

and I would like to fetch it, parse it and split by newline, so I would get a list of 5 elements:

['Hi all','@alice ...', '@bob ...', 'best,','Z']

but for some reason I get \r\n inside a sentence, which breaks the line into two lines even though the original email had no newline there.

I parse it as follows (after getting the proper credentials):

txt = service.users().messages().get(userId=user.email, id=email_msg['id']).execute()
payload = txt["payload"]
headers = payload["headers"]

parts = payload.get("parts")[0]
data = parts["body"]["data"]
data = data.replace("-", "+").replace("_", "/")
decoded_message = str(base64.b64decode(data), "utf-8")
split = decoded_message.splitlines()
final_split = list(filter(None, split))

but then the message I get looks like this:

Hi all\r\n\r\n@alice - please prepare XXX\r\nfor tomorrow\r\n@bob - please prepare YYY for tomorrow\r\n\r\nbest,\r\nZ

so if I split by \r\n or \n I get an invalid result.

Alex L

2 Answers


When you decode the data using b64decode(), you don't get a string; you get a byte string (a bytes object). Here's an excellent explanation of the difference. Before trying to parse the message, you have to convert it into a regular string.

You can do this by running .decode("utf-8"). Then you can just use .splitlines() to split the message.

import base64

txt = service.users().messages().get(userId=user.email, id=email_msg['id']).execute()
payload = txt["payload"]
headers = payload["headers"]

parts = payload.get("parts")[0]
data = parts["body"]["data"]
data = data.replace("-", "+").replace("_", "/")
decoded_data = base64.b64decode(data)

decoded_message = decoded_data.decode("utf-8") # decodes the byte string

split = decoded_message.splitlines() # splits the message into a list

final_split = list(filter(None, split)) # this removes the blank lines

Running .decode() on the message will change it from this:

Hi all\r\n\r\n@alice - please prepare XXX\r\nfor tomorrow\r\n@bob - please prepare YYY for tomorrow\r\n\r\nbest,\r\nZ

To the original message:

Hi all

@alice - please prepare XXX for tomorrow
@bob - please prepare YYY for tomorrow

best,
Z

Then after .splitlines() you will get this list:

['Hi all', '', '@alice...', '@bob...', '', 'best,', 'Z']

Note that there are blank strings that correspond to the blank lines. To get rid of them, run the last line, final_split = list(filter(None, split)) (other methods work as well), which will give you what you're looking for:

['Hi all', '@alice...', '@bob...', 'best,', 'Z']

By the way, I did not install BeautifulSoup for this, but if you want to use it you probably want to add it after you decode the byte string.
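
For reference, the decode-and-split steps can be sketched in a self-contained form. This is only a sketch under a few assumptions: it uses base64.urlsafe_b64decode from the standard library in place of the manual replace("-", "+") / replace("_", "/") step, the helper names decode_body and non_blank_lines are made up for illustration, and the sample text is built locally instead of being fetched from Gmail.

```python
import base64


def decode_body(data: str) -> str:
    """Decode a base64url string (hypothetical helper) into text."""
    # Restore any '=' padding that may have been stripped from the string.
    padded = data + "=" * (-len(data) % 4)
    # urlsafe_b64decode accepts '-' and '_' directly and returns bytes.
    return base64.urlsafe_b64decode(padded).decode("utf-8")


def non_blank_lines(text: str) -> list:
    """Split on any line ending and drop empty or whitespace-only lines."""
    return [line for line in text.splitlines() if line.strip()]


# Stand-in for parts["body"]["data"]: encode a sample message locally.
sample = (
    "Hi all\r\n\r\n"
    "@alice - please prepare XXX for tomorrow\r\n"
    "@bob - please prepare YYY for tomorrow\r\n\r\n"
    "best,\r\nZ"
)
encoded = base64.urlsafe_b64encode(sample.encode("utf-8")).decode("ascii")

lines = non_blank_lines(decode_body(encoded))
print(lines)
```

Unlike filter(None, split), the list comprehension also drops lines that contain only whitespace. Neither approach can remove a line break that sits inside a sentence in the decoded text.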

Daniel
  • thanks for the specified answer! I tried adding `decoded_data.decode("utf-8")` after `b64decode`, but it did not change anything - the `\r\n` are still there – Alex L May 17 '22 at 06:06
  • Can you post the modified code? – Daniel May 17 '22 at 16:00
  • as you suggested, after `decoded_data = base64.b64decode(data)` I added `decoded_message = decoded_data.decode("utf-8")`. But I still see '\r\n' – Alex L May 17 '22 at 18:53
  • That's odd, you're using Python 3, right? Are you running the data through BeautifulSoup before trying to decode it? It's possible that the soup turns it into a normal string without parsing out the escape sequence first. – Daniel May 17 '22 at 20:16
  • Using python `3.9.10`. I use `BeautifulSoup` after decoding. – Alex L May 17 '22 at 20:20
  • How about doing it in one go. Try `decoded_data=str(base64.b64decode(data), "utf-8")`. You can skip the `.decode()` since that should do it right away. – Daniel May 17 '22 at 20:32
  • Same result. It seems that Gmail somehow add random `\r\n` in the emails – Alex L May 17 '22 at 20:41
  • It's not really Gmail adding it but the result of the Python function. Can you edit your original post with your current code? Maybe there's something missing. – Daniel May 17 '22 at 20:59
  • added the latest code – Alex L May 17 '22 at 21:04
  • It's strange. I pretty much copied your code to test and it works for me. We can try to track down the issue at the source. Try to get the base64 email in the `data` variable and plug it into https://www.base64decode.org/ instead of using Python. Is the output the same with the escape sequences? – Daniel May 17 '22 at 21:40
  • As you suggested I checked the data in `base64decode.org`: `SGkgQXJpZWwgaG93IGFyZSB5b3U/DQpJ4oCZZCBsaWtlIHlvdSB0byBjaGVjayB0aGUgY3VzdG9tZXJzIHNpdGUgdG9tb3Jyb3cgYXQgOSBBTSBJ4oCZZCBsaWtlIHlvdSB0bw0KY2hlY2sgdGhlIGN1c3RvbWVycyBJ4oCZZCBsaWtlIHlvdSB0byBjaGVjayB0aGUgY3VzdG9tZXJzDQo=`. I see the line breaks after `like you to`, but in the email I do not have new line there – Alex L May 18 '22 at 20:12
  • The reason for that newline is that the message is also encoded in [quoted-printable](https://en.wikipedia.org/wiki/Quoted-printable) format, which adds line breaks every roughly 76 characters. Python has a `quopri` module to decode this but in my tests I was unsuccessful trying to remove that line. I have another suggestion, `parts = payload.get("parts")[0]` gets the message in plain text, but `parts = payload.get("parts")[1]` gets it in HTML format, which you can try to run through BeautifulSoup. Maybe this was your original purpose? You'll still have to decode the byte string, by the way. – Daniel May 19 '22 at 02:21
  • I tried to parse `payload.get("parts")[1]` as `HTML` but without any success – Alex L May 19 '22 at 19:43
  • Why? What went wrong in that case? – Daniel May 20 '22 at 01:38
  • The decoded `html` looks like this: `'
    Hi Ariel how are you?
    I’d \r\nlike you to check the customers site tomorrow at 9 AM I’d like you to \r\ncheck the customers I’d like you to check the customers
    Also we need you to check with the client if she's available on the next Monday morning to meet with us on the site

    Best,
    Bob
    \r\n'`. I still get `\r\n`.
    – Alex L May 20 '22 at 06:46
  • It's strange because in my environment decoding the string worked with my tests and even with the sample base64 string that you provided. Unfortunately I'm not an expert on this so you may want to post another question asking why the python `decode` would not work for you. – Daniel May 23 '22 at 11:51
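
One way to check where the extra breaks come from is to decode the sample base64 string from the comments directly. A minimal sketch: the string is copied from the thread, and the variable names are only for illustration. That the sending client hard-wraps the plain-text part is an assumption, not something confirmed here.

```python
import base64

# Sample payload quoted in the comments (standard base64 alphabet).
data = (
    "SGkgQXJpZWwgaG93IGFyZSB5b3U/DQpJ4oCZZCBsaWtlIHlvdSB0byBjaGVjayB0aGUg"
    "Y3VzdG9tZXJzIHNpdGUgdG9tb3Jyb3cgYXQgOSBBTSBJ4oCZZCBsaWtlIHlvdSB0bw0K"
    "Y2hlY2sgdGhlIGN1c3RvbWVycyBJ4oCZZCBsaWtlIHlvdSB0byBjaGVjayB0aGUgY3VzdG9tZXJzDQo="
)

text = base64.b64decode(data).decode("utf-8")

# repr() makes the \r\n sequences visible instead of rendering them.
print(repr(text))

lines = text.splitlines()
for number, line in enumerate(lines, start=1):
    print(number, len(line), line)
```

The decoded text contains a real \r\n after the first "like you to", so the break is already in the data returned for the plain-text part. Calling .decode("utf-8") or str(..., "utf-8") turns bytes into text but does not remove line breaks that are part of the content.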

As was suggested in the comment by Daniel, I used the HTML data in order to extract the message correctly:

I defined the HTML parser:

from html.parser import HTMLParser
from io import StringIO

def extract_text(html_text: str) -> str:
    class MLStripper(HTMLParser):
        def __init__(self):
            super().__init__()
            self.reset()
            self.strict = False
            self.convert_charrefs = True
            self.text = StringIO()

        def handle_data(self, d):
            self.text.write(d)

        def get_data(self):
            return self.text.getvalue()

    def strip_tags(html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()

    cleaned_html_text = html_text.replace('</div>', '\n</div>').replace('\r\n', '')\
        .replace('<br>', '\n').replace('\xa0', ' ')
    return strip_tags(cleaned_html_text)

and then run it on the HTML:

import base64

parts = payload.get("parts")[1] # take the HTML part
data = parts["body"]["data"]
data = data.replace("-", "+").replace("_", "/")
decoded_message = str(base64.b64decode(data), "utf-8")
extracted_message = extract_text(html_text=decoded_message)
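
Picking the HTML part by index assumes it is always the second entry in parts. A hedged alternative is to select the part by its mimeType field. This is only a sketch: find_part and part_text are hypothetical helpers, and fake_payload is a hand-built stand-in for the payload of a real message, so the exact nesting of a given email may differ.

```python
import base64
from typing import Optional


def find_part(payload: dict, mime_type: str) -> Optional[dict]:
    """Return the first part (depth-first) whose mimeType matches, or None."""
    if payload.get("mimeType") == mime_type:
        return payload
    # Multipart payloads nest their children under "parts".
    for part in payload.get("parts", []):
        found = find_part(part, mime_type)
        if found is not None:
            return found
    return None


def part_text(part: dict) -> str:
    """Decode the base64url body of a part into text."""
    data = part["body"]["data"]
    padded = data + "=" * (-len(data) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


def _b64(text: str) -> str:
    """Encode text the way the sample payload below expects (test helper)."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


# Hand-built stand-in for txt["payload"], for illustration only.
fake_payload = {
    "mimeType": "multipart/alternative",
    "parts": [
        {"mimeType": "text/plain", "body": {"data": _b64("Hi all\r\n")}},
        {"mimeType": "text/html", "body": {"data": _b64("<div>Hi all</div>")}},
    ],
}

html_part = find_part(fake_payload, "text/html")
html_text = part_text(html_part)
print(html_text)
```

The resulting html_text could then be passed to extract_text as html_text=html_text.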
Alex L