Did something change in the encoding of WhatsApp chat exports?

Question

I was trying to follow a tutorial/example on how to import WhatsApp chat text exports into a pandas DataFrame, found here.

When I tried to run it, there was an encoding issue (UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1123: character maps to <undefined>) and a type error (TypeError: data argument can't be an iterator), which I addressed using this SO post.

However, when I open the file exported from WhatsApp with encoding='utf8' (I tried other options, but the file is UTF-8), the code just produces an empty dataframe.

When it didn't work, I found the Stack Overflow post that the tutorial's author created to get their code, which is this one. There, the code seems to work seamlessly, without any errors.

This is the code:

import pandas as pd
import re

def parse_file(text_file):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines
    pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
    with open(text_file) as f:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(f.read())]

    sender = []; message = []; datetime = []
    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
        try:
            s = re.search('m - (.*?):', row).group(1)
            sender.append(s)
        except:
            sender.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except:
            message.append('')

    df = pd.DataFrame(zip(datetime, sender, message), columns=['timestamp', 'sender', 'message'])
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%d/%m/%Y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)

    return df

df = parse_file('chat_data_anon.txt')

My expected results are the same as the author described in their SO post:

I have this:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

and want:

['06/01/2016, 10:40 pm - abcde\n',
 '07/01/2016, 12:04 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
 '07/01/2016, 6:20 pm - abcde\n',
 '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
 '07/01/2016, 7:58 pm - abcde\n']

... Except I only get an empty dataframe. When I broke the code into pieces, it turned out that data is empty. The file I passed is exactly how WhatsApp exported it (a simple .txt file), with no changes.

Can someone please tell me what I'm missing?

Will replacing `return df` with `return df.tolist()` do the trick for you? — Grzegorz Skibinski, Dec 19 '19 at 05:46

score 0 · Answer 1 · answered Apr 06 '20 at 05:20

What worked for me was to read the .txt file first, passing the encoding explicitly. For example:

with open("file.txt", encoding="utf8") as f:
    opened_file = f.read()

You can then work with opened_file.
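
To illustrate why the explicit encoding matters, here is a small sketch; the file name and the sample line are made up. A curly closing quote is stored in UTF-8 as the bytes E2 80 9D, and byte 0x9d has no mapping in cp1252 (a common default on Windows), so decoding with that codec should fail with the same 'charmap' error as in the question:

```python
import os
import tempfile

# Hypothetical sample line containing curly quotes (U+201C and U+201D).
sample = "06/01/2016, 10:40 pm - abcde: he said \u201chi\u201d\n"

# Write the sample as UTF-8, which is how the export is assumed to be encoded.
path = os.path.join(tempfile.mkdtemp(), "chat_sample.txt")
with open(path, "w", encoding="utf8") as f:
    f.write(sample)

# Decoding the same bytes as cp1252 fails on byte 0x9d.
try:
    with open(path, encoding="cp1252") as f:
        f.read()
    cp1252_failed = False
except UnicodeDecodeError as err:
    cp1252_failed = True
    print(err)

# Reading with the explicit UTF-8 encoding returns the original text.
with open(path, encoding="utf8") as f:
    opened_file = f.read()
```

This only addresses the decode error; whether the parsing regex then matches the file is a separate matter.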

Thiago Valente


score 0 · Answer 2 · answered Sep 09 '20 at 01:54

I made 3 small changes and the code is now working well for me:

1 - The dates don't always have two digits for the day and month, but the year always has two digits. I adjusted the regex to reflect this:

r'^(\d+/\d+/\d\d.*?)(?=^\d+/\d+/\d\d,|\Z)'

2 - The end of the datetime field has either AM or PM in capital letters:

s = re.search('M - (.*?):', row).group(1)

3 - The datetime format is actually month/day/year:

df['timestamp'] = pd.to_datetime(df.timestamp, format='%m/%d/%y, %I:%M %p')

import pandas as pd
import re

def parse_file(FULL_PATH):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines
    # (the |\Z alternative keeps the last message, which has no date after it)
    pat = re.compile(r'^(\d+/\d+/\d\d.*?)(?=^\d+/\d+/\d\d,|\Z)', re.S | re.M)
    with open(FULL_PATH, encoding = 'utf8') as raw:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(raw.read())]
    
    sender = []; message = []; datetime = []
    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
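        try:
            s = re.search('M - (.*?):', row).group(1)
            sender.append(s)
        except AttributeError:
            # re.search returned None: no sender on this row
            sender.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except IndexError:
            # no ': ' separator on this row
            message.append('')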

    df = pd.DataFrame(zip(datetime, sender, message), columns=['timestamp', 'sender', 'message'])
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%m/%d/%y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)

    return df

# FULL_PATH must be set to the path of the exported .txt file
df = parse_file(FULL_PATH)
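
As a quick way to check the pattern without a real export, here is a minimal sketch on a made-up sample in the month/day/two-digit-year layout described above. It uses only the standard library (datetime.strptime instead of pd.to_datetime), and it splits the sender with str.partition, which is a simplification and not the code from this answer. The `|\Z` alternative in the lookahead is there so that the last message, which has no date after it, is still captured.

```python
import re
from datetime import datetime

# Made-up sample in the M/D/YY, H:MM AM/PM layout.
sample = (
    "1/6/16, 10:40 PM - abcde: first message\n"
    "1/7/16, 6:14 PM - abcde: second message\n"
    "\n"
    "continues here\n"
    "1/7/16, 7:58 PM - Group created\n"
)

# Each match runs from a date at the start of a line up to the next such date,
# or up to the end of the text for the final message.
pat = re.compile(r'^(\d+/\d+/\d\d.*?)(?=^\d+/\d+/\d\d,|\Z)', re.S | re.M)
rows = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(sample)]

records = []
for row in rows:
    stamp, _, rest = row.partition(' - ')
    sender, sep, message = rest.partition(': ')
    if not sep:
        continue  # event without a sender, e.g. "Group created"
    timestamp = datetime.strptime(stamp, '%m/%d/%y, %I:%M %p')
    records.append((timestamp, sender, message))

for record in records:
    print(record)
```

If rows comes back empty on a real file, the date layout in that file most likely differs from the one the pattern expects.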

score -1 · Answer 3 · answered Dec 19 '19 at 05:22

I just had the same problem. It looks like the WhatsApp export format is different now; at least for me, it is something like:

[dd/mm/yy, hh:mm:ss:] Sender: Message
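
A minimal sketch of a parser for that bracketed layout follows. The exact line shape is an assumption based only on the format shown above: the colon before the closing bracket is treated as optional, and parse_bracketed, LINE, and the sample text are hypothetical names and data, not part of any library or of the original tutorial.

```python
import re

# Assumed line shape: [dd/mm/yy, hh:mm:ss] Sender: Message
# The colon just before the closing bracket is treated as optional.
LINE = re.compile(
    r'^\[(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}:\d{2}):?\] ([^:]+): (.*)$'
)

def parse_bracketed(text):
    '''Return a list of (date, time, sender, message) tuples.'''
    rows = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            rows.append(list(m.groups()))
        elif rows and line.strip():
            # No timestamp: treat the line as a continuation of the previous message.
            rows[-1][3] += ' ' + line.strip()
    return [tuple(r) for r in rows]

# Made-up sample data.
sample = (
    "[19/12/19, 05:22:10] abcde: hello\n"
    "second line\n"
    "[19/12/19, 05:23:00] fghij: reply: with colon\n"
)
parsed = parse_bracketed(sample)
for row in parsed:
    print(row)
```

The resulting tuples can be passed to pd.DataFrame with a columns list, in the same way as in the code from the question.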

Roger
