I was trying to follow a tutorial/example on how to import WhatsApp chat text exports into a pandas DataFrame, found here.
When I tried to run it, I hit an encoding error (UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1123: character maps to <undefined>) and a type error (TypeError: data argument can't be an iterator), which I addressed using this SO post.
However, for some reason, when I pass in the file exported from WhatsApp with encoding='utf8' (I tried other options, but the file is UTF-8), it just produces an empty dataframe.
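For reference, the decode error can be reproduced without the chat file. This is only a minimal sketch: it assumes the platform default codec is cp1252 (which is what the 'charmap' wording suggests) and that the offending character is something like a right curly quote, U+201D. Neither assumption is confirmed from the export itself.

```python
# Assumption: the default codec is cp1252 and the file contains a character
# such as U+201D, whose UTF-8 encoding ends in the byte 0x9d.
raw = '\u201d'.encode('utf-8')   # b'\xe2\x80\x9d'

try:
    # Roughly what open(path).read() does when the default codec is cp1252.
    raw.decode('cp1252')
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    error_text = str(err)

# Decoding the same bytes as UTF-8 succeeds and gives the character back.
round_trip = raw.decode('utf-8')
```

This is why passing encoding='utf8' to open() makes the first error go away, but it says nothing yet about why the resulting dataframe is empty.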
When it didn't work, I found the Stack Overflow post the tutorial author created to get their code, which is this one. There, the code seems to work seamlessly and doesn't produce any errors.
This is the code:
import pandas as pd
import re

def parse_file(text_file):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines
    pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
    with open(text_file) as f:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(f.read())]

    sender = []; message = []; datetime = []
    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
        try:
            s = re.search('m - (.*?):', row).group(1)
            sender.append(s)
        except:
            sender.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except:
            message.append('')

    df = pd.DataFrame(zip(datetime, sender, message), columns=['timestamp', 'sender', 'message'])
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%d/%m/%Y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)

    return df

df = parse_file('chat_data_anon.txt')
My expected results are the same as the author described in their SO post:
I have this:
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde
and want:
['06/01/2016, 10:40 pm - abcde\n',
'07/01/2016, 12:04 pm - abcde\n',
'07/01/2016, 12:05 pm - abcde\n',
'07/01/2016, 12:05 pm - abcde\n',
'07/01/2016, 6:14 pm - abcde\n\nfghe\n',
'07/01/2016, 6:20 pm - abcde\n',
'07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
'07/01/2016, 7:58 pm - abcde\n']
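... Except I only get an empty dataframe. When I broke the function into pieces, it turned out that data (the list built from the regex matches) is already empty, so the pattern is not matching anything in my file. The file I passed is exactly how WhatsApp exported it (a simple .txt file), with no changes.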
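To show what an empty data list looks like in isolation, here is a minimal sketch that runs the same pattern from parse_file on two in-memory strings instead of a file. The first string copies the layout from the SO post. The second is a hypothetical layout (bracketed timestamp, short year) invented purely for illustration; it is not taken from the actual export.

```python
import re

# Same pattern as in parse_file.
pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

# Layout from the SO post: dd/mm/yyyy at the start of each message.
expected_format = (
    "07/01/2016, 6:14 pm - abcde\n"
    "fghe\n"
    "07/01/2016, 6:20 pm - abcde\n"
)

# Hypothetical layout, made up for illustration only.
other_format = (
    "[7/1/16, 6:14:00 pm] abcde\n"
    "fghe\n"
    "[7/1/16, 6:20:00 pm] abcde\n"
)

matches_expected = [m.group(1) for m in pat.finditer(expected_format)]
matches_other = [m.group(1) for m in pat.finditer(other_format)]

print(len(matches_expected), len(matches_other))
```

The pattern only starts a match on a line that begins with two digits, a slash, two digits, a slash, and four digits, so any line that starts differently produces no match at all. Printing repr() of the first few lines of the real file would show whether that is what is happening here.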
Can someone please tell me what I'm missing?