I am trying to write a parser for WhatsApp conversation logs. A minimal log file is at the end of the question.
The log contains two kinds of messages. Normal messages have the syntax
date time: Name: Message
As you can see, the Message can span multiple lines, and the Name can contain a colon (:).
The second kind are "event" messages, which can be of the following types:
date time: Name joined
date time: Name left
date time: Name was removed
date time: Name changed the subject to “GroupName”
date time: Name changed the group icon
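To make the event formats above concrete, here is a rough sketch of a pattern that anchors the event phrase at the end of the line. It rests on an assumption that is not guaranteed by the log format: a Name may contain ":" but never ": " (colon followed by a space). The names `HEADER`, `EVENT_RE` and `parse_event` are made up for illustration.

```python
import re

# Timestamp prefix shared by every entry, e.g. "29/03/14 16:10:39: "
HEADER = r"(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}): "

# ASSUMPTION: a name never contains ": " (colon + space), so the lookahead
# stops the name before the "Name: Message" separator of a normal message.
NAME = r"(?P<name>(?:(?!: ).)+?)"

# The event phrase must be the last thing on the line.
EVENT_RE = re.compile(
    HEADER + NAME + r" (?P<event>joined|left|was removed"
    r"|changed the subject to “(?P<subject>.*)”"
    r"|changed the group icon)$"
)

def parse_event(line):
    """Return a dict of fields for an event line, or None otherwise."""
    m = EVENT_RE.match(line)
    return m.groupdict() if m else None
```

Under that assumption, a normal message whose last word is "joined" is rejected, because the name cannot stretch across the ": " separator.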
I tried writing some regexes, but ran into several difficulties: how to handle multiline messages; how to parse the Name field (splitting on : does not work); how to build a regex that recognizes messages only from senders who are currently in the group; and how to parse the event messages (for example, checking whether the last word is joined is not a good idea, since a normal message can end with that word).
How can I parse such a log file and move everything to a dictionary?
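For the multiline part, one possible approach is to treat any line that does not start with a timestamp as a continuation of the previous entry. This is only a sketch: it assumes that no message line of its own begins with text shaped like `dd/mm/yy hh:mm:ss: `, and `TIMESTAMP_RE` and `group_entries` are hypothetical names.

```python
import re

# An entry starts with "dd/mm/yy hh:mm:ss: " (format taken from the sample log).
TIMESTAMP_RE = re.compile(r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}: ")

def group_entries(lines):
    """Merge continuation lines into the entry that precedes them."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if TIMESTAMP_RE.match(line) or not entries:
            # A new entry (or a stray first line with nothing to attach to).
            entries.append(line)
        else:
            # No timestamp: this line belongs to the previous message.
            entries[-1] += "\n" + line
    return entries
```

Each element of the returned list is then one complete entry, which can be matched against a single regex with `re.DOTALL` if the message body needs to be captured.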
More precisely, to answer the question in the comment, the output I was thinking of is a nested dict: the first-level keys are the senders, the second level distinguishes between 'Events' (joined, left, etc.) and 'Messages', and each of those holds a list of tuples.
>>> datab[Sender1]['Events']
[('Joined', date1, time1), ('Left', date2, time2)]
>>> datab[Sender2]['Messages']
[(date1, time1, Message1), (date2, time2, Message2)]
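As a minimal sketch of that structure, a `defaultdict` can create the two lists the first time a sender is seen. The helper names `add_event` and `add_message` are hypothetical, and the values are taken from the sample log below.

```python
from collections import defaultdict

# sender -> {"Events": [(kind, date, time), ...],
#            "Messages": [(date, time, text), ...]}
datab = defaultdict(lambda: {"Events": [], "Messages": []})

def add_event(sender, kind, date, time):
    """Record an event such as 'Joined' or 'Left' for a sender."""
    datab[sender]["Events"].append((kind, date, time))

def add_message(sender, date, time, text):
    """Record a normal message for a sender."""
    datab[sender]["Messages"].append((date, time, text))

add_event("John Smith", "Joined", "29/03/14", "16:10:39")
add_message("John Smith", "29/03/14", "16:10:40", "Hello!")
```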
But if you could think of a more intelligent format, go for it!
29/03/14 15:48:05: John Smith changed the subject to “Test”
29/03/14 16:10:39: John Smith joined
29/03/14 16:10:40: Person:2 joined
29/03/14 16:10:40: John Smith: Hello!
29/03/14 16:11:40: Person:2: some random words,
29/03/14 16:12:40: Person3 joined
29/03/14 16:13:40: John Smith: Hello!Test message with newline
Another line of the same message
Another line.
29/03/14 16:14:43: Person:2: Test message using as last word joined
29/03/14 16:15:57: Person3 left
29/03/14 16:17:16: Person3 joined
29/03/14 16:18:21: Person:2 changed the group icon
29/03/14 16:19:16: Person3 was removed
29/03/14 16:20:43: Person:2: Test message using as last word left