
I am trying to write a parser for WhatsApp conversation logs. A minimal log file is included at the end of the question.

The log contains two kinds of messages. Normal messages have the syntax

date time: Name: Message

As you can see, the Message can span multiple lines, and the Name can contain :.

The second kind are "event" messages, which can be of the following types:

date time: Name joined
date time: Name left
date time: Name was removed
date time: Name changed the subject to “GroupName”
date time: Name changed the group icon

I tried to write some regexes, but I ran into several difficulties: how to handle multiline messages; how to parse the Name field (splitting on : does not work); how to build a regex that recognizes messages only from senders who are currently in the group; and how to parse the special messages (for example, checking whether the last word is joined is not a good idea).

How can I parse such a log file and move everything to a dictionary?

More precisely, to answer the question in the comment, the output I had in mind is a nested dict: the first-level keys are the senders, the second level distinguishes 'Events' (such as joined, left, etc.) from 'Messages', and each of those holds a list of tuples.

>>> datab[Sender1]['Events']
[('Joined', data1, time1), ('Left', data2, time2)]

>>> datab[Sender2]['Messages']
[(data1, time1, Message1), (data2, time2, Message2)]

But if you can think of a better format, go for it!

29/03/14 15:48:05: John Smith changed the subject to “Test”

29/03/14 16:10:39: John Smith joined

29/03/14 16:10:40: Person:2 joined

29/03/14 16:10:40: John Smith: Hello!

29/03/14 16:11:40: Person:2: some random words,

29/03/14 16:12:40: Person3 joined

29/03/14 16:13:40: John Smith: Hello!Test message with newline
Another line of the same message
Another line.

29/03/14 16:14:43: Person:2: Test message using as last word joined

29/03/14 16:15:57: Person3 left

29/03/14 16:17:16: Person3 joined

29/03/14 16:18:21: Person:2 changed the group icon

29/03/14 16:19:16: Person3 was removed 

29/03/14 16:20:43: Person:2: Test message using as last word left
Pierpaolo

2 Answers


You can use this pattern:

(?P<datetime>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}): (?P<name>\w+(?::\s*\w+)*|[\w\s]+?)(?:\s+(?P<action>joined|left|was removed|changed the (?:subject to “\w+”|group icon))|:\s(?P<message>(?:.+|\n(?!\n))+))


To deal with multiline messages, I use a negative lookahead to forbid consecutive newline characters. However, you can make the pattern more tolerant by putting the start of the next block or the end of the string in the lookahead after the \n.
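As a rough illustration of how the pattern could be used to fill the nested dict described in the question, here is a minimal Python sketch. The pattern is copied unchanged from above; the names `parse_log`, `datab` and `LOG`, the shortened sample log, and the choice to strip trailing whitespace from messages are assumptions made for this example only.

```python
import re
from collections import defaultdict

# Pattern from the answer, split over several lines for readability.
PATTERN = re.compile(
    r"(?P<datetime>\d{2}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}): "
    r"(?P<name>\w+(?::\s*\w+)*|[\w\s]+?)"
    r"(?:\s+(?P<action>joined|left|was removed|"
    r"changed the (?:subject to “\w+”|group icon))"
    r"|:\s(?P<message>(?:.+|\n(?!\n))+))"
)


def parse_log(text):
    """Return {sender: {"Events": [...], "Messages": [...]}} (hypothetical helper)."""
    datab = defaultdict(lambda: {"Events": [], "Messages": []})
    for m in PATTERN.finditer(text):
        date, time = m.group("datetime").split(" ")
        name = m.group("name")
        if m.group("action") is not None:
            # Event entry: (action, date, time)
            datab[name]["Events"].append((m.group("action"), date, time))
        else:
            # Normal entry: (date, time, message). A final newline at the very
            # end of the file can be captured, so trailing whitespace is stripped.
            datab[name]["Messages"].append((date, time, m.group("message").rstrip()))
    return dict(datab)


# Shortened excerpt of the sample log from the question.
LOG = """29/03/14 16:10:39: John Smith joined

29/03/14 16:10:40: Person:2 joined

29/03/14 16:13:40: John Smith: Hello!Test message with newline
Another line of the same message
Another line.

29/03/14 16:14:43: Person:2: Test message using as last word joined
"""

datab = parse_log(LOG)
print(datab["John Smith"]["Events"])
print(datab["Person:2"]["Messages"])
```

Note that the last entry is stored as a message from Person:2 even though it ends with the word joined, because the action alternative is only tried directly after the name.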

Casimir et Hippolyte
  • It seems very good, but I found another test case that does not work. If we put a space after the : in a name, such as "John: Smith", the regex breaks – Pierpaolo Jul 08 '14 at 11:29
  • This didn't work for a group export I did today. So here's my edited version for anyone else looking for it. ```(?P<datetime>\d{4}\-\d{2}\-\d{2}\,\s*\d{1}:\d{2} [AaPp]\.[Mm]\.)\s*\-\s*(?P<name>\w+(?::\s*\w+)*|[\w\s]+?)\:\s(?:\s+(?P<action>joined|left|was removed|changed the (?:subject to “\w+”|group icon))|(?P<message>(?:.+|\n(?!\d{4}\-\d{2}\-\d{2}\,\s*\d{1}:\d{2} [AaPp]\.[Mm]\.))+))``` – shachna May 14 '19 at 15:15
  • @shachna: if you want to describe a datetime format with am and pm, you have to replace `,\s*\d{1}:\d{2}` with `,\s*\d{1,2}:\d{2}` to allow hours greater than 9. (Note that the quantifier `{1}` is never needed, since each element in a pattern occurs once by default.) `-` and `:` aren't special characters and don't need to be escaped. Also, `:\s(?:\s+(?P` needs at least two consecutive whitespace characters to succeed; are you sure about this part? Perhaps `:\s(?:\s*(?P`? – Casimir et Hippolyte May 14 '19 at 19:03
  • @CasimiretHippolyte Thanks for the time catch. I'm not sure about the white space bit. I've been editing my python script all day and I think the only problem with that version was the missing time bit. – shachna May 14 '19 at 23:45

Late Entry.

@Casimir's answer is from 2014, and the format of WhatsApp messages has changed a bit since then. The following is a corrected regex, but I have not covered the special messages part (joined, left, changed the subject, etc.).

(?<datetime>\d{1,2}\/\d{1,2}\/\d{1,4}, \d{1,2}:\d{1,2}( (?i)[ap]m)*) - (?<name>.*(?::\s*\w+)*|[\w\s]+?)(?:\s+(?<action>joined|left|was removed|changed the (?:subject to "\w+"|group's icon))|:\s(?<message>(?:.+|\n(?!\d{1,2}\/\d{1,2}\/\d{1,4}, \d{1,2}:\d{1,2}( (?i)[ap]m)*))+))
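If you want to try this idea in Python, note that Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`, and that recent Python versions reject an inline `(?i)` in the middle of a pattern, so a scoped `(?i:...)` group is used instead. The following is a simplified sketch, not the exact pattern above: it handles only normal messages, and it assumes that a sender name never contains ": ". The names `STAMP`, `MESSAGE`, `parse_messages` and the sample lines in `LOG` are made up for this example.

```python
import re

# Timestamp in the newer export format, e.g. "8/7/2019, 9:15 pm" (am/pm optional).
STAMP = r"\d{1,2}/\d{1,2}/\d{1,4}, \d{1,2}:\d{1,2}(?: (?i:[ap]m))?"

MESSAGE = re.compile(
    rf"(?P<datetime>{STAMP}) - "
    # Assumption: the name ends at the first ": " on the line.
    r"(?P<name>.+?): "
    # A newline stays part of the message unless the next line starts a new entry.
    rf"(?P<message>(?:.+|\n(?!{STAMP} - ))+)"
)


def parse_messages(text):
    """Return a list of (datetime, name, message) tuples (hypothetical helper)."""
    return [
        (m.group("datetime"), m.group("name"), m.group("message").rstrip("\n"))
        for m in MESSAGE.finditer(text)
    ]


# Made-up sample in the newer format, including a blank line inside a message.
LOG = (
    "8/7/2019, 9:15 pm - John Smith: Hello!\n"
    "Second line of the same message\n"
    "\n"
    "Third line after a blank line\n"
    "8/7/2019, 9:16 pm - Person3: Bye\n"
)

for entry in parse_messages(LOG):
    print(entry)
```

Because the lookahead checks for the start of the next entry instead of a second newline, blank lines inside a message no longer cut the message short.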
Pankaj Singhal