How to cleanse a string of data so it can be used in Pandas / Converting one column into multiple columns

Question

I am trying to analyse a WhatsApp chat export by loading it into a Pandas DataFrame, but it is read as a single column. What do I need to do to fix this? I believe the problem is in how the data needs to be formatted.

I have tried reading the file and then using Pandas to split it into columns, but because of how it is read, I believe Pandas only sees one column. I have also tried pd.read_csv, with and without the sep argument, and that does not give the correct result either.

The information from WhatsApp appears as follows in the notebook:

[01/09/2017, 13:51:27] name1: abc
[02/09/2017, 13:51:28] name2: def
[03/09/2017, 13:51:29] name3: ghi
[04/09/2017, 13:51:30] name4: jkl
[05/09/2017, 13:51:31] name5: mno
[06/09/2017, 13:51:32] name6: pqr

The Python code is as follows:

import re
import sys
import pandas as pd
pd.set_option('display.max_rows', 500)

def read_history1(file):
  chat = open(file, 'r', encoding="utf8")


  #get all which exist in this format
  messages = re.findall('\d+/\d+/\d+, \d+:\d+:\d+\W .*: .*', chat.read())
  print(messages)
  chat.close()

  #make messages into a database
  history = pd.DataFrame(messages, columns=['Date','Time', 'Name', 'Message'])
  print(history)

  return history


#the encoding is added because of the way the file is written
#https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character/9233174

#i tried using sep, but it is not ideal for this data
def read_history2(file):
  messages = pd.read_csv(file)
  messages.columns = ['a','b']
  print(messages.head())
  return

filename = "AFC_Test.txt"
read_history2(filename)

The two methods I have tried are above. I expect 4 columns: the date, time, name and message for each row.

Some sample data would help us to understand your problem better. Please create a [mcve] with sample inputs and outputs. – G. Anderson, Jun 27 '19 at 18:57
I have added the information from WhatsApp, apologies for missing that. If any further information is required I will add it on request. – Tejkaran Samra, Jun 27 '19 at 21:15

Ash Oldershaw · Answer 1 · 2019-06-28T09:33:40.877

0

So you can split each line into a set of strings, with code that might look a bit like this:

# read in file
with open(file, 'r', encoding="utf8") as chat:
    contents = chat.read()

# list for each line of the dataframe
rows = []

# clean data up into nice strings
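for line in contents.splitlines():
    # split into at most 4 parts so the message text stays in one piece
    # (this assumes the name itself contains no spaces)
    parts = line.split(maxsplit=3)
    if len(parts) < 4:
        continue  # skip blank or malformed lines
    # strip brackets and punctuation from date, time and name only
    rows.append([item.strip("[],:") for item in parts[:3]] + [parts[3]])


# create dataframe
history = pd.DataFrame(rows, columns=['Date','Time', 'Name', 'Message'])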

I think that should work!

Let me know how it goes :)
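
If the lines have already ended up in a single DataFrame column, as in the title of the question, a similar split can probably be done inside pandas itself. The following is only a minimal sketch: it swaps the strip() loop for Series.str.extract with named groups, and the column name "raw" and the variable names are hypothetical, chosen just for illustration. The sample lines are copied from the question.

```python
import pandas as pd

# Two sample lines from the question, held in one column.
# The column name "raw" is an assumption made for this sketch.
lines = [
    "[01/09/2017, 13:51:27] name1: abc",
    "[02/09/2017, 13:51:28] name2: def",
]
single = pd.DataFrame({"raw": lines})

# Each named group becomes a column of the result.
# The name is everything up to the first colon after the timestamp.
pattern = (
    r"\[(?P<Date>\d+/\d+/\d+), (?P<Time>\d+:\d+:\d+)\] "
    r"(?P<Name>[^:]*): (?P<Message>.*)"
)
history = single["raw"].str.extract(pattern)
print(history)
```

Lines that do not match the pattern come back as rows of NaN, so they can be dropped afterwards with history.dropna() if that is the behaviour you want.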

edited Jun 28 '19 at 09:33

answered Jun 27 '19 at 21:27


It says: TypeError: 'list' object is not callable - for item in newline() – Tejkaran Samra Jun 27 '19 at 23:10

Hi Tejkaran, I fixed the problem, it should work now. Purely for the sake of readability, I would suggest using the strip() method here, as regex is more difficult to read, although you do you! Have a nice day – Ash Oldershaw Jun 28 '19 at 09:34
As they say there are many ways to one solution, but thank you. I am learning at the moment, so I will try your way too. – Tejkaran Samra Jun 28 '19 at 14:51
No worries! All the best with your journey :) – Ash Oldershaw Jun 28 '19 at 16:02

score 0 · Answer 2 · answered Jun 28 '19 at 09:14

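In case anyone comes across this, I resolved it as follows. The error was in the regex: without capture groups, re.findall returns each match as a single string, so there was only one value per row. Wrapping the date, time, name and message in parentheses makes it return one 4-tuple per line, which pd.DataFrame can spread across the four columns.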

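def read_history2(file):
    print('\n')
    # the context manager closes the file once it has been read
    with open(file, 'r', encoding="utf8") as chat:
        # raw string keeps the backslashes intact; one capture group per column
        content = re.findall(r'\W(\d+/\d+/\d+), (\d+:\d+:\d+)\W (.*): (.*)', chat.read())
    history = pd.DataFrame(content, columns=['Date','Time', 'Name', 'Message'])
    print(history)
    return history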

filename = "AFC_Test.txt"
read_history2(filename)
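
To see what the capture groups change, here is a small sketch that runs both patterns on two of the sample lines from the question, using only the standard library. The variable names text, whole and parts are made up for this illustration.

```python
import re

# Two sample lines copied from the question
text = (
    "[01/09/2017, 13:51:27] name1: abc\n"
    "[02/09/2017, 13:51:28] name2: def\n"
)

# No capture groups: findall returns the whole match as one string per line
whole = re.findall(r'\d+/\d+/\d+, \d+:\d+:\d+\W .*: .*', text)

# Four capture groups: findall returns one 4-tuple per line
parts = re.findall(r'\W(\d+/\d+/\d+), (\d+:\d+:\d+)\W (.*): (.*)', text)

print(whole[0])  # 01/09/2017, 13:51:27] name1: abc
print(parts[0])  # ('01/09/2017', '13:51:27', 'name1', 'abc')
```

One caveat with this pattern: the name group (.*) is greedy, so a message that itself contains ": " would push part of the message into the Name column. Changing that group to (.*?) is one possible way around it, if your chat has such messages.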
