I am cleaning data from a .txt source. Every line of the file contains a WhatsApp message, including a date and time stamp. I have already split each line into one column holding the date and time information, df['text'], and one column holding the message text, df['text_new']. Based on this I want to create a word cloud, which is why I need every word from the conversations as a single entry in a separate pandas data frame.

I need your help with further cleaning and transformation of this data.

Let's suppose the data frame column df['text_new'] is this:

0    How are you? 
1    I am fine, we should meet this afternoon!
2    Okay let us do that. 

What do I want to do?

  1. Remove all punctuation from the text.
  2. Split the messages into separate words, so that each dataframe entry holds exactly one word.
  3. If possible, each smiley should be treated as a single word. If that is not possible, how can I clean them out?
  4. Make all text lower case. There is already a solution for that, but it would be really nice to include it in the "cleaning code".

Now that you know the four steps I want to run, maybe someone has a clean and neat way to do that.

Thank you all in advance!

Mike_H
  • You want `df.text_new.str.lower()`. – Jacob Tomlinson Dec 13 '18 at 13:25
  • Thank you. I get the error "Wrong number of items passed 2, placement implies 8362". That might be because of the smileys, which is why I want to split the text before converting it all to lower case. If the smileys are the root cause of this error, I would need a hint on how to clean them out as well. – Mike_H Dec 13 '18 at 13:29
  • @jpp Thank you, I edited my question. However, my primary concern is everything other than getting the text lower case, so that is an answer to only one of my four questions regarding the cleaning. Can you please reopen my question? – Mike_H Dec 13 '18 at 13:34
  • @jezrael Thank you for assistance. My biggest problem is with the other 3 steps before getting all text lower case. – Mike_H Dec 13 '18 at 13:36

1 Answer

Use:

import re

#https://stackoverflow.com/a/49146722
emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       "]+", flags=re.UNICODE)

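# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
df['new'] = (df['text_new'].str.lower() #lowercase
                           .str.replace(r'[^\w\s]+', '', regex=True) #rem punctuation
                           .str.replace(emoji_pattern, '', regex=True) #rem emoji
                           .str.strip() #rem leading/trailing whitespaces
                           .str.split()) #split by whitespaces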

Sample:

# imports must come first, otherwise re.compile raises "name 're' is not defined"
import re
import pandas as pd

df = pd.DataFrame({'text_new':['How are you?',
                               'I am fine, we should meet this afternoon!',
                               'Okay let us do that. \U0001f602']})

emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       "]+", flags=re.UNICODE)

# regex=True is needed in pandas >= 2.0
df['new'] = (df['text_new'].str.lower()
                           .str.replace(r'[^\w\s]+', '', regex=True)
                           .str.replace(emoji_pattern, '', regex=True)
                           .str.strip()
                           .str.split())
print (df)
                                    text_new  \
0                               How are you?   
1  I am fine, we should meet this afternoon!   
2                     Okay let us do that.    

                                                new  
0                                   [how, are, you]  
1  [i, am, fine, we, should, meet, this, afternoon]  
2                         [okay, let, us, do, that] 
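
If each smiley should instead be kept as a word of its own (point 3 of the question), one possible approach is to extract tokens rather than delete characters. The following is only a sketch: the name `tokenize` and the emoji range in `TOKEN_RE` are assumptions made for illustration, the range is rough and does not cover every emoji, and emoji built from several code points (flags, skin tones) would be split into parts.

```python
import re

# Assumption: a rough, hand-picked emoji range, not a complete list.
# A token is either a run of word characters or ONE character from that range.
TOKEN_RE = re.compile(r"\w+|[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def tokenize(text):
    """Lowercase the text; return words and single emoji as separate tokens."""
    return TOKEN_RE.findall(text.lower())

# Punctuation is dropped because it matches neither alternative.
print(tokenize('Okay let us do that. \U0001f602\U0001f602'))
```

Under the same assumptions it could be applied to the column with `df['text_new'].map(tokenize)`, which gives a list column like `new` above but with the emoji retained.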

EDIT:

# regex=True is needed in pandas >= 2.0
df['new'] = (df['text_new'].str.lower()
                           .str.replace(r'[^\w\s]+', '', regex=True)
                           .str.replace(emoji_pattern, '', regex=True)
                           .str.strip())
print (df)
                                    text_new  \
0                               How are you?   
1  I am fine, we should meet this afternoon!   
2                     Okay let us do that.    

                                       new  
0                              how are you  
1  i am fine we should meet this afternoon  
2                      okay let us do that 
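
For the remaining goal from the question, one word per data frame entry, a possible sketch is to split as in the first version and then call `Series.explode()` (available from pandas 0.25). The names `tokens`, `words` and `counts` are chosen here for illustration; the emoji step is left out because `[^\w\s]+` already removes the emoji in this sample.

```python
import pandas as pd

df = pd.DataFrame({'text_new': ['How are you?',
                                'I am fine, we should meet this afternoon!',
                                'Okay let us do that. \U0001f602']})

# Lowercase, drop everything that is neither a word character nor
# whitespace (this also drops the emoji here), then split into lists.
tokens = (df['text_new'].str.lower()
                        .str.replace(r'[^\w\s]+', '', regex=True)
                        .str.strip()
                        .str.split())

# explode() puts each list element on its own row and repeats the original
# row label, so reset_index(drop=True) gives a clean 0..n-1 index.
words = tokens.explode().reset_index(drop=True).to_frame('word')

# Word frequencies, which is the usual input for a word cloud.
counts = words['word'].value_counts()
print(words)
```

On older pandas versions the flat list from the comments, `[y for x in tokens for y in x]`, wrapped in `pd.DataFrame`, should give the same one-word-per-row layout.
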
jezrael
  • Thank you, that is probably very close to what I need. I get these errors now: IndentationError: unexpected indent and 're' is not defined. Maybe there is a way to delete all emojis from the text? – Mike_H Dec 13 '18 at 13:38
  • Okay, it works partially, but now every entry looks like this: [sleep, well!] [we, should, meet, tomorrow?] So dots and emojis are gone (great!), but the text still contains commas from the regular messages that have to be deleted before we run this command: ".str.replace(r'[^\w\s]+', '') #rem punctuation ". How could I do that? – Mike_H Dec 13 '18 at 13:47
  • Okay, I just saw your edit! What I have looks right. How could we split the words in the sentences into one word per dataframe entry? That's my goal and the last thing missing to fully solve my problem! Thank you very much so far!! – Mike_H Dec 13 '18 at 13:52
  • So I need every single word, i.e. the arrays [how, are, you], [okay, let, us, do, that] should be transformed so that there is only one list of the single words: how, are, you, okay, let, us, do, that. Every word should be in a new entry and without brackets around it. Maybe this clarifies what I need. Thank you in advance! – Mike_H Dec 13 '18 at 14:09
  • @Mike_H - So you need to remove `.str.split()`? Or do you need one big list with all words from all sentences, like `s = (df['text_new'].str.lower() .str.replace(r'[^\w\s]+', '') .str.replace(emoji_pattern, '') .str.strip() .str.split())` and `L = [y for x in s for y in x]`? – jezrael Dec 13 '18 at 14:12
  • Can you maybe include it into your answer? If I change it in my code, nothing really changes. I'm sure I did something wrong. – Mike_H Dec 13 '18 at 14:18
  • @Mike_H - so you only need to remove `.str.split()`? – jezrael Dec 13 '18 at 14:24
  • That does not change anything for me if I just remove it from the code. It's weird. – Mike_H Dec 13 '18 at 14:28
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/185214/discussion-between-mike-h-and-jezrael). – Mike_H Dec 13 '18 at 14:32