7

Using Canopy and Pandas, I have a data frame defined by:

import pandas as pd

a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]

text.txt is a single-column file containing a list of strings that mix text, numbers and punctuation.

Assuming df looks like:


test
%hgh&12
abc123!!!
porkyfries


I want my results to be:


test
hgh12
abc123
porkyfries


Effort so far:

from string import punctuation  # import Python's built-in punctuation list

a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]  # define the dataframe

for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)
    df2

The loop above basically just returns the same data set. I'd appreciate any leads.

Edit: The reason I am using Pandas is that the data is huge, spanning about 1M rows, and this code will later be applied to lists of up to 30M rows. Long story short, I need to clean data very efficiently for big data sets.

BernardL
  • is all the data text or are there any numbers? for example, if someone types 3.14 as a string, do you really want to strip the period? – philshem Feb 10 '14 at 09:35
  • @philshem Yes, for this particular case. The data size is huge, possibly spanning millions of rows, which is why I am discounting entries with punctuation and the like. Think of it this way: "Paracetemol 50mg 10% Discount" should only return "Paracetemol". Similarly, for typos, "Actife@4d" should return "Actife4d", as a first level of filter. – BernardL Feb 10 '14 at 10:55

3 Answers

5

Using replace with the correct regex would be easier:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

Use a regex with the pattern [^\w\s], which means "not alphanumeric and not whitespace":

In [49]:

df['text'] = df['text'].str.replace('[^\w\s]','')
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]
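
On newer Python and pandas versions the same call is better written with a raw string and an explicit regex=True (the default for str.replace changed to regex=False in pandas 2.0); a minimal sketch of the adjusted call:

import pandas as pd

df = pd.DataFrame({'text': ['test', '%hgh&12', 'abc123!!!', 'porkyfries']})
# raw string avoids escape-sequence warnings; regex=True keeps the pattern treated as a regex
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)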
EdChum
  • Hmmm, that makes sense. But probably gotta figure out the reason why & was not omitted. – BernardL Feb 10 '14 at 09:29
  • @user3288092 just read the docs, `strip` removes characters at beginning and end, hence the error, you should use `replace` – EdChum Feb 10 '14 at 09:34
  • What about unicode punctuation, such as the many types of dashes? https://www.cs.tut.fi/~jkorpela/dashes.html#unidash – philshem Feb 10 '14 at 09:38
  • @user3288092 I've updated my answer after figuring out what I think the correct regex pattern should be, it works for this limited sample data – EdChum Feb 10 '14 at 09:41
  • @EdChum Hi, thanks for your answer, I assume that will work. But I am looking to store the values that I continually need to replace in a list and run that command every time. The list will keep growing as a filter, so the method has to be efficient and easy to update (see the sketch after this comment). – BernardL Feb 10 '14 at 11:05
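
Regarding the last comment about keeping a growing list of values to strip: one option is to join the escaped entries into a single pattern so the whole list is applied in one vectorized pass. A minimal sketch, with purely illustrative list contents:

import re
import pandas as pd

# entries to strip; this list can keep growing (values here are only illustrative)
filters = ['%', '&', '!!!', '@']

# escape each entry and join them into one alternation, applied in a single pass
pattern = '|'.join(map(re.escape, filters))

df = pd.DataFrame({'test': ['%hgh&12', 'abc123!!!', 'porkyfries']})
df['test'] = df['test'].str.replace(pattern, '', regex=True)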
5

For removing punctuation from a text column in your dataframe:

In:

import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)

pattern

Out:

'[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]'

In:

df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df

Out:

        text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick % 

In:

df['text'] = df['text'].str.replace(pattern, '')
df

You can also replace the matched pattern with a character of your choice, e.g. replace(pattern, '$')

Out:

        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick  
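
One caveat with formatting string.punctuation straight into a character class: it contains ], \ and ^, which are special inside a class, and the pattern above only works because they happen to land in harmless positions. Escaping the set first is a little more defensive; a minimal sketch of that variant:

import re
import string
import pandas as pd

# escape the whole punctuation set before wrapping it in a character class
pattern = '[' + re.escape(string.punctuation) + ']'

df = pd.DataFrame({'text': ['book...regh', '"rope"', 'rick % ']})
df['text'] = df['text'].str.replace(pattern, '', regex=True)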
Aakash Saxena
1

Translate is often considered the cleanest and fastest way to remove punctuation (source)

import string
text = text.translate(None, string.punctuation.translate(None, '"'))

You may find that it works better to remove punctuation from the raw text file before loading it into pandas.
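
That snippet uses the Python 2 form of str.translate. Inside pandas on Python 3, a similar route is to build a table with str.maketrans and pass it to Series.str.translate; a minimal sketch:

import string
import pandas as pd

df = pd.DataFrame({'test': ['%hgh&12', 'abc123!!!', 'porkyfries']})

# translation table that deletes every punctuation character (Python 3 API)
table = str.maketrans('', '', string.punctuation)
df['test'] = df['test'].str.translate(table)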

philshem
  • It is returning an error that DataFrame does not have a 'translate' attribute. Sorry, I also need to mention that the data is huge, which is why I am trying to implement this in Pandas. – BernardL Feb 10 '14 at 09:28