
I have a CSV file of SMS text messages in UTF-8 encoding.

import pandas as pd

# the file is UTF-8 encoded; sep=',' is pandas' default, shown explicitly
data = pd.read_csv('my_data.csv', sep=',', encoding='utf-8')
data.head()

It produces output like this:

id    city    department    sms                    category
01    khi      revenue      quk respns.                1
02    lhr      revenue      good.                      1
03    lhr      revenue      †h\0h2h\0hh\               0
04    isb      accounts     ?xœ1øiûüð÷üœç8i            0
05    isb      accounts     %â¡ã‘ã¸$ãªã±t%rã«ãÿã©â£    0

I want to remove all the records/rows where the sms column has garbage values, as in records 3, 4 and 5. Maybe they were written in a language other than English; I am not sure what happened to these records. Records 1 and 2 are fine to keep, even though the language in the sms column is informal (as people normally write in text messages). What would be a convenient way to achieve this, given that I have around 2 million records?

Edit: I want to remove any row with non-ASCII characters in the sms column.
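
To make the requirement concrete, here is a tiny made-up sample in the same shape as my data (the column names match, but the values are placeholders, not my real records):

import pandas as pd

# made-up sample in the same shape as my real data (~2 million rows)
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'city': ['khi', 'lhr', 'lhr', 'isb'],
    'department': ['revenue', 'revenue', 'revenue', 'accounts'],
    'sms': ['quk respns.', 'good.', '†h¡h2h', '?xœ1øiûüð÷üœç8i'],
    'category': [1, 1, 0, 0],
})
# rows with id 3 and 4 (non-ASCII sms) should be dropped, 1 and 2 kept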

  • Can you clarify what the full requirement is, for instance are you trying to filter out stuff that is not all English? Not ascii? – EdChum Sep 07 '17 at 10:58
  • I want to filter all records whose "sms" value is non-ascii. – Haroon S. Sep 07 '17 at 11:02
  • [GSM 03.38](https://www.csoft.co.uk/support/character-sets#gsm_set) supports non-ascii characters so I'm not sure it's valid to restrict it to just ascii – EdChum Sep 07 '17 at 11:03
  • @SalA. You might or might not want to keep certain non-ascii characters, but it solely depends on your use case. Ed Chum's answer retains them, my answer drops them. Carefully think and decide what it is you really need. – cs95 Sep 07 '17 at 11:04
  • Additionally it's possible to support other languages too https://en.wikipedia.org/wiki/GSM_03.38 – EdChum Sep 07 '17 at 11:05
  • See my updated answer, this does the filtering in a single pass – EdChum Sep 07 '17 at 11:25

1 Answer


Use str.isalnum to keep only the rows where the sms value is alphanumeric, filtering out everything else:

In[46]:
df[df['sms'].str.isalnum()]

Out[46]: 
   id city department          sms  category
0   1  khi    revenue  quk respns.         1
1   2  lhr    revenue        good.         1
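
As the comments under this answer point out, isalnum checks whether every character is alphanumeric, not whether it is ASCII, so it keeps accented letters and rejects strings containing spaces or punctuation. A quick illustration in plain Python:

# isalnum tests alphanumeric-ness, not ASCII-ness
print('å'.isalnum())      # True  -- non-ASCII, but alphabetic
print('test.'.isalnum())  # False -- '.' is not alphanumeric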

Update

If you really want to filter out any rows containing non-ASCII characters, then you can use a regex pattern:

In[72]:
df[~df['sms'].str.contains(r'[^\x00-\x7F]+')]

Out[72]: 
   id city department          sms  category
0   1  khi    revenue  quk respns.         1
1   2  lhr    revenue        good.         1

The regex pattern `[^\x00-\x7F]+` works like this: `\x00-\x7F` covers the hex values in the ASCII range 0–127 (up to but not including 128), and the leading `^` inside the character class negates it, so the pattern matches one or more non-ASCII characters anywhere in the text. `str.contains` therefore flags every row whose sms contains non-ASCII, and the `~` inverts that mask so only the all-ASCII rows are used to index the df.
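
If it helps to see the intermediate mask, here is a minimal sketch (the column name is taken from the question; the sample strings are made up):

import pandas as pd

df = pd.DataFrame({'sms': ['quk respns.', 'good.', '†h¡h2h']})

# True where the sms contains at least one character outside \x00-\x7F
has_non_ascii = df['sms'].str.contains(r'[^\x00-\x7F]+')
print(has_non_ascii.tolist())  # [False, False, True]

# invert the mask with ~ to keep only the all-ASCII rows
clean = df[~has_non_ascii]

As noted in the comments above, this does the filtering in a single pass over the column, which matters with around 2 million rows.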

EdChum
  • `str.isalnum('å')` -> `True`... I don't think OP wants _any_ non-ascii character. – cs95 Sep 07 '17 at 10:54
  • @cᴏʟᴅsᴘᴇᴇᴅ why should this not be valid given the OP's posted sample, currently they haven't stated the full requirements other than filtering garbage characters – EdChum Sep 07 '17 at 10:59
  • Err... fair point. Btw your rows are skewed :p – cs95 Sep 07 '17 at 11:00
  • @cᴏʟᴅsᴘᴇᴇᴅ [GSM 03.38](https://www.csoft.co.uk/support/character-sets#gsm_set) supports that character, otherwise Nokia and Ericsson phones would have been useless in their native countries, also I'm using ipython QTConsole so the output just looks like that – EdChum Sep 07 '17 at 11:02
  • @EdChum I have updated the question. You are implementing this on `department` column, while I want to do this based on `sms` column. Further, `str.isalnum()` returns false for punctuations `str.isalnum("test.") -> False`. I will try regex and will update. – Haroon S. Sep 07 '17 at 11:41
  • I think your original post had the columns named differently, I'll update but I think the regex should work – EdChum Sep 07 '17 at 12:05
  • regex worked just fine. Thanks a bunch! – Haroon S. Sep 07 '17 at 12:08