
I have a CSV file of SMS text messages in UTF-8 encoding.

import pandas as pd

# the file is UTF-8 encoded; sep=',' is pandas' default, shown explicitly
data = pd.read_csv('my_data.csv', sep=',', encoding='utf-8')
data.head()

It produces output like this:

id    city    department    sms                    category
01    khi      revenue      quk respns.                1
02    lhr      revenue      good.                      1
03    lhr      revenue      †h\0h2h\0hh\               0
04    isb      accounts     ?xœ1øiûüð÷üœç8i            0
05    isb      accounts     %â¡ã‘ã¸$ãªã±t%rã«ãÿã©â£    0

I want to remove all the records/rows where the sms column has garbage values, as in records 3, 4 and 5. Maybe they were written in a language other than English; I am not sure what happened to these records. Records 1 and 2 are fine to keep, even though the language in the sms column is informal (as people normally write in text messages). What would be a convenient way to achieve this, given that I have around 2 million records?

Edit: I want to remove any row with non-ASCII characters in the sms column.
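
To make the requirement concrete, here is a tiny made-up sample in the same shape as my data (the column names match, but the values are placeholders, not my real records):

import pandas as pd

# made-up sample in the same shape as my real data (~2 million rows)
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'city': ['khi', 'lhr', 'lhr', 'isb'],
    'department': ['revenue', 'revenue', 'revenue', 'accounts'],
    'sms': ['quk respns.', 'good.', '†h¡h2h', '?xœ1øiûüð÷üœç8i'],
    'category': [1, 1, 0, 0],
})
# rows with id 3 and 4 (non-ASCII sms) should be dropped, 1 and 2 kept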

  • Can you clarify what the full requirement is, for instance are you trying to filter out stuff that is not all English? Not ascii? – EdChum Sep 07 '17 at 10:58
  • I want to filter all records whose "sms" value is non-ascii. – Haroon S. Sep 07 '17 at 11:02
  • [GSM 03.38](https://www.csoft.co.uk/support/character-sets#gsm_set) supports non-ascii characters so I'm not sure it's valid to restrict it to just ascii – EdChum Sep 07 '17 at 11:03
  • @SalA. You might or might not want to keep certain non-ascii characters, but it solely depends on your use case. Ed Chum's answer retains them, my answer drops them. Carefully think and decide what it is you really need. – cs95 Sep 07 '17 at 11:04
  • Additionally it's possible to support other languages too https://en.wikipedia.org/wiki/GSM_03.38 – EdChum Sep 07 '17 at 11:05
  • See my updated answer, this does the filtering in a single pass – EdChum Sep 07 '17 at 11:25

1 Answer


Use str.isalnum to keep only the rows where the sms value is alphanumeric, filtering out everything else:

In[46]:
df[df['sms'].str.isalnum()]

Out[46]: 
   id city department          sms  category
0   1  khi    revenue  quk respns.         1
1   2  lhr    revenue        good.         1
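
As the comments under this answer point out, isalnum checks whether every character is alphanumeric, not whether it is ASCII, so it keeps accented letters and rejects strings containing spaces or punctuation. A quick illustration in plain Python:

# isalnum tests alphanumeric-ness, not ASCII-ness
print('å'.isalnum())      # True  -- non-ASCII, but alphabetic
print('test.'.isalnum())  # False -- '.' is not alphanumeric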

Update

If you really want to filter out any rows containing non-ASCII characters, then you can use a regex pattern:

In[72]:
df[~df['sms'].str.contains(r'[^\x00-\x7F]+')]

Out[72]: 
   id city department          sms  category
0   1  khi    revenue  quk respns.         1
1   2  lhr    revenue        good.         1

The regex pattern `[^\x00-\x7F]+` works like this: `\x00-\x7F` covers the hex values in the ASCII range 0–127 (up to but not including 128), and the leading `^` inside the character class negates it, so the pattern matches one or more non-ASCII characters anywhere in the text. `str.contains` therefore flags every row whose sms contains non-ASCII, and the `~` inverts that mask so only the all-ASCII rows are used to index the df.
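
If it helps to see the intermediate mask, here is a minimal sketch (the column name is taken from the question; the sample strings are made up):

import pandas as pd

df = pd.DataFrame({'sms': ['quk respns.', 'good.', '†h¡h2h']})

# True where the sms contains at least one character outside \x00-\x7F
has_non_ascii = df['sms'].str.contains(r'[^\x00-\x7F]+')
print(has_non_ascii.tolist())  # [False, False, True]

# invert the mask with ~ to keep only the all-ASCII rows
clean = df[~has_non_ascii]

As noted in the comments above, this does the filtering in a single pass over the column, which matters with around 2 million rows.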

EdChum
  • `str.isalnum('å')` -> `True`... I don't think OP wants _any_ non-ascii character. – cs95 Sep 07 '17 at 10:54
  • @cᴏʟᴅsᴘᴇᴇᴅ why should this not be valid given the OP's posted sample, currently they haven't stated the full requirements other than filtering garbage characters – EdChum Sep 07 '17 at 10:59
  • Err... fair point. Btw your rows are skewed :p – cs95 Sep 07 '17 at 11:00
  • @cᴏʟᴅsᴘᴇᴇᴅ [GSM 03.38](https://www.csoft.co.uk/support/character-sets#gsm_set) supports that character, otherwise Nokia and Ericsson phones would have been useless in their native countries, also I'm using ipython QTConsole so the output just looks like that – EdChum Sep 07 '17 at 11:02
  • @EdChum I have updated the question. You are implementing this on `department` column, while I want to do this based on `sms` column. Further, `str.isalnum()` returns false for punctuations `str.isalnum("test.") -> False`. I will try regex and will update. – Haroon S. Sep 07 '17 at 11:41
  • I think your original post had the columns named differently, I'll update but I think the regex should work – EdChum Sep 07 '17 at 12:05
  • regex worked just fine. Thanks a bunch! – Haroon S. Sep 07 '17 at 12:08