I have a CSV file of SMS text messages in UTF-8 encoding.
import pandas as pd
data = pd.read_csv('my_data.csv', sep=',')
data.head()
It has output like:
id  city  department  sms                         category
01  khi   revenue     quk respns.                 1
02  lhr   revenue     good.                       1
03  lhr   revenue     †h\0h2h\0hh\                0
04  isb   accounts    ?xœ1øiûüð÷üœç8i             0
05  isb   accounts    %â¡ã‘ã¸$ãªã±t%rã«ãÿã©â£    0
I want to remove all the records/rows where the sms column has garbage values, as in records 3, 4 and 5. Maybe they were written in a language other than English; I am not sure what happened to these records. Records 1 and 2 are okay to keep, although the language used in the sms column is informal (as is normal in text messages). What would be a convenient way to achieve that, given that I have around 2 million records?
Edit:
I want to remove any row with non-ASCII characters in the sms column.
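
To make the requirement concrete, here is a minimal sketch of the filter I have in mind. It uses a small made-up frame in place of my_data.csv, and the helper name keep_ascii_rows is hypothetical, not something from pandas. It assumes Python 3.7+ for str.isascii, and it treats missing values as the string "nan", so those rows would be kept.

```python
import pandas as pd


def keep_ascii_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return only the rows whose `column` value is pure ASCII.

    Hypothetical helper for illustration; values are cast to str first,
    so a missing value becomes "nan" and is therefore kept.
    """
    # str.isascii() is True when every character has a code point below 128.
    mask = df[column].astype(str).map(str.isascii)
    return df[mask].reset_index(drop=True)


# Toy data standing in for my_data.csv (values invented for this example).
data = pd.DataFrame({
    "id": ["01", "02", "03"],
    "sms": ["quk respns.", "good.", "\u2020h\u00fc\u00f0"],
})

clean = keep_ascii_rows(data, "sms")
print(clean)
```

On the real file this would be called as keep_ascii_rows(data, 'sms') after read_csv. I have not measured how it performs on 2 million rows.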