Remove special chars and related texts from Dataframe

Asked Dec 02 '21 at 14:13

Active Dec 02 '21 at 14:13

Viewed 21 times

My dataframe is cluttered with special characters and some company suffixes that I'm trying to get rid of.

---
df
--
Microsoft inc
google INC
Apple Pvt Ltd
orc~l PvT ltd
Am#@zon Pvt Ltd


Expected output
--
df
--
Microsoft
google
Apple
oracl
Amazon


What I tried:
word_list= ['inc','INC','Pvt Ltd', 'PvT ltd']
df1= ''.join([repl if idx in word_list else idx for idx in df])


I could answer this, but I suspect it's better to change how you populate your df - where is the data coming from and how do you get it in a df? – doctorlove Dec 02 '21 at 14:16
The df is loaded from an RDBMS table (df = SELECT COL_A FROM TBL). I tried to fix the data via SQL, but I realized that my query keeps getting longer as more unwanted text gets added. – Dec 02 '21 at 14:17
What data type is the column? I'm suspecting some unicode/codec issues. – doctorlove Dec 02 '21 at 14:18
The data comes into the DB from an unknown (3rd-party) source. – Dec 02 '21 at 14:21
OK, but what type is it in the dataframe? (This might help: https://stackoverflow.com/questions/42421967/unicode-datas-of-a-dataframe-to-strings) – doctorlove Dec 02 '21 at 14:26
The datatype of the df is dtype: object. – Dec 02 '21 at 14:30
You might need something like `df[column] = df[column].str.encode('utf-8')` – doctorlove Dec 02 '21 at 15:05
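
The cleanup described in the question could be sketched roughly as below. This is only a minimal sketch under assumptions, not a definitive fix: the names `SUFFIXES` and `clean_name` are made up for illustration, the suffix list is taken from the sample rows, and "special chars" is assumed to mean anything that is not a letter, digit, or whitespace. Note that deleting characters yields `orcl` and `Amzon`, not the `oracl` and `Amazon` shown in the expected output; restoring missing letters would likely need a lookup of known names, which is not attempted here.

```python
import re

# Hypothetical suffix list, taken from the sample data in the question.
# Matching is case-insensitive, so 'inc' also covers 'INC', and so on.
SUFFIXES = ["inc", "pvt ltd"]

# A suffix counts only when it is a trailing word (or words) of the value.
_suffix_re = re.compile(
    r"\s+(?:" + "|".join(re.escape(s) for s in SUFFIXES) + r")\s*$",
    re.IGNORECASE,
)

# Assumption: a "special char" is anything except letters, digits, whitespace.
_special_re = re.compile(r"[^A-Za-z0-9\s]")


def clean_name(value: str) -> str:
    """Strip a trailing company suffix, then drop special characters."""
    without_suffix = _suffix_re.sub("", value)
    return _special_re.sub("", without_suffix).strip()


names = [
    "Microsoft inc",
    "google INC",
    "Apple Pvt Ltd",
    "orc~l PvT ltd",
    "Am#@zon Pvt Ltd",
]
cleaned = [clean_name(n) for n in names]
print(cleaned)
```

Assuming the column is called `COL_A` as in the query mentioned in the comments, the same function could be applied to the dataframe with `df["COL_A"] = df["COL_A"].map(clean_name)`. Keeping the suffixes in one list means new unwanted text only needs a new list entry rather than a longer query.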
