
My dataframe is cluttered with special characters and company suffixes (such as "inc" or "Pvt Ltd") that I'm trying to get rid of.

df
--
Microsoft inc
google INC
Apple Pvt Ltd
orc~l PvT ltd
Am#@zon Pvt Ltd


Expected output:

df
--
Microsoft
google
Apple
oracl
Amazon


What I tried:

word_list = ['inc', 'INC', 'Pvt Ltd', 'PvT ltd']
df1 = ''.join([repl if idx in word_list else idx for idx in df])
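
For illustration, here is a minimal sketch of one way the cleanup described above might be done with regular expressions. It is not a definitive fix. The names `SUFFIXES` and `clean_name` are hypothetical, and the suffix list is assumed from the `word_list` above. The sketch can only delete characters, so it yields `orcl` and `Amzon` rather than the `oracl` and `Amazon` shown in the expected output; recovering missing letters would need a separate lookup.

```python
import re

# Assumed suffix list, based on word_list in the question; extend as needed.
SUFFIXES = ["inc", "pvt ltd"]

# Build one case-insensitive pattern that matches any suffix as a whole word,
# allowing any amount of whitespace between the words of a suffix.
_SUFFIX_RE = re.compile(
    r"\b(?:"
    + "|".join(r"\s+".join(map(re.escape, s.split())) for s in SUFFIXES)
    + r")\b",
    re.IGNORECASE,
)

# Anything that is not an ASCII letter, digit, or whitespace is treated as "special".
_SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")


def clean_name(name):
    """Remove known company suffixes and special characters from one name."""
    name = _SUFFIX_RE.sub("", name)   # drop "inc", "Pvt Ltd", ... in any casing
    name = _SPECIAL_RE.sub("", name)  # drop characters such as ~, #, @
    return " ".join(name.split())     # trim and collapse leftover whitespace


names = ["Microsoft inc", "google INC", "Apple Pvt Ltd", "orc~l PvT ltd", "Am#@zon Pvt Ltd"]
cleaned = [clean_name(n) for n in names]
```

Assuming the column is named `COL_A`, the same function could be applied to the dataframe with `df["COL_A"] = df["COL_A"].map(clean_name)`. Matching case-insensitively avoids listing every casing variant (`inc`, `INC`, `PvT ltd`) separately.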

  • I could answer this, but I suspect it's better to change how you populate your df. Where is the data coming from, and how do you get it into a df? – doctorlove Dec 02 '21 at 14:16
  • df is loaded from an RDBMS table (df = SELECT COL_A FROM TBL). I tried to fix the data via SQL, but I realized that my query keeps getting longer as more unwanted text is added. –  Dec 02 '21 at 14:17
  • What data type is the column? I suspect some unicode/codec issues. – doctorlove Dec 02 '21 at 14:18
  • The data comes into the DB from an unknown (3rd-party) source. –  Dec 02 '21 at 14:21
  • OK, but what type is it in the dataframe? (This might help: https://stackoverflow.com/questions/42421967/unicode-datas-of-a-dataframe-to-strings) – doctorlove Dec 02 '21 at 14:26
  • The dtype of the df is object. –  Dec 02 '21 at 14:30
  • You might need something like `df[column] = df[column].str.encode('utf-8')` – doctorlove Dec 02 '21 at 15:05
