I have a DataFrame with a column named source that I need to clean by removing certain substrings, i.e. replacing them with an empty string. For example, 'siliconvalley.co' should become 'siliconvalley'.

I created this list:

list = ['.com','.co','.de','.co.jp','.co.uk','.lk','.it','.es','.ua','.bg','.at','.kr']

and replaced each entry with an empty string:

for l in list:
    df['source'] = df['source'].str.replace(l,'')

In the output I am getting 'silinvalley', which means it has also replaced 'co' rather than only '.co'. I want the code to replace only text that exactly matches the pattern. Please help!

Dhiraj D
  • 2
    `Series.str.replace()` treats your search strings as regular expressions by default, so the dot means "any character". You either need to pass `regex=False` or escape the dots with a backslash, e.g., `'\.com'` instead of `'.com'`, so that they match only literal dots. – fsimonjetz Jul 15 '21 at 11:21
  • Could you please give us a small amount of your dataframe and what output you had? –  Jul 15 '21 at 11:22
  • 1
    By the way, it's bad practice to use builtin names like `list` as variable names, better use something like `domains = ['\.com', ...]`. – fsimonjetz Jul 15 '21 at 11:23
  • Does this answer your question? [How to remove strings present in a list from a column in pandas](https://stackoverflow.com/questions/51666374/how-to-remove-strings-present-in-a-list-from-a-column-in-pandas) – Henry Ecker Jul 15 '21 at 11:29
  • Do you just want to remove the domain TLD from the string? – Joshua Jul 15 '21 at 11:29
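
The first comment's point can be sketched as follows. This is only a minimal illustration: the sample values and the name `domains` are made up for the example and do not come from the question. It compares regex matching, literal matching with `regex=False`, and escaped patterns. Note that the comment describes the default at the time; from pandas 2.0 on, `Series.str.replace()` defaults to `regex=False`.

```python
import re
import pandas as pd

# Hypothetical sample data, not from the original question.
df = pd.DataFrame({"source": ["siliconvalley.co", "example.com"]})
domains = [".com", ".co"]  # shortened list, for illustration only

# As a regex, '.co' means "any character followed by 'co'",
# so it also matches the 'ico' inside 'silicon'.
as_regex = df["source"].str.replace(".co", "", regex=True)

# Literal replacement: only an actual '.co' is removed.
literal = df["source"].copy()
for d in domains:
    literal = literal.str.replace(d, "", regex=False)

# Escaping the dot gives the same result while staying in regex mode.
escaped = df["source"].copy()
for d in domains:
    escaped = escaped.str.replace(re.escape(d), "", regex=True)

print(pd.DataFrame({"as_regex": as_regex, "literal": literal, "escaped": escaped}))
```

Passing `regex` explicitly, as above, should keep the behaviour the same regardless of which pandas version is installed.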

1 Answer

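This would be one way: escape each entry with re.escape and join them into a single alternation pattern. You have to be careful with the order of the entries, because the regex engine tries the alternatives from left to right. If '.co' comes before '.co.uk', then 'google.co.uk' becomes 'google.uk' instead of 'google'.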

df["source"].replace('|'.join([re.escape(i) for i in list_]), '', regex=True)

Minimal example:

import pandas as pd
import re

# Longer suffixes must come before their prefixes ('.co.uk' and '.co.jp' before '.co')
list_ = ['.com','.co.uk','.co.jp','.co','.de','.lk','.it','.es','.ua','.bg','.at','.kr']

df = pd.DataFrame({
    'source': ['google.com', 'google.no', 'google.co.uk']
})

pattern = '|'.join([re.escape(i) for i in list_])

df["new_source"] = df["source"].replace(pattern, '', regex=True)

print(df)
#         source new_source
#0    google.com     google
#1     google.no  google.no
#2  google.co.uk     google
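
One way to avoid ordering the entries by hand is to sort them by length, longest first, before joining. The following is a sketch under that assumption rather than part of the original answer; the names `suffixes` and `ordered` and the sample rows are hypothetical.

```python
import re
import pandas as pd

# Same entries as in the question, in their original (unsafe) order.
suffixes = ['.com', '.co', '.de', '.co.jp', '.co.uk', '.lk',
            '.it', '.es', '.ua', '.bg', '.at', '.kr']

# Longest first, so '.co.uk' and '.co.jp' are tried before '.co'.
ordered = sorted(suffixes, key=len, reverse=True)

# Escape each entry so the dots match literally, then join as alternatives.
pattern = '|'.join(re.escape(s) for s in ordered)

# Hypothetical sample data.
df = pd.DataFrame({
    'source': ['siliconvalley.co', 'google.co.uk', 'google.co.jp', 'google.no']
})
df['new_source'] = df['source'].replace(pattern, '', regex=True)

print(df)
```

The pattern is still unanchored, so a listed suffix is removed wherever it occurs in the string, not only at the end.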
Anton vBR