1

I'm trying to use the re library to use regular expression to substitute text. However, even after I use astype(str), the column that I want to change is still being stored as an object.

import os
import re
import pandas as pd


#https://www.geeksforgeeks.org/update-column-value-of-csv-in-python/


# reading the csv file
df = pd.read_excel('Upload to dashboard - Untitled (1).xlsx', skiprows = 7)
df.drop(df.tail(7).index,inplace=True) # drop last n rows
print(df.head(3))

df['Publisher URL'] = df['Publisher URL'].fillna('')
df['Publisher URL'] = df['Publisher URL'].astype(str)
df['Publisher URL'] = df['Publisher URL'].str.replace('gerd#','')
print(df.dtypes)
df['Publisher URL2'] = df['Publisher URL'].str.replace('www.','')
trim = re.sub('(.*)(?:\bm\.)(.*)|(.*)','',df['Publisher URL'])
#https://docs.python.org/3/library/re.html#re.sub

print(df['trim'])
df.to_csv("C:/out.csv")

this is my output for print(df.dtypes)

[3 rows x 8 columns]
Unnamed: 0        object
Buying Agency     object
Advertiser        object
Publisher URL     object
Date              object
Buy Type          object
Ad Type           object
No. of Ads       float64

And my error is

  File ~\.spyder-py3\temp.py:30 in <module>
    trim = re.sub('(.*)(?:\bm\.)(.*)|(.*)','',df['Publisher URL'])

  File ~\Anaconda3\lib\re.py:210 in sub
    return _compile(pattern, flags).sub(repl, string, count)

TypeError: expected string or bytes-like object

How can I change my script so I can use the re.sub function?

The regex is correct, I had some help from this question:Regex Exclusions

Aki
  • 137
  • 1
  • 4
  • 17
  • 1
    You'll also need to deal with nulls. So do `df['Publisher URL2'] = df['Publisher URL2'].fillna('')` or somethin g like that. Followed by `df['Publisher URL2'] = df['Publisher URL2'].astype(str)`. – chitown88 Oct 12 '22 at 10:36
  • 1
    You probably wanted `df['trim'] = df['Publisher URL'].str.replace(r'(.*)(?:\bm\.)(.*)', r'\1\2', regex=True)` – Wiktor Stribiżew Oct 12 '22 at 10:37
  • hi @WiktorStribiżew, I want to replace mobile version of websites, so I want to find and replace the "m." version of websites ``` https://m.gsmarena.com/sony_xperia_5_iv-price-11838.php``` should be ``` https://gsmarena.com/sony_xperia_5_iv-price-11838.php``` However, when I try and find an replace "m." this includes websites like this ``` https://redfm.ie/``` becomes ``` https://redfie/``` which changes the website name. – Aki Oct 12 '22 at 16:16
  • https://regex101.com/r/VsiuWO/1 this regex code shows the regex working the way i want it to – Aki Oct 12 '22 at 16:17
  • 1
    So, `df['trim'] = df['Publisher URL'].str.replace(r'^(.*?)\bm\.', r'\1', regex=True)` will work as you need. – Wiktor Stribiżew Oct 12 '22 at 17:26
  • 1
    It seems you have been tricked by regex101 regex flavor setting, too. You are using a PCRE option, and the replacement pattern is `$1`, in Python, it is `\1`. – Wiktor Stribiżew Oct 12 '22 at 17:33

1 Answers1

1

You can use

df['trim'] = df['Publisher URL'].str.replace(r'^(.*?)\bm\.', r'\1', regex=True)

Or, a better idea is to use

df['trim'] = df['Publisher URL'].str.replace(r'^(.*?[/.])m\.', r'\1', regex=True)

See regex demo #1 and regex demo #2

Details:

  • ^ - start of string
  • (.*?[/.]) - Group 1 (in Python, \1, not $1, refers to this group pattern match): any chars other than line break chars, as few as possible, and then a / or .
  • m\. - m. substring.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    that's worked, thank you so much for the indepth answer! I didn't know that regex101 had a separate python option as well – Aki Oct 13 '22 at 10:49