
I want to remove all URLs from a column. The column is string-typed. My DataFrame has two columns: str_val [str] and str_length [int]. I am using the following code:

import time

t1 = time.time()
reg_exp_val = r"((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+)"
# regex=True is passed explicitly so the pattern is treated as a regular
# expression (the default changed to regex=False in pandas 2.0)
df_mdr_pd['str_val1'] = df_mdr_pd.str_val.str.replace(reg_exp_val, '', regex=True)
print(time.time() - t1)

When I run the code on 10,000 rows, it finishes in 0.6 seconds. For 100,000 rows the execution just gets stuck. I also tried slicing the frame with .loc[i:i+10000] and running the replacement in a for loop, but that did not help either.
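
A sketch of that chunked attempt, reconstructed from the description above (the chunk size and loop shape are illustrative):

chunk_size = 10_000
t1 = time.time()
# Apply the replacement 10,000 rows at a time instead of over the whole column.
for start in range(0, len(df_mdr_pd), chunk_size):
    idx = df_mdr_pd.index[start:start + chunk_size]
    df_mdr_pd.loc[idx, 'str_val1'] = df_mdr_pd.loc[idx, 'str_val'].str.replace(reg_exp_val, '', regex=True)
print(time.time() - t1)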

Maria
    Using your code, I am getting around 2 s for a 1M-row DataFrame of randomly generated URLs, so I can't explain your timing. A simpler regex should be possible [see link](https://stackoverflow.com/questions/11331982), though that does not explain the timing either. – user19077881 Feb 06 '23 at 12:43
  • It is probably RAM. I would try converting the column to a list, applying the regex to the list, and turning it back into a DataFrame after processing; DataFrames have a large overhead (a sketch follows below these comments). – Atanas Atanasov Feb 06 '23 at 12:51
  • @user19077881 Yeah, I copied the regexp from a verified source (or so I thought), but it got stuck for some of my code examples. – Maria Feb 06 '23 at 16:26
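
A sketch of the list round-trip Atanas suggests, assuming the reg_exp_val and df_mdr_pd from the question; re.sub does the per-string work:

import re

# Compile the pattern once, then work on a plain Python list to avoid
# per-row DataFrame overhead.
url_re = re.compile(reg_exp_val)
values = df_mdr_pd['str_val'].tolist()
df_mdr_pd['str_val1'] = [url_re.sub('', v) for v in values]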

1 Answer


The problem was due to the regex I was using. The one that worked for me was:

r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))",

It was taken from this link.
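
The working pattern can be applied the same way as the original snippet; a sketch, with regex=True passed explicitly as above:

reg_exp_val = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]|\(([^\s()<>]|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
# (?i) inside the pattern already makes the match case-insensitive
df_mdr_pd['str_val1'] = df_mdr_pd.str_val.str.replace(reg_exp_val, '', regex=True)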

Maria