
I'm in the process of tokenizing strings which contain URLs. Here is the part I use to pick up the URLs:

regex_str = [r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+']

It picks up "regular" URLs perfectly fine; however some of the URLs look like this:

https:\/\/t.co\/c1taPXzi4X

How can I modify the regex so that it deals with the escape characters, in order to end up with a complete and clean URL?

Many thanks in advance! :)

2 Answers


As pointed out in this other question, you can't have a "\" in a URL. Your regex seems OK to me; I've tested it on RegExr. The only thing I've done is escape the backslashes after http.
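One way to read "escape the backslashes after http" is to let the pattern accept an optional backslash before each slash. The sketch below is only an illustration under that reading: `ESCAPED_URL` and `find_urls` are hypothetical names, and the character class is a simplified stand-in for the one in the question, not the asker's exact pattern.

```python
import re

# Assumed variant of the question's pattern: each "/" after the scheme may be
# preceded by an optional backslash, and "\" is also allowed in the URL body.
ESCAPED_URL = re.compile(
    r'https?:\\?/\\?/(?:[a-z0-9$\-_@.&+!*(),/\\]|%[0-9a-f]{2})+',
    re.IGNORECASE,
)

def find_urls(text):
    """Return the URLs found in text, with every "\\/" turned back into "/"."""
    return [match.replace('\\/', '/') for match in ESCAPED_URL.findall(text)]
```

Calling `find_urls` on a string containing https:\/\/t.co\/c1taPXzi4X should give back the clean form of the URL, while already-clean URLs pass through unchanged.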

Community
  • Sorry if I was unclear, the URLs (or strings rather) with the backslashes are already in my data. So I'm trying to turn https:\/\/t.co\/c1taPXzi4X into https://t.co/c1taPXzi4X – user2763524 Jun 23 '16 at 00:21

Calling re.sub to strip the backslashes before you apply the regex would work:

import re
re.sub(r"\\", "", r"https:\/\/abc.com\/defg")  # returns 'https://abc.com/defg'
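Putting the two steps together might look like the sketch below. It assumes the backslashes only ever appear as the JSON-style "\/" escape, so it replaces just that pair instead of removing every backslash; `URL_RE` and `extract_urls` are hypothetical names, and the pattern is the one from the question, unchanged.

```python
import re

# The pattern from the question, unchanged.
URL_RE = re.compile(
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+'
)

def extract_urls(text):
    """Unescape "\\/" to "/" and then pick up URLs with the original regex."""
    # Replace only the escaped slash so any other backslashes are left alone.
    cleaned = text.replace('\\/', '/')
    return URL_RE.findall(cleaned)
```

Replacing only "\/" is a slightly more conservative choice than deleting all backslashes, which may matter if the rest of the text contains backslashes that should be kept.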
Yarnspinner