Using regex to deal with escape characters in URLs

Question

I'm in the process of tokenizing strings which contain URLs. Here is the part I use to pick up the URLs:

regex_str = [r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+']

It picks up "regular" URLs perfectly fine; however some of the URLs look like this:

https:\/\/t.co\/c1taPXzi4X

How can I modify the regex so that it deals with the escape characters, in order to end up with a complete and clean URL?

Many thanks in advance! :)

score 0 · Answer 1 · edited May 23 '17 at 12:31

As pointed out in this other question, a URL can't contain a "\". Your regex seems OK to me; I've tested it against RegExr. The only thing I changed was to escape the backslashes after http.

answered Jun 23 '16 at 00:07

Tomás Gonzalez Dowling

Sorry if I was unclear: the URLs (or rather, strings) with the backslashes are already in my data. So I'm trying to turn https:\/\/t.co\/c1taPXzi4X into https://t.co/c1taPXzi4X. – user2763524 Jun 23 '16 at 00:21

score 0 · Answer 2 · answered Jun 23 '16 at 12:59

Calling re.sub to strip the backslashes before you apply the regex would work:

import re
re.sub(r"\\", "", r"https:\/\/abc.com\/defg")  # 'https://abc.com/defg'
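To show how that cleanup step could fit together with the pattern from the question, here is a minimal sketch. It rests on a few assumptions that are not in the original posts: the helper name extract_urls is hypothetical, re.IGNORECASE is added so that mixed-case paths such as c1taPXzi4X match the lowercase character classes, and only the two-character sequence \/ is unescaped (rather than every backslash, as the re.sub call above does) so that backslashes elsewhere in the text are left alone.

```python
import re

# The pattern from the question; re.IGNORECASE is an assumption added here
# so that [a-z] and [0-9a-f] also cover uppercase characters.
URL_RE = re.compile(
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',
    re.IGNORECASE,
)


def extract_urls(text):
    """Hypothetical helper: unescape '\\/' to '/', then find all URLs."""
    # Replace each backslash that is immediately followed by a slash.
    cleaned = re.sub(r"\\/", "/", text)
    # All groups in URL_RE are non-capturing, so findall returns full matches.
    return URL_RE.findall(cleaned)


print(extract_urls(r"see https:\/\/t.co\/c1taPXzi4X for details"))
# ['https://t.co/c1taPXzi4X']
```

If the data can contain backslashes that are not part of an escaped slash, limiting the substitution to \/ is the more conservative choice; if every backslash is known to be noise, the broader re.sub(r"\\", "", ...) works just as well.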

Yarnspinner
