Url Regex: Finding Either or string(s) within another string with preference to an earlier one

Question

Before you automatically mark me down or assume that this question was asked without research, please read my post first. I believe this is a slightly more difficult problem than it appears. Edit: this may be a PyDev problem, as regex checkers state that the solution(s) should work.

I looked online but could only find articles about matching a string followed by one of several characters (x, y, z), e.g. "python's re: return True if regex contains in the string", where in order to find bar, bad, or baz you simply need to do: ba[d|r|z].

I am currently pulling in a website's source code and analyzing it. I pull in each inner section of the code that contains a relevant URL (.swf). It might look like: { my variable .... my other stuff.... my url.swf my other url... etc }

I have these successfully pulled in. Admittedly I am new to Python (primarily Java, ActionScript and JavaScript in the past). What is unique about my issue is that the formatting of URLs varies a good deal.

I can look for a url that starts with http, https, or www like this

(http|https|www)[^"]+ or something similar. However, a website's source may contain URLs like this: "//blah", which actually means: http://www.myurl.com/blah. I have solved interpreting and correcting this by concatenating it together later.

My issue is in reliably finding the URLs within a string because of the "//" prefix.

Essentially, I am looking for a way to regex-match strings instead of characters, in an "or" manner, with preference given to the earlier strings. For example, I know I can match "http" and only look for strings with that, doing each one individually, e.g. http[^\'|;|,|(|)|{|}|=]\" www.[^\'|;|,|(|)|{|}|=]\" etc. However, I'd prefer to do it in one line and resolve the issue of http://... and //... being picked up as different matches, or of my code simply knocking off the http:, because that changes how the URL is interpreted.

They can start with http, https, www, or // (I lower cased the string that I am comparing to) and I've determined that they end with a " in nearly every case.

So my regex looks like this:

(http|https|www|//)[^\'|;|,|(|)|{|}|=]*\"'

However it's not currently working.

I don't know how to "or" strings; everywhere I look, the examples are about characters. I've tried enclosing the strings in () within [] etc., but to no avail.

Oh and I'm using python.

An example of what I'm using as the original text would be:

str: {var >a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http://o.aolcdn.com/os/moat/>prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattribute(\"style\",v);>b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode>\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var >a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0>px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0}

Which doesn't give me the desired result.

Instead of pulling http://o.aolcdn.com/os/moat/>prod/p5.v1e.swf\ only I pull:
(for each element in the array), (the string)
00  str: =\\"moatpx\\"  
01  str: (\\"object\\"  
02  str: (\\"data\\"    
03  str: http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\\" 
04  str: (\\"id\\"  
05  str: (\\"name\\"    
06  str: (\\"style\\"   
07  str: (\\"width\\"   
08  str: (\\"height\\"  
09  str: t+\\"\\"   
10  str: shvars\\"  
11  str: wmode\\"   
12  str: transparent\\" 
13  str: wscriptaccess\\"   
14  str: ways\\"    
15  str: (\\"div\\" 
16  str: =\\"moatpxdiv\\"   
17  str: =\\"0px\\" 
18  str: =\\"0px\\"

Thank you!

Oh edit: If you feel that my regex is not accurate and needs to be fixed here are my requirements: It must take in a string starting with http, https, www, or //. It must prefer http/https to www, and to //. It must end with the first " it comes in contact with. It must also be a normal url, it may not contain commas (,), ; etc.

Test cases (had to add spaces to make it not a url due to stack overflow limit) :

str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http: //o.          aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattri bute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0}

With regex:

(http:\/\/|https:\/\/|www\.|\/\/)[^"]+

it matched "http://" instead of the full URL.

The regex101 checker states that it should work; however, in my code it does not.

My code: links = re.findall('(http://|https://|www.|//)[^"]+', obj), where obj is the above code block, returns links = ["http://"].

This is in pydev and I'm looking through the debugger.

Solution As of Now:

(?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+

To define a character set [], don't use | at all. Some characters need to be escaped, like \. Try `(http|https|www)[^\\';,(){}=]*\"'` – Fabricator, Jun 05 '14 at 00:52
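The point in the comment above can be checked with a small sketch. The sample text here is made up for illustration; it only shows that `|` inside `[...]` is treated as a literal character, while whole strings are alternated inside a group:

```python
import re

# Made-up sample text; "ba|" is included to expose the stray pipe.
text = "bad bar baz ba| bat"

# Inside [...] every character is literal, so '|' becomes an allowed character.
with_pipes = re.findall(r"ba[d|r|z]", text)

# The intended character class simply lists the characters.
char_class = re.findall(r"ba[drz]", text)

# Whole strings are alternated with '|' inside a (non-capturing) group.
prefixes = re.findall(r"(?:https|http|www)", "http https www ftp")

print(with_pipes)  # ['bad', 'bar', 'baz', 'ba|']
print(char_class)  # ['bad', 'bar', 'baz']
print(prefixes)    # ['http', 'https', 'www']
```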

Fabricator · Accepted Answer · 2014-06-05T18:21:26.530

0

Here you go:

print(re.findall(r'((?:http://|https://|www\.|//)[^"]+)', s))

(?:...) is a non-capturing group.

When the pattern was (http://|https://|www\.)[^"]+, re.findall returned only the text captured by the group (e.g. http://), not the whole match.
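A minimal sketch of that difference, using a shortened version of the test string from the question (the variable names are only for illustration):

```python
import re

# Shortened fragment of the question's test string.
s = 'b.setattribute("data","http://o.aolcdn.com/os/moat/prod/p5.v1e.swf");'

# One capturing group: findall returns only what the group matched.
captured = re.findall(r'(http://|https://|www\.|//)[^"]+', s)

# Non-capturing group: findall returns the whole match.
full = re.findall(r'(?:http://|https://|www\.|//)[^"]+', s)

# Outer capturing group around everything: same result as the whole match.
wrapped = re.findall(r'((?:http://|https://|www\.|//)[^"]+)', s)

print(captured)  # ['http://']
print(full)      # ['http://o.aolcdn.com/os/moat/prod/p5.v1e.swf']
print(wrapped)   # ['http://o.aolcdn.com/os/moat/prod/p5.v1e.swf']
```

This matches what the asker saw in the debugger: the regex itself matched the full URL, but findall reported only the group.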

edited Jun 05 '14 at 18:21

answered Jun 05 '14 at 00:58

Fabricator


according to: http://regex101.com this does not match for an example of : www.asdfs.com.gif" in addition it lacks the \/\/ portion. Another issue I was having was that it was matching the // of the https:// and going straight to //... my something. I want it to select the https before the // so a url https://www.myurl.com" is selected instead of //www.myurls.com" because on a webpage's source those are two different urls. For example //myotherurl/asdfs" is actually http://www.theurlImOn/myotherurl/asdfs" – Tai Jun 05 '14 at 00:59
can you provide the test cases? the earlier one only works for the one you provided – Fabricator Jun 05 '14 at 01:03
I'm sorry I edited my above post. The issue is the //. I need it to not cut off the http and use the // only. Some urls have //something, and others are http://something. The issue is //urls are different than http ones. // ones must be concatenated onto the base url to be readable. – Tai Jun 05 '14 at 01:11
@TaiHirabayashi, I'm still unsure about your exact request. can you provide some test strings, and what the matches should be? – Fabricator Jun 05 '14 at 01:14
For an example: str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattribute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0} -- The code pulled "http://" – Tai Jun 05 '14 at 01:18
Ok: So for the above string I get "http://" instead of http://something when I run it in pydev. however in a regex checker it states that it should work. Which is strange any ideas? – Tai Jun 05 '14 at 01:25
@TaiHirabayashi, I see. looks like it only returned what's captured in (). I'll make the fix – Fabricator Jun 05 '14 at 02:07
Thanks for fixing that however I still am not pulling //url.com" for example – Tai Jun 05 '14 at 18:13
I believe this works as a solution: (?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+ Thank you very much for your help – Tai Jun 05 '14 at 18:15

Padraic Cunningham · Answer 2 · 2014-06-05T02:00:47.170

0

# Using your example:

s = "{\"www.asdfs.com.gif\",'var' >a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute('www:'\"data\",\"http://.o.aolcdn.com/os/moat/>prod/p5.v1e.swf\"}"



print(re.findall(r'"(www.*?|\w+://.*?)"', s))

['www.asdfs.com.gif', 'http://.o.aolcdn.com/os/moat/>prod/p5.v1e.swf']

edited Jun 05 '14 at 02:00

answered Jun 05 '14 at 01:33

Padraic Cunningham


This does not find all of the cases in regex101 tester that I'm using. It finds http.... but not //asdfsd.com for example – Tai Jun 05 '14 at 18:12
It finds any www. Http or https in the same format as your example – Padraic Cunningham Jun 05 '14 at 18:23
In my example I needed to handle //url.blah also and ensure that http:// was handled before // which was what made the question difficult for me. Thank you for your help. I believe user3678068 provided the solution. – Tai Jun 05 '14 at 18:31

score 0 · Answer 3 · answered Jun 05 '14 at 18:37

In order to make sure this is easy to read, I'm posting the solution here so it isn't nested in the comments. The correct solution is:

            links = re.findall(r'(?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+', obj)

Many thanks to user3678068 for helping solve this.

For something like this:

str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattribute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0}

I receive:

str: http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\
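As a rough sketch of how the prefix preference works: the regex engine takes the leftmost match, so in https://www... the match starts at the h and the later // and www. are consumed as part of the same URL. The sample string and the example.com URLs below are hypothetical. Two changes are my own assumptions rather than part of the solution above: `https?://` is shorthand for `http://|https://`, and the backslash is excluded from the character class so that the trailing `\` seen in the output above is not captured.

```python
import re

# Assumption: excluding the backslash as well as the quote drops the trailing "\".
URL_RE = re.compile(r'(?:https?://|www\.|//)[^"\\]+')

# Hypothetical sample; the last entry uses escaped quotes (\") like the real source.
sample = (
    'a("https://www.example.com/a.swf");'
    'b("//cdn.example.com/b.swf");'
    'c("www.example.com/c.swf");'
    'd(\\"http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\\");'
)

links = URL_RE.findall(sample)
for link in links:
    print(link)
# https://www.example.com/a.swf
# //cdn.example.com/b.swf
# www.example.com/c.swf
# http://o.aolcdn.com/os/moat/prod/p5.v1e.swf
```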

score 0 · Answer 4 · answered Jun 21 '17 at 11:08

Though the accepted answer might have been sufficient for you, my best suggestion would be to use urlparse (https://docs.python.org/2/library/urlparse.html -> for python 2.x and https://docs.python.org/3.0/library/urllib.parse.html -> for python 3.x)

This takes care of all types of protocols, deals with the complete HTTP URL spec, gives output in an easily usable form, and you don't have to reinvent the wheel!
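A small sketch of what that could look like in Python 3, assuming the candidate strings have already been extracted from the page. The base URL and the cdn.example.com entry are hypothetical. Note that urljoin treats a leading // as "same scheme, different host", and a single leading / as a path on the base host:

```python
from urllib.parse import urljoin, urlparse

# Hypothetical URL of the page the strings were scraped from.
base = "http://www.myurl.com/page.html"

candidates = [
    "//cdn.example.com/a.swf",                      # scheme-relative
    "/blah",                                        # path on the base host
    "http://o.aolcdn.com/os/moat/prod/p5.v1e.swf",  # already absolute
]

# Resolve every candidate against the base URL.
resolved = [urljoin(base, c) for c in candidates]
print(resolved[0])  # http://cdn.example.com/a.swf
print(resolved[1])  # http://www.myurl.com/blah
print(resolved[2])  # http://o.aolcdn.com/os/moat/prod/p5.v1e.swf

# urlparse splits an absolute URL into named parts.
parts = urlparse(resolved[2])
print(parts.scheme, parts.netloc, parts.path)
# http o.aolcdn.com /os/moat/prod/p5.v1e.swf
```

A bare www.example.com/... string has no scheme, so urljoin would treat it as a relative path; it probably needs a scheme prepended before being resolved.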
