
Before you automatically mark me down or assume that this question was asked without research, please read my post first. I believe that this is a slightly more difficult problem than it appears... Edit: this may be a PyDev problem, as regex checkers state that the solution(s) should work.

I looked online but could only find articles about matching a string followed by one of several characters (x, y, z). Ex: "python's re: return True if regex contains in the string", where in order to find bar, bad, or baz you simply need: ba[drz].

I am currently pulling in a website's source code and analyzing it. I pull in each inner section of the code that contains a relevant URL (.swf). It might look like: { my variable .... my other stuff.... my url.swf my other url... etc }

I have these successfully pulled in. Admittedly I am new to Python (primarily Java, ActionScript, and JavaScript in the past). What is unique about my issue is that the formatting of the URLs varies a good deal.

I can look for a URL that starts with http, https, or www like this:

(http|https|www)[^"]+

or something similar. However, a website's source may contain URLs like "//blah", which actually means http://www.myurl.com/blah. I have solved interpreting and correcting this by concatenating it together later.

My issue is in reliably finding the URLs within a string because of the "//" prefix.

Essentially I am looking for a way to regex-match strings instead of characters, in an "or" manner, with preference given to the earlier strings. For example, I know I can match "http" and only look for strings with that, and do each one individually, ex: http[^\'|;|,|(|)|{|}|=]\" www.[^\'|;|,|(|)|{|}|=]\" etc. However, I'd prefer to do it in one line and resolve the issue of http://... and //... being picked up as different URLs, or my code simply knocking off the http:, because that changes how the URL is interpreted.

They can start with http, https, www, or // (I lowercased the string that I am comparing to), and I've determined that they end with a " in nearly every case.

So my regex looks like this:

(http|https|www|//)[^\'|;|,|(|)|{|}|=]*\"'

However it's not currently working.

I don't know how to "or" strings; everywhere I look, the examples only cover characters. I've tried encasing the strings in () within a [] etc., but to no avail.

Oh and I'm using python.

An example of what I'm using as the original text would be:

str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattribute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0}

Which doesn't give me the desired result.

Instead of pulling only http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\ I pull the following (array index, then the matched string):
00  str: =\\"moatpx\\"  
01  str: (\\"object\\"  
02  str: (\\"data\\"    
03  str: http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\\" 
04  str: (\\"id\\"  
05  str: (\\"name\\"    
06  str: (\\"style\\"   
07  str: (\\"width\\"   
08  str: (\\"height\\"  
09  str: t+\\"\\"   
10  str: shvars\\"  
11  str: wmode\\"   
12  str: transparent\\" 
13  str: wscriptaccess\\"   
14  str: ways\\"    
15  str: (\\"div\\" 
16  str: =\\"moatpxdiv\\"   
17  str: =\\"0px\\" 
18  str: =\\"0px\\" 

Thank you!

Oh edit: If you feel that my regex is not accurate and needs to be fixed here are my requirements: It must take in a string starting with http, https, www, or //. It must prefer http/https to www, and to //. It must end with the first " it comes in contact with. It must also be a normal url, it may not contain commas (,), ; etc.

Test cases (had to add spaces to make it not a url due to stack overflow limit) :

str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http: //o.          aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattri bute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0}

With regex:

(http:\/\/|https:\/\/|www\.|\/\/)[^"]+

it matched "http://" instead of the full URL.

The regex101 checker states that it should work; however, in my code it does not.

My code: links = re.findall('(http://|https://|www.|//)[^"]+', obj), where obj is the above code block, returns links = ["http://"].

This is in PyDev, and I'm looking at the result through the debugger.

Solution as of now:

(?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+
Tai

4 Answers


Here you go:

print re.findall('((?:http://|https://|www\.|//)[^"]+)', s)

(?:...) is a non-capturing group.

When the pattern was (http://|https://|www\.)[^"]+, re.findall returned only what the capturing group matched (http://) rather than the whole URL.
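
To make the difference concrete, here is a minimal sketch of how re.findall behaves with and without a capturing group. The sample string is a shortened, hypothetical version of the test string from the question, and raw strings are used for the patterns as a matter of habit.

```python
import re

# Hypothetical sample, shortened from the question's test string.
s = 'b.setattribute("data","http://o.aolcdn.com/os/moat/prod/p5.v1e.swf");'

# One capturing group: findall returns only what the group captured.
capturing = re.findall(r'(http://|https://|www\.|//)[^"]+', s)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?:http://|https://|www\.|//)[^"]+', s)

print(capturing)      # ['http://']
print(non_capturing)  # ['http://o.aolcdn.com/os/moat/prod/p5.v1e.swf']
```

This is why a regex checker that highlights the full match can disagree with what re.findall hands back: the match is the same, but findall reports the group when the pattern contains one.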

Fabricator
  • According to http://regex101.com this does not match an example such as www.asdfs.com.gif", and in addition it lacks the \/\/ portion. Another issue I was having was that it matched the // of https:// and went straight to //... my something. I want it to select the https before the //, so that the URL https://www.myurl.com" is selected instead of //www.myurls.com", because in a webpage's source those are two different URLs. For example, //myotherurl/asdfs" is actually http://www.theurlImOn/myotherurl/asdfs" – Tai Jun 05 '14 at 00:59
  • can you provide the test cases? the earlier one only works for the one you provided – Fabricator Jun 05 '14 at 01:03
  • I'm sorry I edited my above post. The issue is the //. I need it to not cut off the http and use the // only. Some urls have //something, and others are http://something. The issue is //urls are different than http ones. // ones must be concatenated onto the base url to be readable. – Tai Jun 05 '14 at 01:11
  • @TaiHirabayashi, I'm still unsure about your exact request. can you provide some test strings, and what the matches should be? – Fabricator Jun 05 '14 at 01:14
  • For an example: str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattribute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0} -- The code pulled "http://" – Tai Jun 05 '14 at 01:18
  • Ok: so for the above string I get "http://" instead of http://something when I run it in PyDev; however, a regex checker states that it should work. Which is strange. Any ideas? – Tai Jun 05 '14 at 01:25
  • @TaiHirabayashi, I see. looks like it only returned what's captured in (). I'll make the fix – Fabricator Jun 05 '14 at 02:07
  • Thanks for fixing that; however, I am still not pulling //url.com", for example – Tai Jun 05 '14 at 18:13
  • I believe this works as a solution: (?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+ Thank you very much for your help – Tai Jun 05 '14 at 18:15

# Using your example:

s = "{\"www.asdfs.com.gif\",'var' >a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute('www:'\"data\",\"http://.o.aolcdn.com/os/moat/>prod/p5.v1e.swf\"}"



print  re.findall('\"(www.*?|\w+\:\/\/.*?)\"',s)

['www.asdfs.com.gif', 'http://.o.aolcdn.com/os/moat/>prod/p5.v1e.swf']
Padraic Cunningham
  • This does not find all of the cases in regex101 tester that I'm using. It finds http.... but not //asdfsd.com for example – Tai Jun 05 '14 at 18:12
  • It finds any www, http, or https URL in the same format as your example – Padraic Cunningham Jun 05 '14 at 18:23
  • In my example I needed to handle //url.blah also and ensure that http:// was handled before // which was what made the question difficult for me. Thank you for your help. I believe user3678068 provided the solution. – Tai Jun 05 '14 at 18:31

In order to make sure this is easy to read, I'm posting the solution here so it isn't nested in the comments. The correct solution is:

            links = re.findall('(?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+', obj)

Many thanks to user3678068 for helping solve this.

For something like this:

str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattribute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0}

I receive:

str: http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\
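
As a rough illustration of why http:// wins over the bare // here: the regex engine scans left to right, so the match starts at the "h" of http:// and the // inside it is already consumed by the time the engine moves on. The sketch below uses a hypothetical input with all three URL shapes. It also excludes the backslash from the character class, which is my own variation (not part of the solution above) to drop the trailing \ seen in the output, and it shortens http:// and https:// to https?://.

```python
import re

# Variation on the pattern above: https?:// covers both schemes, and the
# class [^"\\] stops at a backslash as well as at a double quote.
# Excluding the backslash is an assumption added for this sketch.
URL_RE = re.compile(r'(?:https?://|www\.|//)[^"\\]+')

# Hypothetical input containing literal backslash-escaped quotes.
text = (r'a(\"data\",\"http://o.aolcdn.com/p5.v1e.swf\");'
        r'b(\"src\",\"//cdn.example.com/x.swf\");'
        r'c(\"www.example.com/y.swf\")')

links = URL_RE.findall(text)
for link in links:
    print(link)
# http://o.aolcdn.com/p5.v1e.swf
# //cdn.example.com/x.swf
# www.example.com/y.swf
```

If a URL could legitimately contain a backslash, the original [^"]+ class followed by stripping a trailing backslash would be the safer choice.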
Tai

Though the accepted answer might have been sufficient for you, my best suggestion would be to use urlparse (https://docs.python.org/2/library/urlparse.html for Python 2.x and https://docs.python.org/3.0/library/urllib.parse.html for Python 3.x).

It takes care of all types of protocols, deals with the complete HTTP URL spec, gives output in an easily usable form, and you don't have to reinvent the wheel!
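
A minimal sketch of how that could look in Python 3, assuming the candidate strings have already been extracted (urlparse does not search text for URLs by itself). The base URL, the candidate list, and the normalise helper are all hypothetical names made up for this example. Note that urljoin follows the standard resolution rules: a link starting with // inherits only the scheme of the page, while a link starting with a single / inherits the host as well.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical URL of the page the links were scraped from.
base = "http://www.myurl.com/page.html"

# Hypothetical candidates, one per URL shape discussed in the question.
candidates = [
    "http://o.aolcdn.com/os/moat/prod/p5.v1e.swf",
    "//cdn.example.com/x.swf",
    "www.example.com/y.swf",
]

def normalise(link, base_url):
    """Return an absolute URL for a scraped link (illustrative helper)."""
    if link.startswith("www."):
        # A bare www. link has no scheme; urljoin would treat it as a
        # relative path, so mark it as a host by prefixing //.
        link = "//" + link
    return urljoin(base_url, link)

resolved = [normalise(c, base) for c in candidates]
for url in resolved:
    print(url, urlparse(url).netloc)

# Scheme-relative versus host-relative links resolve differently.
scheme_relative = urljoin(base, "//blah/x")
host_relative = urljoin(base, "/blah/x")
```

Here scheme_relative comes out as http://blah/x and host_relative as http://www.myurl.com/blah/x, which is worth checking against how the site in question actually uses its // links.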

Dipan Mehta