Before you automatically mark me down or assume that this question was asked without research, please read my post first. I believe that this is a slightly more difficult problem than it appears. Edit: this may be a PyDev problem, as regex checkers state that the solution(s) should work.
I looked online but could only find articles covering how to match a string followed by one of several characters (x, y, z), for example "python's re: return True if regex contains in the string", where in order to find bar, bad, or baz you simply need to do: ba[d|r|z].
I am currently pulling in a website's source code and analyzing it. I pull in each inner section of the code that contains a relevant URL (.swf). It might look like: { my variable .... my other stuff.... my url.swf my other url... etc }
I have these successfully pulled in. Admittedly I am new to Python (primarily Java, ActionScript, and JavaScript in the past). What is unique about my issue is that the formatting of the URLs varies a good deal.
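A minimal sketch of how a character class differs from alternation between strings, assuming Python's standard `re` module. The word list is made up for illustration. Inside `[...]` the `|` is an ordinary character, so `ba[d|r|z]` also accepts `ba|`; the usual spellings are `ba[drz]` for single characters or a group such as `ba(?:d|r|z)` for whole alternatives.

```python
import re

# Hypothetical test words, including one that exposes the literal "|".
words = ["bar", "bad", "baz", "ba|", "bat"]

# Inside [...] every character is literal, so "|" is just one more allowed character.
char_class = re.compile(r"ba[d|r|z]")
# Alternation between alternatives belongs in a (non-capturing) group instead.
alternation = re.compile(r"ba(?:d|r|z)")

class_hits = [w for w in words if char_class.fullmatch(w)]
alt_hits = [w for w in words if alternation.fullmatch(w)]

print(class_hits)
print(alt_hits)
```

The same grouping syntax extends to multi-character alternatives such as `(?:http|www)`, which a character class cannot express.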
I can look for a URL that starts with http, https, or www like this:
(http|https|www)[^"]+
or something similar. However, a website's source may contain URLs like this: "//blah", which actually means: http://www.myurl.com/blah. I have solved interpreting and correcting this by concatenating it together later.
My issue is reliably finding the URLs within a string because of the "//" prefix.
Essentially I am looking for a way to regex match strings instead of characters, but in an "or" manner with preference given to the earlier strings. For example, I know I can match "http" and only look for strings with that, doing each prefix individually, e.g. http[^\'|;|,|(|)|{|}|=]\" www.[^\'|;|,|(|)|{|}|=]\" etc. However, I'd prefer to do it in one line and resolve the issue of http://... and //... being picked up as different matches, or of my code simply knocking off the http:, because that changes how the URL is interpreted.
They can start with http, https, www, or // (I lowercased the string that I am comparing to), and I've determined that they end with a " in nearly every case.
So my regex looks like this:
(http|https|www|//)[^\'|;|,|(|)|{|}|=]*\"'
However, it's not currently working.
I don't know how to "or" strings; everywhere I look, the examples only cover characters. I've tried encasing the strings in () within a [] etc., but to no avail.
Oh, and I'm using Python.
An example of what I'm using as the original text would be:
str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattribute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0}
Which doesn't give me the desired result.
Instead of pulling only http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\ I pull:
(for each element in the array), (the string)
00 str: =\\"moatpx\\"
01 str: (\\"object\\"
02 str: (\\"data\\"
03 str: http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\\"
04 str: (\\"id\\"
05 str: (\\"name\\"
06 str: (\\"style\\"
07 str: (\\"width\\"
08 str: (\\"height\\"
09 str: t+\\"\\"
10 str: shvars\\"
11 str: wmode\\"
12 str: transparent\\"
13 str: wscriptaccess\\"
14 str: ways\\"
15 str: (\\"div\\"
16 str: =\\"moatpxdiv\\"
17 str: =\\"0px\\"
18 str: =\\"0px\\"
Thank you!
Edit: If you feel that my regex is not accurate and needs to be fixed, here are my requirements: It must match a string starting with http, https, www, or //. It must prefer http/https to www, and www to //. It must end at the first " it comes in contact with. It must also be a normal URL; it may not contain commas (,), semicolons (;), etc.
Test cases (I had to add spaces so it would not count as a URL, due to the Stack Overflow limit):
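A rough sketch of one way these requirements could be expressed, not a definitive pattern. The `URL_RE` name, the `example.com` URLs, and the exact set of excluded characters are assumptions for illustration. The regex engine takes the leftmost match and tries alternatives left to right, so an `http://` URL is matched from the `http` and its inner `//` is never reported separately.

```python
import re

# Hypothetical pattern: a prefix (scheme, "www." or "//") followed by characters
# that are not quotes, backslashes, whitespace, or the punctuation listed above.
URL_RE = re.compile(r"""(?:https?://|www\.|//)[^"',;(){}=\s\\]+""")

# Made-up sample in the same escaped style as the page source (backslash + quote).
sample = (
    r'b.setattribute(\"data\",\"http://o.aolcdn.com/os/moat/prod/p5.v1e.swf\");'
    r'c.src=\"//cdn.example.com/a.swf\";'
    r'd=\"www.example.com/b.swf\"'
)

# The group is non-capturing, so findall returns each whole match.
urls = URL_RE.findall(sample)
for url in urls:
    print(url)
```

Excluding the backslash in the character class is a guess that the `\"` sequences are literally present in the string; if the backslashes only appear in the debugger display, stopping at `"` alone would be enough.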
str: {var a=\"moatpx\"+s,b=y.createelement(\"object\");b.setattribute(\"data\",\"http: //o. aolcdn.com/os/moat/prod/p5.v1e.swf\");b.setattribute(\"id\",a);b.setattribute(\"name\",a);b.setattribute(\"style\",v);b.setattribute(\"width\",e+\"\");b.setattribute(\"height\",t+\"\");d(b,\"flashvars\",k);d(b,\"wmode\",\"transparent\");d(b,\"bgcolor\",\"\");d(b,\"allowscriptaccess\",\"always\");var a=\ny.body,c=y.createelement(\"div\");c.id=\"moatpxdiv\"+s;c.style.width=\"0px\";c.style.height=\"0px\";a.insertbefore(c,a.firstchild);c.appendchild(b);return!0}
With regex:
(http:\/\/|https:\/\/|www\.|\/\/)[^"]+
it matched "http://" instead of the full URL.
The regex101 checker states that it should work; however, in my code it does not.
My code: links = re.findall('(http://|https://|www.|//)[^"]+', obj), where obj is the above code block, returns links = ["http://"].
This is in PyDev, and I'm looking through the debugger.
Solution as of now:
(?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+
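A minimal sketch of why the non-capturing group changes the result, assuming the standard `re` module. The `obj` string here is a shortened, hypothetical stand-in for the real page snippet. When a pattern contains a capturing group, `re.findall` returns the group's text rather than the whole match, which would explain getting only "http://" back.

```python
import re

# Hypothetical, shortened stand-in for the page snippet held in obj.
obj = 'b.setattribute("data","http://o.aolcdn.com/os/moat/prod/p5.v1e.swf");'

# One capturing group: findall returns only what the group captured.
with_group = re.findall(r'(http://|https://|www\.|//)[^"]+', obj)

# Non-capturing group (?:...): findall returns the whole match.
without_group = re.findall(r'(?:http://|https://|www\.|//)[^"]+', obj)

# finditer exposes the whole match via group(0) even with a capturing group.
full_matches = [
    m.group(0)
    for m in re.finditer(r'(http://|https://|www\.|//)[^"]+', obj)
]

print(with_group)
print(without_group)
print(full_matches)
```

Online checkers typically highlight the full match, which may be why the pattern looked correct there while `findall` returned only the prefix.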