
I've been trying to get my regex to match a wide variety of download links and have narrowed it down to the following.

About 90% of download links start with either " or ' or http and end at " or ' or .exe. Three examples of this

Now the annoying part is that I whipped up two regexes that cover this 90%, but there has to be a way to do it with just one line of code. The only thing the user should need to change is the file extension they are looking for.

I tried $ anchoring, but I'm not a regex expert so I couldn't get it to work. I tried to start the match at the first .exe occurrence and then work back to the very first " or ' or http that comes before that first .exe occurrence. Yes, links do start with href= followed by " or ', however you can get href= and I don't know how to account for that. Plus, for some download links you don't want the match to start from the href=, and not all of them start with http.

Example

href="/bouncer?t=http%3A%2F%2Fdownload.portableapps.com%2Fportableapps%2Ffoxitreaderportable%2FFoxitReaderPortable_4.2.paf.exe">

The two regexes I have that cover 90% of situations are

`["']([^"']+(\.zip|\.rar|\.7z))` and `(http[^"']+(\.zip|\.rar|\.7z))`
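One way the two patterns above could be folded into a single one is to make the starting point an alternation: either consume a quote, or assert (without consuming) that the text begins with http. The sketch below uses Python's `re` module purely for illustration, since Ketarin uses .NET regex; the pattern syntax used here is common to both, but it has not been checked inside Ketarin. The helper name `build_link_pattern` and the sample URLs are made up for the example.

```python
import re

def build_link_pattern(extensions):
    """Merge the quoted and http variants into one pattern (hypothetical helper)."""
    alternation = "|".join(re.escape(ext) for ext in extensions)
    # Start either by consuming a quote, or at a literal "http" (lookahead only).
    # [^"']+ keeps the match from running across a quote character.
    return re.compile(
        r"""(?:["']|(?=http))([^"']+\.(?:%s))""" % alternation,
        re.IGNORECASE | re.DOTALL,  # roughly IgnoreCase and Singleline
    )

pattern = build_link_pattern(["zip", "rar", "7z"])

# Made-up sample inputs: one quoted relative link, one unquoted absolute link.
quoted = pattern.search('<a href="/files/tool.zip">')
bare = pattern.search("<a href=http://example.com/tool.7z>")

print(quoted.group(1))
print(bare.group(1))
```

Only the list passed to `build_link_pattern` changes when a different extension is wanted. Note that for a quoted link the capture still starts right after the quote, exactly as the first of the two original patterns does.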

EDIT: This is used in a program called Ketarin, which parses the HTML for me and returns the page source, on which I can then use the regex. I have found that Ketarin processes regex with the Singleline and IgnoreCase options.

This flavor of regex treats the entire block of text as a single line, so the . character also matches \r\n.

That aside, does anyone know how to start the regex match from the end of the string and work back to the first " ' or http found? The closest I got was

$?[^"']*.exe

But I'm not sure how to include http as an OR alternative in that match.
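The "find the extension, then walk back to the nearest quote or http" idea can be sketched procedurally. This is only an illustration in Python, which has no right-to-left matching mode; in .NET the `RegexOptions.RightToLeft` option may be a more direct route, but that is not shown or verified here. The function name `link_before_extension` is hypothetical, and the sample input is the bouncer link from the question wrapped in an `<a>` tag.

```python
import re

def link_before_extension(html, extension=".exe"):
    """Return text from the nearest quote or 'http' up to the first extension hit."""
    hit = re.search(re.escape(extension), html, re.IGNORECASE)
    if hit is None:
        return None
    head = html[:hit.start()]
    # Walk all quote/http occurrences before the extension; keep the last one,
    # which is the first one met when reading backwards from the extension.
    last = None
    for last in re.finditer(r"""["']|http""", head, re.IGNORECASE):
        pass
    if last is None:
        return None
    # A quote is excluded from the result, while "http" is included.
    start = last.end() if last.group() in "\"'" else last.start()
    return html[start:hit.end()]

sample = ('<a href="/bouncer?t=http%3A%2F%2Fdownload.portableapps.com'
          '%2Fportableapps%2Ffoxitreaderportable%2FFoxitReaderPortable_4.2.paf.exe">')
print(link_before_extension(sample))
```

For the bouncer link this yields the part starting at `http%3A`, because that http is closer to the extension than the opening quote is.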

user547373
  • It looks like you are trying to parse HTML with regular expressions. If so, it might be better to use a HTML parser. – Mark Byers Dec 19 '10 at 00:43
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Ignacio Vazquez-Abrams Dec 19 '10 at 00:46
  • Question 1732348 doesn't really apply. He's trying to scrape something very specific from an HTML file. A regex is a reasonable tool (but maybe not the best tool due to the potential for false positives) for that. – RobertB Dec 19 '10 at 01:03

2 Answers


`/href[\=][\"]((.*)([.]exe))[\"]/` Try this using a group match (or the scan method if you are using Ruby).

kellpossible

EDIT: Sorry, I based this on something that did work, hoping it would work here... anyway:

(?<=href=").+?\.(your|extensions|here)

Hope this one does help. Put your desired extensions in the group, separated by | [like (exe|rar|zip|...)]

Good Luck

Machinarius
  • I'm not getting any hits with either of those regexes. For reference, the program uses .NET Regex. However, I'm ecstatic that people are actually helping me out, a nice change of pace for sure. – user547373 Dec 19 '10 at 09:04
  • Weird, this is just a modified Regex query that only worked for jpg files :S Will re-do it out of curiosity :) – Machinarius Dec 19 '10 at 13:43
  • `(http|ftp).+?\.(exe|css|gif|png|jpg)` Use this one if you want control over the protocol.... but the one above also gives you links based on the current page you are on, this one does not. – Machinarius Dec 19 '10 at 14:17
  • (?<=href=").+?\.(zip) <- this one finds the first href on the page all the way to the very first .zip file and captures only the .zip part. (http|ftp).+?\.(zip) <- this one captures the very first http on the page all the way to the first .zip. Either way they both return unusable values. – user547373 Dec 19 '10 at 17:29
  • If you want only ONE extension then just make it: `(http|ftp).+?\.zip` – Machinarius Dec 19 '10 at 17:52
  • Nope, that only captures the first http and nothing else. If it helps the program only allows ONE match, it doesn't return multiples. I'm not sure if that helps at all though. – user547373 Dec 19 '10 at 17:56
  • I was trying to get something like this to work: $*([^"']+\.zip). It does work, but I also want it to handle http or the very last / found. The code I have there only works for " and '. I was hoping I could add a word grouping in there, but I don't know how; something like [^"':(http)], so that it includes the http if it is there but excludes the " and '. – user547373 Dec 19 '10 at 17:58
  • I do not know why it doesn't work for you... In my own tests, this regex `(http|ftp).+?\.gif` matches all the linked GIF files inside an HTML file, so it should do the same for ZIPs and the like. May I see what you are testing this regex against? – Machinarius Dec 19 '10 at 19:08
  • When I use that regex in a tester, it returns everything perfectly fine. It may just be how this program uses regex, I'm not sure. It is just being used on any page source with a download link; again, it may just be its implementation. Regardless, it will only match 'one' item, not repeats or multiples. The program also allows "Right To Left" regex and is .NET based, if ANY of that helps. – user547373 Dec 20 '10 at 07:23
  • I have found that Ketarin processes regex in this fashion, Singleline and IgnoreCase. Does this help? – user547373 Dec 20 '10 at 07:32
  • I don't know then... you will have to ask the devs for some light :S – Machinarius Dec 20 '10 at 12:05
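The behaviour reported in the comments (the match starting at the very first http on the page) is consistent with how regex engines search: they scan left to right and return the leftmost match, and a lazy `.+?` only makes the end of that match as early as possible, not the start as late as possible. When a tool keeps only one match, that first match is all you see. A minimal Python sketch of the effect, and of one possible remedy (a character class that cannot cross quotes, whitespace, or angle brackets), follows. The page snippet is invented for the example, and Python stands in for .NET here.

```python
import re

# Invented page source: an ordinary link followed by the download link.
page = ('<a href="http://example.com/about.html">About</a> '
        '<a href="http://example.com/files/tool.zip">Get</a>')

flags = re.IGNORECASE | re.DOTALL  # roughly IgnoreCase and Singleline

# Lazy dot: starts at the first http and runs on until the first ".zip".
lazy = re.search(r"(http|ftp).+?\.zip", page, flags)

# Bounded class: an attempt starting at the first link cannot reach ".zip"
# without crossing a quote, so the engine moves on to the second link.
bounded = re.search(r"""(http|ftp)[^"'\s<>]+?\.zip""", page, flags)

print(lazy.group(0))
print(bounded.group(0))
```

The first result spans both links, while the second is limited to the URL that actually ends in `.zip`.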