
I'm finding it really difficult to write a regex (Rubular syntax) that I can use with our crawler to pull all the URLs that end with the word 'download'. Could you please help? Thanks so much!

Here are the URLs to match

https://www.example.com/folder1/download
https://www.example.com/folder1/download/
https://www.example.com/folder1/folder2/download?cmp=abc

Notes:
i. There can be more than one folder before the ending word.
ii. The ending word can have a query string attached to it, or a trailing forward slash.
iii. The URLs are mostly relative URLs, but it would be even better if the regex also matched absolute URLs, URLs with no protocol specified, and URLs with or without the www part.

Ex.
<a href="/product-category/product-name/download">Download Tool</a>
Or
<a href="https://www.example.com/product-category/product-name/download">Download Tool</a>
Or
<a href="http://www.example.com/product-category/product-name/download">Download Tool</a>
Or
<a href="www.example.com/product-category/product-name/download">Download Tool</a>
Or
<a href="example.com/product-category/product-name/download">Download Tool</a>

Although most of the above would end up in a 301 redirect or can't be considered valid URLs, it would still be great to surface such anomalies as part of this crawl.
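For reference, here is a sketch of one pattern that covers the example hrefs above, assuming the crawler scans raw HTML and supports standard (PCRE-style) capture groups. The Python harness and the sample lines are only for testing the idea; they are not part of any crawler setup:

```python
import re

# Sketch of a pattern (an assumption of what the crawler needs):
# capture only the href value whose path ends in the word "download",
# optionally followed by a trailing slash or a query string.
pattern = re.compile(
    r'href="('              # start capturing just the URL, not the HTML around it
    r'[^"]*/download'       # any path (relative or absolute) ending in /download
    r'(?:/|\?[^"]*)?'       # optional trailing slash or ?query-string
    r')"'
)

samples = [
    '<a href="/product-category/product-name/download">Download Tool</a>',
    '<a href="https://www.example.com/folder1/download/">Download Tool</a>',
    '<a href="https://www.example.com/folder1/folder2/download?cmp=abc">Download Tool</a>',
    '<a href="www.example.com/product-category/product-name/download">Download Tool</a>',
    '<a href="/somewhere/downloads">should NOT match</a>',
]
for s in samples:
    m = pattern.search(s)
    print(m.group(1) if m else "no match")
```

Because the URL sits in capture group 1, only the URL (not the surrounding HTML) shows up as the match. The last sample shows that "downloads" is rejected, because the pattern requires the word to be followed by a closing quote, a slash, or a query string.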

Crawler background:
This is the regex settings tab: https://www.screencast.com/t/LJsKobubg3
This is one of the custom crawls I managed to run in the past using regex, with the help of the Dev team (who are unreachable now): https://www.screencast.com/t/9mT2pSoP7sI
This is how the end result would look: https://www.screencast.com/t/MC5MNaJXi

The end result is a spreadsheet that shows all the source pages plus the URL matches.

I was given a regex, but it doesn't match the relative URLs and also pulls all the surrounding HTML text into the end result report, not only the URL: https://regex101.com/r/5nHp8s/1

Once again thanks so much for helping me.

Iam_Amjath
  • http://regex101.com is where you should start. Regular expressions beyond trivial examples will be hard to design for you, because there will always be one more requirement. –  May 29 '18 at 14:37
  • Thanks. I've added more details now, could you please help? I use regex101.com for testing the regex and I'm not in that level yet to create this on my own, so would much appreciate if you could help me with it. – Iam_Amjath May 29 '18 at 17:42
  • Like I said, this can get complicated fast. Start simple, like `\/.*\/(download).*$` or `\/(download).*$`. For example, we still don't know if you want capture group(s). Matching on a word is easy. It's the other stuff that gets hard. –  May 29 '18 at 17:53
  • Thanks for the heads-up. Like I said, I'm a complete novice when it comes to regex. I need to give it serious thought and invest some time learning it, but this request is very urgent, so any help would be really great. I'm not sure what you meant by 'want capture group(s)'. Did you mean more than one instance on a crawled webpage? If so, yes, that's correct. There's an option within the crawler tool to select, for example, 'first 5 matches'. Thanks – Iam_Amjath May 29 '18 at 18:22
  • So, regular expressions are used for matching things. But what do you want to do when something matches? Do you want to extract part of the match for use elsewhere? Keep track of a count? A capture group helps with the latter. If all you need to do is match any URL that ends in some variation of ".../download...", then something like `^.*\/[Dd]ownload.*$` should work. Adjust as necessary. –  May 29 '18 at 18:38
  • https://regex101.com/r/EV6Hfk/1 –  May 29 '18 at 18:43
  • Possible duplicate of [parse URL with regex in python](https://stackoverflow.com/questions/10009523/parse-url-with-regex-in-python) – Brett7533 May 31 '18 at 06:43
  • Use `urlparse`, not regex! Note that `urlparse` moved to `urllib.parse` in Python 3 – Brett7533 May 31 '18 at 06:45
  • Thanks. But I'm not really sure what that means. All I'm looking for is a single line of regex to check for URL matches during the website crawl and then pull them into the report, just like this single line of regex pulled all the Vidyard URLs at the end of the web crawl: https://www.screencast.com/t/9mT2pSoP7sI – Iam_Amjath May 31 '18 at 20:00
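To illustrate the capture-group point raised in the comments, a short snippet follows. Python is used only for demonstration (the crawler's regex flavor may differ), and the sample line is made up:

```python
import re
from urllib.parse import urlparse

line = '<a href="/folder1/download?cmp=abc">Download Tool</a>'

# Without a capture group, the whole match includes the surrounding href="...".
whole = re.search(r'href="[^"]*/download[^"]*"', line)
print(whole.group(0))    # href="/folder1/download?cmp=abc"

# With a capture group, group(1) is just the URL itself.
grouped = re.search(r'href="([^"]*/download[^"]*)"', line)
url = grouped.group(1)
print(url)               # /folder1/download?cmp=abc

# The urlparse alternative from the comments: parse the extracted URL
# and test the path, ignoring any query string.
print(urlparse(url).path.rstrip('/').endswith('/download'))  # True
```

The first pattern reproduces the "surrounding HTML in the report" problem from the question; the grouped version is what pulls only the URL.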

0 Answers