0

I'm using Scrapy to scrape a web site, and I'm stuck on properly defining the rule for extracting links. Specifically, I need help writing a regular expression that allows URLs like:

https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352
https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108

while rejecting URLs like this one:

https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12

In other words, I want URLs that end with digits (i.e., /1352 in the example above), but not URLs where those digits are followed by anything else (i.e., /12 in the example above).

I am by no means an expert in regular expressions, and I could only come up with something like \/(\d+)$, or even ^https:\/\/discuss.dwolla.com\/t\/\S*\/(\d+)$, but both fail to exclude the unwanted URLs, since they simply match whatever digits come last in the address.

--- UPDATE ---

Sorry for not being clear in the first place. This addition is to clarify that the digits at the end of the URLs can change, so the /1352 is not fixed. As such, another example of a URL to be accepted is:

https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108

bateman

2 Answers

2

This is probably the simplest way:

[^\/\d][^\/]*\/\d+$

or to restrict to a particular domain:

^https?:\/\/discuss.dwolla.com\/.*[^\/\d][^\/]*\/\d+$

See live demo.

This regex requires the last path segment to be all digits, and the second-to-last segment to contain at least one non-digit character.
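As a quick sanity check, the domain-restricted pattern can be run against the URLs from the question with Python's `re` module. This is only a sketch: the name `is_topic_url` is made up for illustration, the dots in the host name are escaped here (a small tightening that is not in the pattern above), and `/` is left unescaped because Python does not require escaping it.

```python
import re

# Domain-restricted pattern from this answer, written as a Python raw string.
# Assumption: dots in the host name are escaped so they match literally.
TOPIC_URL = re.compile(r"^https?://discuss\.dwolla\.com/.*[^/\d][^/]*/\d+$")

def is_topic_url(url):
    """Return True if the URL ends in an all-digit segment that follows
    a segment containing at least one non-digit character."""
    return TOPIC_URL.search(url) is not None

wanted = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352",
    "https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180",
    "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108",
]
unwanted = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12",
]

for url in wanted + unwanted:
    print(is_topic_url(url), url)
```

In Scrapy, a pattern like this would typically be passed as the `allow` argument of a `LinkExtractor` inside a crawl rule; that wiring is not shown here.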

Bohemian
  • Thanks for your help. Saw the live demo. Your regex does not seem to work with this one though. `https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180`. Is it because it contains numbers in the middle, other than letters and dashes? See [here](http://rubular.com/r/2Bd65ei3If), I've added more examples to your live demo. – bateman Jul 06 '15 at 17:17
  • @bateman I see. Hopefully that last edit is more to your liking (new live demo link too). Thanks for making the job easier by augmenting the demo and posting the new link. – Bohemian Jul 06 '15 at 17:30
  • Thanks! This seems to work now! Running the script and get back here right after. Cheers! – bateman Jul 06 '15 at 17:33
0

Here is a regex, written as a Java string literal (hence the doubled backslashes), that may fit your requirements. If you expect exactly N digits, replace the final + with {N}.

^https://discuss\\.dwolla\\.com/t/[\\w-]+/\\d+$
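
Since Scrapy is Python, here is a rough Python equivalent of the pattern above, including the optional fixed digit count. The helper name `topic_pattern` and its `n_digits` parameter are hypothetical, introduced only for this sketch.

```python
import re

def topic_pattern(n_digits=None):
    """Build the pattern; n_digits=None means 'one or more digits',
    otherwise the final segment must have exactly n_digits digits."""
    quantifier = "+" if n_digits is None else "{%d}" % n_digits
    return re.compile(r"https://discuss\.dwolla\.com/t/[\w-]+/\d" + quantifier)

any_digits = topic_pattern()
four_digits = topic_pattern(4)

ok = "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352"
short_id = "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108"
nested = ok + "/12"

# fullmatch anchors the pattern at both ends, like ^...$ in the Java version.
print(bool(any_digits.fullmatch(ok)))         # topic URL
print(bool(any_digits.fullmatch(nested)))     # URL with a trailing /12
print(bool(four_digits.fullmatch(short_id)))  # only three digits
```

Because `[\w-]+` cannot cross a slash, exactly one slug segment is allowed between `/t/` and the digits, which is what rules out the `/1352/12` form.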
Puneeth Reddy V