Regex to get ID from link URL

Question

I have links like this:

<div class="zg_title">
  <a href="https://rads.stackoverflow.com/amzn/click/com/B000O3GCFU" rel="nofollow noreferrer">Thermos Foogo Leak-Proof Stainless St...</a>     
</div>

And I'm scraping them like this:

  product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value

The problem is that it takes the whole URL and I want to just get the ID:

B000O3GCFU

I think I need to do something like this:

product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value[ReGEX_HERE]

What's the simplest regex I can use in this case?

EDIT:

Strange, the link URL above doesn't appear in full. The complete URL is:

http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1

You should not use regex: http://stackoverflow.com/a/1732454/2027232 — string.Empty, Aug 26 '13 at 07:24
@Nicolas Tyler: The regex is being used on an extracted string, where it is probably fine. Although passing it through URI.parse first: e.g. `URI.parse("http://rads.stackoverflow.com/amzn/click/B000O3GCFU").path` should extract an even easier string. — Neil Slater, Aug 26 '13 at 07:28
To use a regex, there has to be a common pattern. What is the common pattern in your URLs? Posting one URL does not in any way help identify a pattern. — 7stud, Aug 26 '13 at 07:34
@Neil Slater even if you get the URL alone there is no guarantee that the code in the link can be found with regex — string.Empty, Aug 26 '13 at 07:38
@Nicolas Tyler: True, but there is also not really such a thing as a parser for meaningful directory names in a path. Best you can do with parsers is split the path into components. Still got to identify the correct component. This may be formalised as e.g. third part of path in Amazon's documentation, in which case yes you could forgo regex. — Neil Slater, Aug 26 '13 at 08:14
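
Neil Slater's suggestion above, passing the URL through URI.parse before applying a regex, could look roughly like the sketch below. The second URL, with its `?tag=example` query string, is a made-up example and not from the question. It is there only to show that `URI#path` leaves the query string out, so a trailing-word regex applied to the path is not thrown off by it.

```ruby
require 'uri'

# short_url comes from the comment above; query_url is a hypothetical
# variant with a query string appended, used only for illustration.
short_url = 'http://rads.stackoverflow.com/amzn/click/B000O3GCFU'
query_url = 'http://rads.stackoverflow.com/amzn/click/B000O3GCFU?tag=example'

short_path = URI.parse(short_url).path
query_path = URI.parse(query_url).path

p short_path          # => "/amzn/click/B000O3GCFU"
p short_path[/\w+$/]  # => "B000O3GCFU"
p query_path[/\w+$/]  # => "B000O3GCFU"
p query_url[/\w+$/]   # => "example" (regex on the raw URL picks up the query value)
```

This only helps when the ID is the last path segment. For the full Amazon URL in the EDIT, the correct segment still has to be identified within the path.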

falsetru · Accepted Answer · 2013-08-26T07:30:22.667

3

Use /\w+$/:

p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]

/\w+$/ matches the run of word characters (letters, digits, and _) at the end of the string.

require 'nokogiri'

s = <<EOF
<div class="zg_title">
  <a href="http://rads.stackoverflow.com/amzn/click/B000O3GCFU">Thermos Foogo Leak-Proof Stainless St...</a>     
</div>
EOF

doc = Nokogiri::HTML(s)
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
# => "B000O3GCFU"

edited Aug 26 '13 at 07:30

answered Aug 26 '13 at 07:20

falsetru


Thanks but the result was: `products_1` – alexchenco Aug 26 '13 at 07:29
@alexchenco, Using the given html, I got `B000O3GCFU`. – falsetru Aug 26 '13 at 07:30
Sorry, I'm not sure what happened. But SO editor cut the last part. I added it in the **EDIT**. – alexchenco Aug 26 '13 at 07:32
@alexchenco, Try `/(?<=\/)[A-Z\d]{5,}/` instead. I assume that product name consists of uppercase letters and digits (at least 5 character long, preceded by `/`). – falsetru Aug 26 '13 at 07:33
@NicolasTyler, regexpal.com is Javascript-based. It does not support lookbehind. – falsetru Aug 26 '13 at 07:46
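
To make the comment thread above concrete, here is a small comparison of the two patterns on the two URLs that appear in this question. The third URL, whose product name starts with an all-caps word, is invented for illustration. It shows where the "uppercase letters and digits" assumption could break down.

```ruby
short_url = 'http://rads.stackoverflow.com/amzn/click/B000O3GCFU'
full_url  = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1'
# Hypothetical URL (not from the question) with an all-caps word in the product name.
caps_url  = 'http://www.amazon.com/THERMOS-Foogo/dp/B000O3GCFU'

trailing_word = /\w+$/                 # word characters at the end of the string
after_slash   = /(?<=\/)[A-Z\d]{5,}/   # 5+ uppercase letters/digits right after a "/"

p short_url[trailing_word]  # => "B000O3GCFU"
p full_url[trailing_word]   # => "products_1"
p short_url[after_slash]    # => "B000O3GCFU"
p full_url[after_slash]     # => "B000O3GCFU"
p caps_url[after_slash]     # => "THERMOS"
```

The last line suggests that the second pattern is only as reliable as the assumption about how product names are written, which is why anchoring on a fixed marker in the path may be safer.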

Lars Haugseth · Answer 2 · 2013-08-26T09:17:22.540

3

Given that the product code is always preceded by /dp/ and followed by a /:

url[/(?<=\/dp\/)[^\/]+/]

Or, perhaps more readable:

url[%r{(?<=/dp/)[^/]+}]

Alternatively, without using regular expressions:

parts = url.split('/')
parts[parts.index('dp') + 1]
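
One difference between the two variants above is how they behave when the URL has no dp segment. The regex form returns nil, while the split form raises NoMethodError because `parts.index('dp')` is nil. A minimal sketch of a nil-safe version follows. The helper name `segment_after_dp` is made up for this example.

```ruby
# Hypothetical helper (not part of the answer): returns the path segment
# that follows "dp", or nil when there is no such segment.
def segment_after_dp(url)
  parts = url.split('/')
  dp_index = parts.index('dp')
  dp_index && parts[dp_index + 1]
end

full_url  = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1'
short_url = 'http://rads.stackoverflow.com/amzn/click/B000O3GCFU'

p segment_after_dp(full_url)     # => "B000O3GCFU"
p segment_after_dp(short_url)    # => nil
p short_url[%r{(?<=/dp/)[^/]+}]  # => nil (the regex form is already nil-safe)
```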

edited Aug 26 '13 at 09:17

answered Aug 26 '13 at 08:18

Lars Haugseth


score 0 · Answer 3 · answered Aug 26 '13 at 08:24

An approach based on available parsers (to please Nicolas Tyler or anyone else who would rather avoid regex for parsing in this sort of case):

require 'uri'

product_uri = product.xpath('//div[@class="zg_title"]/a/@href').first.value
# e.g. http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1

product_path = URI.parse( product_uri ).path.split('/')
# => ["", "Thermos-Foogo-Leak-Proof-Stainless-10-Ounce", 
#     "dp", "B000O3GCFU", "ref=zg_bs_baby-products_1"]

# This relies on the (un-researched) assumption that the ID's location in the
# path is consistent. Now that we have the components, though, we can look at
# Amazon's documentation and select based on position in the path, position
# relative to some other identifier, etc., without risk of a regex mismatch.

# Index 3, because index 0 is the empty string before the leading "/"
product_asin = product_path[3]
# => "B000O3GCFU"
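
As a sketch of the "relative position from some other identifier" idea mentioned in the comments above, the component after "dp" can be selected and then checked against an expected shape. The helper name `asin_from_product_url` and the shape check (exactly 10 uppercase letters or digits) are assumptions made for this example and are not taken from Amazon's documentation. The last two URLs are invented as well.

```ruby
require 'uri'

# Hypothetical helper: returns the path segment after "dp" when it looks
# like an ASIN, otherwise nil. The 10-character shape is an assumption.
def asin_from_product_url(url)
  segments = URI.parse(url).path.split('/')
  dp_index = segments.index('dp')
  return nil unless dp_index

  candidate = segments[dp_index + 1]
  candidate if candidate && candidate.match?(/\A[A-Z0-9]{10}\z/)
end

full_url = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1'

p asin_from_product_url(full_url)                                        # => "B000O3GCFU"
p asin_from_product_url('http://www.amazon.com/Some-Product/dp/oops/x')  # => nil
p asin_from_product_url('http://www.amazon.com/gp/bestsellers')          # => nil
```

Note that URI.parse raises URI::InvalidURIError for strings it cannot parse, so scraped hrefs may need a rescue around the call.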
