
I have links like this:

<div class="zg_title">
  <a href="https://rads.stackoverflow.com/amzn/click/com/B000O3GCFU" rel="nofollow noreferrer">Thermos Foogo Leak-Proof Stainless St...</a>     
</div>

And I'm scraping them like this:

  product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value 

The problem is that it takes the whole URL and I want to just get the ID:

B000O3GCFU

I think I need to do something like this:

product_asin = product.xpath('//div[@class="zg_title"]/a/@href').first.value[ReGEX_HERE]

What's the simplest regex I can use in this case?

EDIT:

Strangely, the link URL above doesn't appear in full. The complete URL is:

http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1
alexchenco
  • You should not use regex: http://stackoverflow.com/a/1732454/2027232 – string.Empty Aug 26 '13 at 07:24
  • @Nicolas Tyler: The regex is being used on an extracted string, where it is probably fine. Although passing it through URI.parse first, e.g. `URI.parse("http://rads.stackoverflow.com/amzn/click/B000O3GCFU").path`, should give an even easier string to work with. – Neil Slater Aug 26 '13 at 07:28
  • To use a regex, there has to be a common pattern. What is the common pattern in your URLs? Posting one URL does not in any way help identify a pattern. – 7stud Aug 26 '13 at 07:34
  • @Neil Slater: Even if you get the URL alone, there is no guarantee that the code in the link can be found with a regex. – string.Empty Aug 26 '13 at 07:38
  • @Nicolas Tyler: True, but there is also not really such a thing as a parser for meaningful directory names in a path. The best you can do with parsers is split the path into components; you still have to identify the correct component. This may be formalised in Amazon's documentation as, e.g., the third part of the path, in which case yes, you could forgo regex. – Neil Slater Aug 26 '13 at 08:14

3 Answers


Use /\w+$/:

p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]

/\w+$/ matches the trailing run of letters, digits, and underscores at the end of the string.


require 'nokogiri'

s = <<EOF
<div class="zg_title">
  <a href="http://rads.stackoverflow.com/amzn/click/B000O3GCFU">Thermos Foogo Leak-Proof Stainless St...</a>     
</div>
EOF

doc = Nokogiri::HTML(s)
p doc.xpath('//div[@class="zg_title"]/a/@href').first.value[/\w+$/]
# => "B000O3GCFU"
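
One caveat worth checking: `/\w+$/` only returns the ID when the ID is the last thing in the URL. The sketch below compares the shortened link with the full URL from the question's edit; the variable names are made up for illustration.

```ruby
# Shortened link, as used in the example above.
short_url = 'http://rads.stackoverflow.com/amzn/click/B000O3GCFU'
# Full URL, copied from the question's edit.
full_url  = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1'

# String#[] with a regex returns the matched text, or nil if nothing matches.
short_match = short_url[/\w+$/]
full_match  = full_url[/\w+$/]

p short_match  # => "B000O3GCFU"
p full_match   # => "products_1" (the tail of the ref= segment, not the ID)
```

So this pattern fits the shortened links, but the full Amazon URLs would need a pattern anchored on something else.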
falsetru

Given that the product code is always preceded by /dp/ and followed by a /:

url[/(?<=\/dp\/)[^\/]+/]

Or, perhaps more readable:

url[%r{(?<=/dp/)[^/]+}]

Alternatively, without using regular expressions:

parts = url.split('/')
parts[parts.index('dp') + 1]
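
As a quick check, the three variants above can be run against the full URL from the question's edit. This is only a sketch: the variable names are illustrative, and the assumption that the ID always follows a `dp` segment has not been verified against Amazon's documentation.

```ruby
# Full URL, copied from the question's edit.
url = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1'

# Lookbehind with escaped slashes.
asin_escaped = url[/(?<=\/dp\/)[^\/]+/]

# Same pattern written with %r{}, so the slashes need no escaping.
asin_percent = url[%r{(?<=/dp/)[^/]+}]

# No regex: take the segment right after "dp".
parts = url.split('/')
dp_index = parts.index('dp')
# Guard against URLs that have no "dp" segment (index returns nil there).
asin_split = dp_index && parts[dp_index + 1]

p asin_escaped  # => "B000O3GCFU"
p asin_percent  # => "B000O3GCFU"
p asin_split    # => "B000O3GCFU"
```

The regex forms return nil when the URL has no `/dp/` part, whereas the unguarded `parts[parts.index('dp') + 1]` would raise a NoMethodError, which is why the sketch adds the nil check.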
Lars Haugseth

An approach based on available parsers (to please Nicolas Tyler or anyone else who would rather avoid regex for parsing in this sort of case):

require 'uri'

product_uri = product.xpath('//div[@class="zg_title"]/a/@href').first.value
# e.g. http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1

product_path = URI.parse( product_uri ).path.split('/')
# => ["", "Thermos-Foogo-Leak-Proof-Stainless-10-Ounce", 
#     "dp", "B000O3GCFU", "ref=zg_bs_baby-products_1"]

# This relies on the (unresearched) assumption that the ID's position in the path is consistent.
# Now that we have the components, though, we can look at Amazon's documentation and
# select based on position in the path, position relative to some other identifier,
# etc., without the risk of a regex mismatch.

product_asin = product_path[3]
# => "B000O3GCFU"
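
The "relative position from some other identifier" idea mentioned in the comments could look roughly like the sketch below. `asin_from_path` is a hypothetical helper name, and treating `dp` as the marker segment is an assumption taken from the example URL, not from Amazon's documentation.

```ruby
require 'uri'

# Hypothetical helper: returns the path segment that follows "dp",
# or nil when the URL has no such segment.
def asin_from_path(url)
  segments = URI.parse(url).path.split('/')
  dp_index = segments.index('dp')
  dp_index ? segments[dp_index + 1] : nil
end

# Full URL, copied from the question's edit.
full_url = 'http://www.amazon.com/Thermos-Foogo-Leak-Proof-Stainless-10-Ounce/dp/B000O3GCFU/ref=zg_bs_baby-products_1'

p asin_from_path(full_url)
# => "B000O3GCFU"
p asin_from_path('http://rads.stackoverflow.com/amzn/click/B000O3GCFU')
# => nil (no "dp" segment in the shortened link)
```

Looking the segment up relative to `dp` avoids hard-coding an index, so it keeps working if the number of leading path segments changes.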
Neil Slater