I'm trying to extract some information from a page fetched with curl. With a simple grep I can extract the title:

grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'

But I would also like to extract the price of the product. Looking at the page source, I see it inside this script:

$(document).ready(function(){$('.sequra-product-price-js').text('27,62 €');$('.sequra-product-price-js').attr('content','27.62');$('.descuento_marca_producto').html

How can I extract the price?

Here is an example URL:

curl -k https://bulevip.com/es/pre-entreno/20927-cellucor-c4-original-pre-workout-390-gr-60-servicios.html

thanks!

Guif If
Your title extractor will fail on any HTML-encoded characters. For example, if the title is `<title>bl&aring;b&aelig;rsyltet&oslash;y</title>`, the correct translation is `blåbærsyltetøy` (Norwegian for `blueberry jam`), but your extractor will end up with `bl&aring;b&aelig;rsyltet&oslash;y`, completely unreadable. It will also fail if the title includes special characters like `&` or `<` or `>` or `^`. To get the correct translation, you could instead do: `php -r 'echo (@DOMDocument::loadHTML(stream_get_contents(STDIN)))->getElementsByTagName("title")->item(0)->textContent;'` – hanshenrik Aug 18 '19 at 08:18

1 Answer

php-cli's DOMDocument + DOMXPath can easily extract the price:

curl -ks https://bulevip.com/es/pre-entreno/20927-cellucor-c4-original-pre-workout-390-gr-60-servicios.html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo (new DOMXPath($d))->query("//span[contains(@class,\"product-price-js\")]")->item(0)->getAttribute("content");'

btw you should not parse HTML with regex.

for example, you say that you already have title extraction working with

grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'

but it is flawed: it will fail on any HTML-encoded characters. For example, if the title is <title>bl&aring;b&aelig;rsyltet&oslash;y</title>, the correct translation is blåbærsyltetøy (Norwegian for blueberry jam), but your extractor will end up with bl&aring;b&aelig;rsyltet&oslash;y, which is completely unreadable. It will also fail if the title includes special characters like & or < or > or ^. To get the correct translation, you could instead do:

php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'

which will correctly translate any html-encoded character :)

and if we put it to the test:

$ echo '<title>bl&aring;b&aelig;rsyltet&oslash;y</title>' > html

$ cat html | grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
bl&aring;b&aelig;rsyltet&oslash;y

$ cat html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'
blåbærsyltetøy

$

Or if the title is AT&T (the world's largest telecom company, from the USA), which must be encoded as AT&amp;T:

$ echo '<title>AT&amp;T</title>' > html

$ cat html | grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
AT&amp;T

$ cat html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'
AT&T
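
Back to the price: the value quoted in the question sits inside a JavaScript string, not in HTML markup, so if php-cli is not at hand a plain text match is a workable (if brittle) fallback. The sketch below is only that. It assumes the page keeps emitting the exact `.attr('content','…')` call shown in the question, and `extract_price` is a hypothetical helper name, not something the site or any tool provides.

```shell
#!/bin/sh
# Hypothetical helper: read page source on stdin and print the value passed to
# .attr('content','...') for the sequra-product-price-js element.
# Assumption: the inline script has exactly the shape quoted in the question.
extract_price() {
  sed -n "s/.*sequra-product-price-js')\.attr('content','\([0-9][0-9.]*\)').*/\1/p" | head -n 1
}

# Sample line copied from the question, used here instead of a live request.
cat <<'EOF' | extract_price
$(document).ready(function(){$('.sequra-product-price-js').text('27,62 €');$('.sequra-product-price-js').attr('content','27.62');$('.descuento_marca_producto').html
EOF
```

Against the live page this would be used as `curl -ks URL | extract_price`. It prints nothing if the script changes shape, which is the main reason to prefer the DOMXPath approach when PHP is available.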
hanshenrik