Extract price with curl command

Question

I am trying to extract some information from a page fetched with curl. With a simple grep I can extract the title:

grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'

but I would also like to extract the price of the product. Looking at the page source, I see the price inside this content:

$(document).ready(function(){$('.sequra-product-price-js').text('27,62 €');$('.sequra-product-price-js').attr('content','27.62');$('.descuento_marca_producto').html

How can I extract the price?

This is an example URL:

curl -k https://bulevip.com/es/pre-entreno/20927-cellucor-c4-original-pre-workout-390-gr-60-servicios.html

thanks!

your title extractor will fail on any HTML-encoded characters. for example, if the title is `<title>bl&aring;b&aelig;rsyltet&oslash;y</title>`, the correct decoded text is `blåbærsyltetøy` (norwegian for `blueberry jam`), but your extractor will end up with `bl&aring;b&aelig;rsyltet&oslash;y`, which is completely unreadable. it will also fail if the title includes special characters like `&` or `<` or `>`, which appear in the source as `&amp;`, `&lt;` and `&gt;`. to get the correctly decoded text, you could instead do: `php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'` - hanshenrik, Aug 18 '19 at 08:18

hanshenrik · Accepted Answer · 2019-08-18T08:39:46.000

php-cli's DOMDocument + DOMXPath can easily extract the price:

curl -ks https://bulevip.com/es/pre-entreno/20927-cellucor-c4-original-pre-workout-390-gr-60-servicios.html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo (new DOMXPath($d))->query("//span[contains(@class,\"product-price-js\")]")->item(0)->getAttribute("content");'
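if php-cli is not at hand, the price also shows up in the inline script quoted in the question, which is JavaScript text rather than HTML markup. the sketch below matches that exact `.attr('content','...')` call with sed. it assumes the page keeps emitting the call in precisely this form, so it is far more brittle than the DOM query above; `sample` and `extract_price` are names made up for this illustration.

```shell
# Sample line copied from the question (inline jQuery, not HTML markup).
sample="\$(document).ready(function(){\$('.sequra-product-price-js').text('27,62 €');\$('.sequra-product-price-js').attr('content','27.62');"

# Hypothetical helper: print the value passed to .attr('content','...')
# on the price element. Assumes the call appears exactly as in the sample.
extract_price() {
  sed -n "s/.*sequra-product-price-js')\.attr('content','\([0-9.]*\)').*/\1/p"
}

printf '%s\n' "$sample" | extract_price
```

on the sample line this prints 27.62. against the live page, the same function would be fed from `curl -ks <url> | extract_price`.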

btw you should not parse HTML with regex.

for example, you say that you already have title extraction working with

grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'

but it is flawed: it will fail on any HTML-encoded characters. for example, if the title is <title>bl&aring;b&aelig;rsyltet&oslash;y</title>, the correct decoded text is blåbærsyltetøy (norwegian for blueberry jam), but your extractor will end up with bl&aring;b&aelig;rsyltet&oslash;y, which is completely unreadable. it will also fail if the title includes special characters like & or < or >, which appear in the source as &amp;, &lt; and &gt;. to get the correctly decoded text, you could instead do:

php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'

which will correctly decode any html-encoded character :)

and if we put it to the test:

$ echo '<title>bl&aring;b&aelig;rsyltet&oslash;y</title>' > html

$ cat html | grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
bl&aring;b&aelig;rsyltet&oslash;y

$ cat html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'
blåbærsyltetøy

$

or if the title is AT&T (the world's largest telecom company, from USA), which must be encoded as AT&amp;T:

$ echo '<title>AT&amp;T</title>' > html

$ cat html | grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
AT&amp;T

$ cat html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'
AT&T
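entities are not the only way the regex approach breaks. grep works line by line, so a title whose text sits on its own line is lost as well. a minimal sketch, assuming a POSIX shell and a grep that supports -o (such as GNU grep); `extract_title_regex` is just a wrapper name invented here for the pipeline from the question:

```shell
# The question's regex-based title extractor, wrapped in a function.
extract_title_regex() {
  grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
}

# Title on a single line: the extractor returns the text.
printf '[%s]\n' "$(printf '<title>Blueberry jam</title>\n' | extract_title_regex)"

# Same title with line breaks inside the element: grep only sees
# "<title>" on the first line, so the text is lost.
printf '[%s]\n' "$(printf '<title>\nBlueberry jam\n</title>\n' | extract_title_regex)"
```

the first call prints [Blueberry jam], the second prints [] because only the bare opening tag is matched and then stripped by sed.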
