php-cli's DOMDocument + DOMXPath can easily extract the price:
curl -ks https://bulevip.com/es/pre-entreno/20927-cellucor-c4-original-pre-workout-390-gr-60-servicios.html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo (new DOMXPath($d))->query("//span[contains(@class,\"product-price-js\")]")->item(0)->getAttribute("content");'
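if you want something a bit more reusable than a one-liner, here is a minimal sketch of the same XPath lookup wrapped in a shell function. the function name extract_price and the sample markup are made up for illustration (the sample is not copied from the real page), and it assumes php-cli with the DOM extension is installed:

```shell
# extract_price (hypothetical helper): reads HTML on stdin and prints the
# "content" attribute of the first <span> whose class contains "product-price-js".
extract_price() {
  php -r '
    $doc = new DOMDocument();
    @$doc->loadHTML(stream_get_contents(STDIN));   // @ hides warnings about sloppy markup
    $nodes = (new DOMXPath($doc))->query("//span[contains(@class,\"product-price-js\")]");
    if ($nodes === false || $nodes->length === 0) {
        fwrite(STDERR, "price element not found\n");
        exit(1);
    }
    echo $nodes->item(0)->getAttribute("content"), "\n";
  '
}

# Made-up sample markup, only to show the call shape.
printf '%s' '<html><body><span class="price product-price-js" content="29.90">29,90 EUR</span></body></html>' | extract_price
```

unlike the one-liner, this exits with a non-zero status when the span is missing, so a calling script can tell "no price found" apart from an empty price.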
btw, you should not parse HTML with regex.
for example, you say that you already have title extraction working with
grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
but it is flawed: it leaves any HTML-encoded characters undecoded. for example, if the title is <title>bl&aring;b&aelig;rsyltet&oslash;y</title>, the correct translation is blåbærsyltetøy (Norwegian for blueberry jam), but your extractor will end up with bl&aring;b&aelig;rsyltet&oslash;y, which is completely unreadable. it will also give the wrong result if the title includes special characters like & or < or > or ^ written as entities (e.g. &amp;, &lt;, &gt;).
to get the correct translation, you could instead do:
php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'
which will correctly translate any HTML-encoded character :)
and if we put it to the test:
$ echo '<title>bl&aring;b&aelig;rsyltet&oslash;y</title>' > html
$ cat html | grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
bl&aring;b&aelig;rsyltet&oslash;y
$ cat html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'
blåbærsyltetøy
$
or if the title is AT&T (a large telecom company from the USA), which must be encoded as AT&amp;T in HTML:
$ echo '<title>AT&amp;T</title>' > html
$ cat html | grep -o "<title>[^<]*" | sed -e 's/<[^>]*>//g'
AT&amp;T
$ cat html | php -r '$d = new DOMDocument(); @$d->loadHTML(stream_get_contents(STDIN)); echo $d->getElementsByTagName("title")->item(0)->textContent;'
AT&T
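one thing the one-liners above do not handle is a page with no <title> at all, where item(0) returns null and PHP stops with an error. a rough sketch of a more defensive version follows; the function name extract_title is made up for illustration, and it assumes php-cli with the DOM extension is installed:

```shell
# extract_title (hypothetical helper): reads HTML on stdin and prints the
# entity-decoded text of the first <title> element.
extract_title() {
  php -r '
    $doc = new DOMDocument();
    @$doc->loadHTML(stream_get_contents(STDIN));   // @ hides warnings about sloppy markup
    $title = $doc->getElementsByTagName("title")->item(0);
    if ($title === null) {
        fwrite(STDERR, "no <title> element found\n");
        exit(1);
    }
    echo trim($title->textContent), "\n";
  '
}

printf '%s' '<title>AT&amp;T</title>' | extract_title
```

you would use it the same way as the grep/sed pipeline, e.g. cat html | extract_title, and check the exit status if the input might not be a full page.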