Using PHP to extract specific data from websites

Question

I am new to PHP and I'm looking to extract data such as inventory quantity and sizes from different websites. I'm confused about how I would go about doing this. Would DOMDocument be the way to go?

I'm not sure whether that is the best method for this.

I was attempting to extract the data on lines 164-174 of the linked page.

Any help is greatly appreciated!

EDIT - this is my updated code. I don't really think it's the most efficient way to do things, though.

<html>
<body>
<?php



$url = 'https://kithnyc.com/collections/adidas/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';
$html = file_get_contents($url);

//preg_match('~itemprop="image"\scontent="(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $image);
//$image = $image[1];

preg_match('~,"title":"(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $title);
$title = $title[1];


preg_match_all('~{"id":(\d+)~', $html, $id);
$id = $id[1];

preg_match_all('~","public_title":"(\d+..)~', $html, $size);
$size = $size[1];

preg_match_all('~inventory_quantity":(\d+)~', $html, $quantity);
$quantity = $quantity[1];


function plain_url_to_link($url) {
return preg_replace(
    '%(https?|ftp)://([-A-Z0-9./_*?&;=#]+)%i',
    '<a rel="nofollow" href="$0" target="_blank">$0</a>', $url);
}



$i = 0;
$j = 2;

echo "$title<br />";
echo "<br />";

//echo $image;

echo plain_url_to_link($url);
echo "<br />";
echo "<br />";

for($i = 0; $i < 18; $i++) {
print "Size: $size[$i] --- Quantity: $quantity[$i] --- ID: $id[$j]";
$j++;
echo "<br />";
}


echo "<br />";
//print_r($quantity);




?>
</body>
</html>

Can you include any code that you have so far? Have you made attempts at it yet? — Vandal, Dec 24 '16 at 03:21
If you want to get data from another website, use PHP's native DOM or Simple HTML DOM: http://simplehtmldom.sourceforge.net — Kumar, Dec 24 '16 at 03:24
Possible duplicate of [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) — chris85, Dec 24 '16 at 04:25
I can't figure out how to extract using DOMDocument. What would the tag name, class, or ID be for lines 164-174 in the link I provided? — Jose Paz, Dec 24 '16 at 07:17
Please add the relevant code to the question. If you're trying DOMDocument (and you should), post that code, instead of the earlier attempt using regex. — GolezTrol, Dec 24 '16 at 09:18
Regex is as good or bad for parsing (X)HTML as any other method. What you have to use depends on the complexity of your task. Most of the time regex is the best and easiest solution, but for complex tasks writing huge patterns is very confusing for less experienced 'players', and other parsing methods become the better solution. Sometimes regex is a pain in the a**; again, it really depends on you and your skills. — Wh1T3h4Ck5, Dec 24 '16 at 09:32

score 2 · Accepted Answer · edited May 23 '17 at 12:33

As a general rule of thumb, you must avoid parsing HTML/XML content with regular expressions. Here's why:

Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.

Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.

— https://stackoverflow.com/a/590789/65732

Use a DOM parser instead which is specifically designed for the purpose of parsing HTML/XML documents. Here's an example:

# Installing Symfony's dom parser using Composer
composer require symfony/dom-crawler symfony/css-selector

<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');

$crawler  = new Crawler($html);
$price    = $crawler->filter('.product-header-title[itemprop="price"]')->text();
// UPDATE: Does not work! as the page updates the button text 
// later with javascript. Read more for another solution.
$in_stock = $crawler->filter('#AddToCartText')->text();

if ($in_stock == 'Sold Out') {
    $in_stock = 0; // or `false`, if you will
}

echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: Buy Now
// We'll fix "Availability" later...

Using such parsers, you have the ability to extract elements using XPath as well.
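As a rough illustration of the XPath route, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath rather than the Symfony crawler. The $html string is a hand-written stand-in for the fetched page and is only assumed to contain a price element marked with itemprop="price"; the real page source may look different.

```php
<?php

// Hypothetical, trimmed stand-in for the fetched page source.
$html = '<html><body>'
      . '<span id="ProductPrice" class="product-header-title -price" '
      . 'itemprop="price" content="220">$220.00</span>'
      . '</body></html>';

$dom = new DOMDocument();

// Real-world HTML is rarely valid, so collect libxml warnings
// internally instead of letting them be printed.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Select by attribute, which sidesteps any awkward class names.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//span[@itemprop="price"]');

$price = null;
if ($nodes !== false && $nodes->length > 0) {
    $price = trim($nodes->item(0)->textContent);
}

echo $price === null ? "Price not found\n" : "Price: $price\n";
// Prints: Price: $220.00
```

With the Symfony crawler, the same expression could be passed to filterXPath() instead of filter().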

But if you want to parse the javascript code included in that page, you'd better use a browser emulator like Selenium. Then you have programmatic access to all the globally available javascript vars/functions in that page.

Update

Getting the price

So you were getting this error running the above code:

PHP Fatal error:
Uncaught Symfony\Component\CssSelector\Exception\SyntaxErrorException: Expected identifier, but found.

That's because the target page uses an invalid class name for the price element (-price, which would be selected as .-price), and Symfony's CSS selector component cannot parse it, hence the exception. Here's the element:

<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>

To work around it, let's use the itemprop attribute instead. Here's the selector that can match it:

.product-header-title[itemprop="price"]

I updated the above code accordingly to reflect it. I tested it and it's working for the price part.

Getting the stock status

Now that I've actually tested the code, I see that the stock status of the product is set later using javascript, so it's not there when you fetch the page using file_get_contents(). You can see it for yourself: refresh the page and the button appears as Buy Now, then a second later it changes to Sold Out.

But fortunately, the quantity of the product variant is buried deep somewhere in the page. Here's a pretty printed copy of the huge object Shopify uses to render the product pages.

So now the problem is parsing javascript code with PHP. There are a few general approaches to tackle the problem:

Feel free to skip these approaches as they are not specific to your problem. Jump straight to number 6, if you just want a solution to your question.

1. The most reliable and common approach to scraping data from sites that rely heavily on javascript is to use a browser emulator like Selenium, which is able to execute javascript code. Have a look at Facebook's PHP WebDriver package, which is the most sophisticated PHP binding for Selenium WebDriver available. It provides you with an API to remotely control web browsers and execute javascript against them.

   Also, see Behat's Mink, which comes with various drivers for both headless browsers and full-fledged browser controllers. The drivers include Goutte, BrowserKit, Selenium1/2, Zombie.js, Sahi and WUnit.

2. See V8js, the PHP extension that embeds the V8 javascript engine into PHP. It allows you to evaluate javascript code right from your PHP script. It's a bit of an overkill to install a PHP extension if you're not going to use the feature heavily, but if you want to go this way, you can extract the relevant script using the DOM parser:

   $script = $crawler->filterXPath('//head/following-sibling::script[2]')->text();

3. Use HtmlUnit to parse the page and then feed the final HTML to PHP. You'll need a small Java wrapper. Right, overkill for your case.

4. Extract the javascript code and parse it using a JS parser/tokenizer library like hiltonjanfield/js4php5 or squizlabs/PHP_CodeSniffer, which has a JS tokenizer.

5. If the application makes ajax calls to manipulate the DOM, you might be able to re-dispatch those requests and parse the responses for your own application's sake. An example is the ajax call the page makes to cart.js to retrieve the data related to the cart items. But that's not the case for reading the product variant quantity here.

6. You may recall that I told you it's a bad idea to use regular expressions to parse entire HTML/XML documents. But it's OK to use them partially, to extract strings from an HTML/XML document, when other approaches are even harder. Read the SO answer I quoted at the top of this post if you're unsure about when to use them.

This approach is about matching the inventory_quantity of the product variant by running a simple regex against the whole page source (or, for better performance, only against the relevant script tag):

<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');

$crawler  = new Crawler($html);
$price    = trim($crawler->filter('.product-header-title[itemprop="price"]')->text());

// \d+ (not \d) so that multi-digit quantities are captured whole
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];

echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: 0

This regex needs a variant ID (35276776455 in this case) to work, as each inventory quantity belongs to a specific variant of the product. You can extract it from the URL's query string: ?variant=35276776455.
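Pulling the variant ID out of the URL can be sketched as follows. The helper name variant_id_from_url() is made up for this illustration; it relies only on PHP's parse_url(), parse_str(), ctype_digit() and preg_quote(), and it assumes the ID is always a plain run of digits.

```php
<?php

/**
 * Returns the numeric `variant` query parameter of a product URL,
 * or null if it is absent. (Hypothetical helper for illustration.)
 */
function variant_id_from_url(string $url): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string, or a malformed URL
    }

    parse_str($query, $params);
    $variant = $params['variant'] ?? null;

    // Accept digits only, so the value is safe to drop into a regex.
    return is_string($variant) && ctype_digit($variant) ? $variant : null;
}

$url = 'https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';
$variantId = variant_id_from_url($url);

$pattern = null;
if ($variantId !== null) {
    // Same regex as above, built from the extracted ID.
    // \d+ captures quantities of more than one digit.
    $pattern = '/' . preg_quote($variantId, '/') . ',.+?inventory_quantity":(\d+)/';
    echo "Variant: $variantId\n";
}
```

The resulting $pattern can then be passed to preg_match() in place of the hard-coded one.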

Now that we're done with the stock status and we've done it with regex, you might want to do the same with the price and drop the DOM parser dependency:

<?php

$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');

// You need to check if it's matched before assigning 
// $price[1]. Anyway, this is just an example.
preg_match('/itemprop="price".+?>\s*\$(.+?)\s*<\/span>/s', $html, $price);
$price = $price[1];

// \d+ (not \d) so that multi-digit quantities are captured whole
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];

echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: 220.00 - Availability: 0
// (the price regex captures the amount without the leading "$")
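The "check if it's matched" note in the comment above could look something like this. It is only a sketch: first_capture() is a hypothetical helper, and the $html string is a hand-written fragment standing in for the real page source.

```php
<?php

/**
 * Returns the first capture group of $pattern in $subject, or null
 * when nothing matches. (Hypothetical helper for illustration.)
 */
function first_capture(string $pattern, string $subject): ?string
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $subject, $m) === 1 && isset($m[1])) {
        return $m[1];
    }
    return null;
}

// Hypothetical fragment standing in for the fetched page source.
$html = '<span id="ProductPrice" itemprop="price" content="220">$220.00</span>'
      . '{"id":35276776455,"title":"9","inventory_quantity":0}';

$price    = first_capture('/itemprop="price".+?>\s*\$(.+?)\s*<\/span>/s', $html);
$in_stock = first_capture('/35276776455,.+?inventory_quantity":(\d+)/', $html);
$missing  = first_capture('/99999999999,.+?inventory_quantity":(\d+)/', $html);

// ?? only falls back on null, so a quantity of "0" is still printed.
echo 'Price: ' . ($price ?? 'n/a') . ' - Availability: ' . ($in_stock ?? 'n/a') . "\n";
// Prints: Price: 220.00 - Availability: 0
```

Returning null instead of reading $m[1] blindly avoids an "undefined offset" notice when the page layout changes and a pattern stops matching.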

Conclusion

Even though I still believe that it's a bad idea to parse HTML/XML documents with regex, I must admit that the available DOM parsers are not able to parse embedded javascript code (and probably never will be), which is your case. We can partially use regular expressions to extract strings from HTML/XML: the parts which are not parsable using DOM parsers. So, all in all:

Use DOM parsers to parse/scrape the HTML code that initially exists in the page.
Intercept ajax calls that may include information you want. Re-call them in a separate http request to get the data.
Use browser emulators for parsing/scraping JS-heavy sites that populate their pages using ajax calls and such.
Partially use regex to extract what is not extractable using DOM parsers.

If you just want these two fields, you're fine to go with regex. Otherwise, consider other approaches.
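For more than two fields, for instance the size and quantity of every variant, one possible approach is to decode the embedded product object as JSON once and loop over its variants key, rather than writing one regex per field. The sketch below assumes the JSON has already been isolated from the script tag (that step is page-specific and not shown). The $json string is a made-up, trimmed example; its field names (variants, id, public_title, inventory_quantity) follow those that appear in the page source, but the values are invented.

```php
<?php

// Hypothetical, trimmed stand-in for the product JSON embedded in the page.
$json = '{"id":1,"title":"Sample Shoe","variants":['
      . '{"id":101,"public_title":"8","inventory_quantity":0},'
      . '{"id":102,"public_title":"8.5","inventory_quantity":3},'
      . '{"id":103,"public_title":"9","inventory_quantity":12}]}';

// Decode into associative arrays.
$product = json_decode($json, true);
if (!is_array($product) || !isset($product['variants'])) {
    throw new RuntimeException('Could not decode the product object');
}

// One row per variant: size, quantity and variant ID.
$rows = [];
foreach ($product['variants'] as $variant) {
    $rows[] = [
        'id'       => $variant['id'],
        'size'     => $variant['public_title'],
        'quantity' => $variant['inventory_quantity'],
    ];
}

foreach ($rows as $row) {
    printf("Size: %-4s Quantity: %-3d ID: %d\n", $row['size'], $row['quantity'], $row['id']);
}
```

Because the loop runs over whatever variants are present, there is no need to hard-code how many sizes a product has.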

[Regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — GolezTrol, Dec 24 '16 at 09:20
Thanks a lot for the response, it cleared some things up. I attempted this but am getting an error I can't seem to figure out. I'll edit it into the post. — Jose Paz, Dec 25 '16 at 03:19
Thank you so much. I really was not using the object that Shopify uses to render the pages. Now, would each variant be a different shoe, or are variants different for every size? I was attempting to make a sort of table or chart that would list the sizes in one column and the quantities in a separate column. — Jose Paz, Dec 25 '16 at 21:11
Also, if I am not really able to parse with DOM parsers in this case and can only partially parse using regex, is there anything else you would suggest that would be better/simpler? — Jose Paz, Dec 25 '16 at 21:13
Variants, as the name suggests, are variants of a single product. Different sizes, different colors, etc. They might have different prices, different inventory quantities, some might be available to purchase and some might not. Have a look at the `variants` key in the [`product` object](https://hastebin.com/raw/kolisiweja). — sepehr, Dec 26 '16 at 06:31
Regarding your second comment, I already included all other possible approaches in the answer. It all depends on the extents of the web scraping operation you're doing. Parse anything available in the page source using a DOM parser with CSS or XPath, unless it's not parseable (e.g. javascript code). Use browser emulators to interact with js-heavy sites that populate the page with ajax requests. Partially use regex as your last resort. — sepehr, Dec 26 '16 at 06:34
I ended up using regex for just about the whole thing and got to where I was trying to get. I just ended up using a for loop to loop through all the sizes, quantities, etc. Not sure if it was the best approach, but I have almost completed what I wanted to do. I will post my new code to see what you think about it. I would really appreciate your input. — Jose Paz, Dec 27 '16 at 03:58
As long as it gets the job done and it's not too hard to implement. Well done. — sepehr, Dec 27 '16 at 05:19
I really appreciate your help. Probably would not have gotten this without it. Or at least not anytime soon. Thank you again! — Jose Paz, Dec 27 '16 at 05:53
