I'm working on a PHP project for school. The task is to build a website to grab and analyze data from another website. I have the framework set up, and I am able to grab certain data from the desired site, but I can't seem to get the syntax right for other data that I need to obtain.

For example, the site that I am currently analyzing is a page for a specific item returned from a search of Amazon.com (e.g. search amazon.com for "iPad" and pick the first result). I am able to grab the title of the product's page, but I also need the review count and the price, and therein lies the issue. I'm using preg_match to get the title (works fine), but I'm not able to get the reviews or the price. I keep getting the Undefined Offset error, which I've discovered means that nothing matching the given pattern is being returned. Simply checking whether something has been returned will not help me, since I need to obtain these data for my analysis. The <span>s that I'm trying to mine are unique on the page, so there is only one instance of each.

The page source for my product page contains the following snippets of HTML that I need to grab. (The website needs to be able to handle any search, but for this example I searched for "iPad".)

<span id="priceblock_ourprice" class="a-size-medium a-color-price">$397.74</span>

I need the 397.74.

<span id="acrCustomerReviewText" class="a-size-base">1,752 customer reviews</span>

I need the 1,752.

I've tried all combinations of escape characters, wildcards, etc., but I can't seem to get beyond the Undefined Offset error. An example of my code is as follows where $link is the URL, and $f is an empty array in which I want to store the result (Note: There is NOT a space after the '<' in "< span..." It just erased everything up to the "...(.*)..." when I typed it as "< span..." without the space):

preg_match("#\< span id\=\"priceblock\_ourprice\" class\=\"a\-size\-medium a\-color\-price\"\>(.*)\<\/span\>#", file_get_contents($link), $f);

$price=$f[1]; //Offset error occurs on this line

echo $price;

Please help. I've been beating my head against this for the past two days now. I'm hoping I'm just doing something stupid. This is my first experience with preg_match and data mining. Thank you very much in advance for your time and assistance.

carloabelli

1 Answer

Code

As stated by @cabellicar123, you shouldn't use regex to parse HTML. I believe what you are looking for is strpos() and substr(). It should look something like this:

function get_content($string, $begintag, $endtag) {
  if (strpos($string, $begintag) !== False) {
    $location = strpos($string, $begintag) + strlen($begintag);
    $leftover = substr($string, $location);
    $contents = substr($leftover, 0, strpos($leftover, $endtag));
    return $contents;
  }
}
// Usage (change the variables to suit your page):
$str = file_get_contents('http://www.amazon.com/OLB3-Official-League-Recreational-Ball/dp/B004KOBRMC/');
$beg = '<b class="priceLarge">$';
$end = '</b>';
// get_content() only returns the value, so echo it (or store it) to see anything.
echo get_content($str, $beg, $end);

I've provided a working example that returns the price of the item on the page, in this case a Rawlings baseball.
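To try the function without fetching a live page, here is a minimal sketch that runs it against the two snippets from the question. The inline $html string and the variable names are assumptions for illustration; the markup on a real Amazon page may differ and can change at any time.

```php
<?php
// Same function as in the answer, repeated so this sketch runs on its own.
function get_content($string, $begintag, $endtag) {
  if (strpos($string, $begintag) !== False) {
    $location = strpos($string, $begintag) + strlen($begintag);
    $leftover = substr($string, $location);
    $contents = substr($leftover, 0, strpos($leftover, $endtag));
    return $contents;
  }
}

// Hypothetical page fragment built from the snippets in the question.
$html = '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$397.74</span>'
      . '<span id="acrCustomerReviewText" class="a-size-base">1,752 customer reviews</span>';

// Including the "$" in the begin tag and " customer reviews" in the end tag
// leaves only the number between them.
$price = get_content($html, '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$', '</span>');
$reviews = get_content($html, '<span id="acrCustomerReviewText" class="a-size-base">', ' customer reviews</span>');

echo $price, "\n";    // 397.74
echo $reviews, "\n";  // 1,752
```

Note that the begin tag has to match the page source character for character, including attribute order and spacing, or strpos() will not find it.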

Explanation

I'll go through the code, line by line, and explain every piece.

function get_content($string, $begintag, $endtag)

$string is the string being searched through (in this case an Amazon page), $begintag is the opening tag of the element being searched for, and $endtag is the closing tag of that element. NOTE: Only the first occurrence of the opening tag is used; any later occurrences are ignored.

if (strpos($string, $begintag) !== False)

Checks whether the beginning tag actually exists. Note the strict !== False: strpos() returns 0 when the match is at the very start of the string, and a loose comparison (!=) would treat that 0 as False.
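A small sketch of why the strict comparison matters. The $page string here is made up for illustration.

```php
<?php
// Hypothetical input where the tag sits at the very start of the string.
$page = '<b>9.99</b> and more text';

$pos = strpos($page, '<b>');      // match at offset 0, so int 0
$missing = strpos($page, '<i>');  // no match, so bool false

var_dump($pos == false);       // bool(true): loose comparison treats 0 like false
var_dump($pos === false);      // bool(false): strict comparison tells them apart
var_dump($missing === false);  // bool(true)
```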

$location = strpos($string, $begintag) + strlen($begintag);

strpos() returns the position of the first occurrence of $begintag in $string, which is where the tag starts. Adding the length of $begintag (not of $string) moves that position to just past the end of the tag, where the content begins.
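The offset arithmetic can be checked on a short string. This is only a sketch; the sample $string is an assumption, not taken from a real page.

```php
<?php
// Hypothetical one-line "page" so the offsets are easy to follow.
$string = 'Price: <b class="priceLarge">$12.50</b>';
$begintag = '<b class="priceLarge">$';

$start = strpos($string, $begintag);     // offset where the tag begins
$location = $start + strlen($begintag);  // offset of the first character after the tag

echo $start, "\n";
echo $location, "\n";
echo substr($string, $location), "\n";   // 12.50</b>
```

Only the length of the tag is added, so $location can never run past the end of $string.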

$leftover = substr($string, $location);

Now that we have the $location of the opening tag, we need to narrow the $string down by setting $leftover to the part of the $string after $location.

$contents = substr($leftover, 0, strpos($leftover, $endtag));

This gets the position of the $endtag in $leftover, and stores everything before that $endtag in $contents.
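One caveat: if $endtag is missing, strpos() returns False, and as far as I can tell substr() then treats that as a length of 0 and gives back an empty string. Below is a hedged sketch of a stricter variant that returns null whenever either tag is missing. The name get_content_or_null is made up for this example.

```php
<?php
// Hypothetical variant of get_content() that reports failure explicitly.
function get_content_or_null($string, $begintag, $endtag) {
  $start = strpos($string, $begintag);
  if ($start === false) {
    return null;  // opening tag not found
  }
  $leftover = substr($string, $start + strlen($begintag));
  $end = strpos($leftover, $endtag);
  if ($end === false) {
    return null;  // closing tag not found
  }
  return substr($leftover, 0, $end);
}

var_dump(get_content_or_null('<b>$5.00</b>', '<b>$', '</b>'));   // string(4) "5.00"
var_dump(get_content_or_null('<b>$5.00', '<b>$', '</b>'));       // NULL
var_dump(get_content_or_null('no tags here', '<b>$', '</b>'));  // NULL
```

The caller can then test the result with === null before using it, instead of running into an empty value later.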

As for the last few lines of code, they are specific to this example and just need to be changed to fit the circumstances.
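If exact string matching turns out to be too brittle, an HTML parser is another option. The following is a sketch using PHP's DOMDocument and DOMXPath rather than the strpos() approach above. It assumes the dom extension is enabled, and the $html fragment is built from the snippets in the question, not from a live page.

```php
<?php
// Hypothetical fragment built from the snippets in the question.
$html = '<html><body>'
      . '<span id="priceblock_ourprice" class="a-size-medium a-color-price">$397.74</span>'
      . '<span id="acrCustomerReviewText" class="a-size-base">1,752 customer reviews</span>'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // real pages are rarely valid HTML; keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

// Look the elements up by id instead of matching the whole tag text.
$xpath = new DOMXPath($doc);
$priceNode = $xpath->query('//span[@id="priceblock_ourprice"]')->item(0);
$reviewNode = $xpath->query('//span[@id="acrCustomerReviewText"]')->item(0);

// item(0) is null when nothing matched, so check before reading textContent.
$price = $priceNode ? ltrim($priceNode->textContent, '$') : null;
$reviews = $reviewNode ? (int) str_replace(',', '', $reviewNode->textContent) : null;

var_dump($price);
var_dump($reviews);
```

Because the lookup is by id, it keeps working if the class attribute or the attribute order changes.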

Community
zFurman
  • Thanks for the hasty reply. I was finally able to attempt your solution, but it is not returning anything. I copied your code verbatim (will change it once it works), but it is not returning anything. Through print statements, I was able to discover that it made it inside the IF statement in the get_content function, but it is not printing anything for $leftover, $contents, nor $price where $price=get_contents(...). I confirmed that the $beg string is in fact on the page that I am searching (it is in the page source exactly once). Am I missing something? – user3912483 Aug 11 '14 at 23:14
  • I changed the beginning tag to the acrCustomerReview one, and I met with the same result. It is not displaying anything for the return of the function, even when the result is saved to a variable. What am I not doing correctly? – user3912483 Aug 11 '14 at 23:40
  • I'm new to this site and just noticed the "@[name]" feature, so I'm tagging you as well, @zFurman, in case it helps get these comments to you faster. – user3912483 Aug 12 '14 at 01:31
  • it seems that the strpos() call is not returning anything, i.e. it is returning null. I can see the exact line of html code that I'm searching for in the page source. It is definitely there. Why is the code (your example) returning null? – user3912483 Aug 12 '14 at 01:54
  • Sorry for so many questions, but I also wanted to know the purpose behind adding the length of the entire string in the conditional of the function. Assuming $string has the html for the entire page (string length of, say, 1500), and the beginning tag occurs at position 500 (for example), wouldn't the conditional be setting $location to 500 + 1500, which would start it at 2000? If that's the case, then $leftover would be searching for the rest of the entire html code (1500) starting at $location (2000), would it not? I must be confused with the purpose of "+ strlen($string)" – user3912483 Aug 12 '14 at 02:39
  • Sorry for the delay. I'll try to answer as many of your questions as I can. I'll start with your first questions and move on to later ones as I go. @user3912483 – zFurman Aug 13 '14 at 03:30
  • I tested the code before I posted it, so I assumed it worked. However, after your comments, I retested the code, and found errors which I fixed. I also changed the code to include an example page which it would return the price from. I edited my post to include the new code. @user3912483 – zFurman Aug 13 '14 at 08:15
  • OK, it seems my previous comment makes all of your other questions unnecessary (if not, tell me), but if you have any more, feel free to ask. @user3912483 – zFurman Aug 13 '14 at 08:20