
I'm successfully scraping a website to get space-separated data from the page:

$html = file_get_contents("http://www.somewebsite.com");
$scores_doc = new DOMDocument();

$scores_doc->loadHTML($html);
$scores_xpath = new DOMXPath($scores_doc);
$scores_row  = $scores_xpath->query('//td[@class="first"]');

foreach($scores_row as $row){
    echo $row->nodeValue . "<br/>";
}

Example output:

23 Crimmons, Bob (CA)
48 Silas, Greg (RI)
82 Huston, Roger (TX)
21 Lester, Terry (NC)

Instead of printing the output with 'echo', I need to split each value into four smaller pieces and store them in variables (an array or otherwise). I know the MySQL side very well; I just don't use PHP day to day. I tried this in place of the 'echo', after defining $data as an array:

$data[] = echo $row->nodeValue;
Bob Ortiz
user3741598

1 Answer


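A side note on the syntax: if you just want to add the whole string (for example 23 Crimmons, Bob (CA)) to an array as a single element, drop the echo. echo only prints a value and returns nothing, so it cannot appear on the right-hand side of an assignment.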

$data[] = echo $row->nodeValue;

Should be:

$data[] = $row->nodeValue;
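Put together, the corrected loop could look like the sketch below. The inline HTML is a hypothetical stand-in for the scraped page, so the example runs without a network request; the real markup may differ.

```php
<?php
// Hypothetical markup standing in for the scraped page.
$html = '<table><tr><td class="first">23 Crimmons, Bob (CA)</td></tr>'
      . '<tr><td class="first">48 Silas, Greg (RI)</td></tr></table>';

$scores_doc = new DOMDocument();
$scores_doc->loadHTML($html);
$scores_xpath = new DOMXPath($scores_doc);

$data = [];
foreach ($scores_xpath->query('//td[@class="first"]') as $row) {
    $data[] = $row->nodeValue; // append the cell text as one string
}

print_r($data);
```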

Three possible solutions to your problem.

Solution 1: Improve scraping

The best way to get those four values separately would be to query the page more specifically. You can try to refine the XPath query on this line:

$scores_xpath->query('//td[@class="first"]');

The query you can use depends on the structure of the page you're scraping.
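As a rough illustration only: if the page happened to put each value in its own cell, you could query each one relative to its row. The markup and the class names score, name and state below are pure assumptions made up for this sketch; the real page puts everything in one td, so this only helps if a finer-grained structure exists.

```php
<?php
// Hypothetical markup: one cell per value. The real page may look nothing like this.
$html = '<table>'
      . '<tr><td class="score">23</td><td class="name">Crimmons, Bob</td><td class="state">CA</td></tr>'
      . '<tr><td class="score">48</td><td class="name">Silas, Greg</td><td class="state">RI</td></tr>'
      . '</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('//tr') as $tr) {
    // evaluate() with string() returns the text of the first match, relative to $tr
    $rows[] = [
        'score' => $xpath->evaluate('string(td[@class="score"])', $tr),
        'name'  => $xpath->evaluate('string(td[@class="name"])', $tr),
        'state' => $xpath->evaluate('string(td[@class="state"])', $tr),
    ];
}

print_r($rows);
```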

Solution 2: Splitting string using PHP explode

You could use PHP's explode function to split the string on spaces, but note that this breaks when a name itself contains a space, because the string is then split into more than four pieces.

echo $row->nodeValue . "<br/>";

Can be something like:

// Assuming that $row->nodeValue has the string `23 Crimmons, Bob (CA)` as its value
$explodeRow = explode(' ', $row->nodeValue);

/*
* $explodeRow now contains four values. 
*
* $explodeRow[0] = "23";
* $explodeRow[1] = "Crimmons,";
* $explodeRow[2] = "Bob";
* $explodeRow[3] = "(CA)";
*/

You can then remove the ( and ) characters from $explodeRow[3], and the trailing comma from $explodeRow[1], with PHP's str_replace, preg_replace, substr or trim functions, for example.
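A minimal sketch of that cleanup, assuming the value really does split into exactly four space-separated pieces (the variable names are just for illustration):

```php
<?php
$value = '23 Crimmons, Bob (CA)';

// Split on single spaces and name the four pieces.
list($score, $last, $first, $state) = explode(' ', $value);

$last  = rtrim($last, ',');  // "Crimmons," becomes "Crimmons"
$state = trim($state, '()'); // "(CA)" becomes "CA"

echo "$score | $last | $first | $state\n";
```

trim and rtrim take a list of characters to strip from the ends of the string, which avoids a regular expression for this simple case.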

Solution 3: Splitting string using regular expressions

Alternatively, you can fetch the leading number first, then the part between ( and ), and finally split the remaining text on the comma. This can also cause problems when a line contains more than one comma.

An example of this solution would be something like:

preg_match("~^(\d+)~", $row->nodeValue, $number);
$number[1]; # will be 23

preg_match("#\((.*?)\)#", $row->nodeValue, $last);
$last[1]; # will be CA

$middleExp = explode("(", $row->nodeValue, 2);
// Drop the leading number from the part before "(", then trim whitespace
$middle = trim(substr($middleExp[0], strlen($number[1])));

$middleExp2 = explode(",", $middle, 2);
$middleL = trim($middleExp2[0]); # will be Crimmons
$middleR = trim($middleExp2[1]); # will be Bob
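The same idea can be condensed into a single pattern. This is only a sketch: parseScoreLine is a hypothetical helper name, and it assumes ordinary whitespace between the fields and exactly one comma separating the last name from the first name. The score may have any number of digits.

```php
<?php
// Hypothetical helper: parse "score Last, First (ST)" in one pass.
function parseScoreLine(string $line): ?array
{
    // number, whitespace, last name up to the comma, first name, "(state)"
    $pattern = '/^(\d+)\s+([^,]+),\s*(.+?)\s*\(([^)]*)\)\s*$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null; // line does not have the expected shape
    }
    return [
        'score' => (int) $m[1],
        'last'  => trim($m[2]),
        'first' => $m[3],
        'state' => $m[4],
    ];
}

$parsed = parseScoreLine('7 Van Der Berg, Mary Ann (NY)');
print_r($parsed);
```

Because the names are delimited by the comma and the parenthesis rather than by spaces, names that contain spaces stay in one piece.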
Bob Ortiz
  • Thanks for the consolidation of 'explode' into one command. The only reason your other break up option won't work - the score could be 1, 2 or 3 digits. Is there anything to change the appearance of a space in the output? On the screen it sure looks like a space between the values - but 'explode' isn't working and everything is still piling into [0]. – user3741598 May 13 '15 at 14:19
  • Right now I can't tell what is between the values - looks like a single space but since explode isn't working with a single space, I'm trying other non-printable characters - white spaces, tabs, etc. The preg_match to get the $number did not work... – user3741598 May 13 '15 at 14:48
  • Try to return the ASCII value of that "space"-looking character. You can use `substr()` to get to the right position of the string and `ord()` to check the ASCII value. Then lookup the corresponding ASCII code, for example here http://ascii.cl/. And try to explode using that character or replace it first. – Bob Ortiz May 13 '15 at 14:52
  • Took another look at the HTML - &nbsp - explode doesn't appear to work with it so I'll try to work the Splitting option. Unless someone has a way to explode with &nbsp... – user3741598 May 13 '15 at 15:59
  • Try something like str_replace(chr(160), " ", $row->nodeValue); or str_replace("\xA0", " ", $row->nodeValue) (I didn't test it). chr(160) is a non-breaking space. Or, instead of replacing, explode using \xA0 or chr(160). – Bob Ortiz May 13 '15 at 16:31
  • Thanks again - will try after lunch. PS - looked at your $middle = ... statement - I think $middleExp[0] should be the 1st var for substr. I also re-did the number preg_match (and will update). So I'm getting the score and state. Just need the last and first names. – user3741598 May 13 '15 at 16:40
  • str_replace("\xA0","/", $row->nodeValue); did allow me to get slashes in that I could then use to extract the first name. But there's still something invisible between the score and the last name. And get this: I found the strcspn function and ran it to find the first Capital letter - for a line '62 Smith...' (something between the 62 and S) - it returned 33! - something very strange. Thanks to everyone - I think I'm close enough to call this question resolved. – user3741598 May 13 '15 at 18:48
  • @user3741598 can you mark this question as answered if it helped you out? – Bob Ortiz Aug 03 '16 at 13:17
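
A possible explanation for the leftover invisible character described in the comments, offered as a sketch rather than a confirmed diagnosis: as far as I know, DOMDocument hands back text as UTF-8, where a non-breaking space is the two bytes \xC2\xA0. Replacing only \xA0 would then leave the \xC2 byte behind. The paragraph below is hypothetical markup standing in for one scraped cell.

```php
<?php
// Hypothetical snippet standing in for one scraped cell that uses &nbsp;.
$doc = new DOMDocument();
$doc->loadHTML('<p>62&nbsp;Smith,&nbsp;Al&nbsp;(TX)</p>');
$value = $doc->getElementsByTagName('p')->item(0)->nodeValue;

// Inspect the raw bytes right after the score to see what the "space" really is.
echo bin2hex(substr($value, 2, 2)), "\n";

// Replace the two-byte UTF-8 non-breaking space with a plain space, then split.
$clean = str_replace("\xC2\xA0", ' ', $value);
$parts = explode(' ', $clean);

print_r($parts);
```

Dumping the bytes with bin2hex is usually quicker than guessing at invisible characters one at a time.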