I am trying to scrape data (name, varietal, format and price) from https://aabalat.com/wine/country/france. I have an array named `$urls`, and I push every page link into it. Each new cURL session returns 20 more wines. First I need to capture the format and push it into an array, as you can see in my code below. Printing `$french_wines_formats_matches` works, but printing `$french_wines_format_array` does not work properly.

I am new to scraping and do not have much experience with it.

    // Array contains 197 links
    $urls = array();
    array_push($urls, "https://aabalat.com/wine/country/france");

    // This for loop makes others links
    for ($i = 1; $i < 5; $i++)
    {
        $urls[] = "https://aabalat.com/wine/country/france?page=" . $i;
    }

    // echo "<pre>";
    // print_r($urls);
    // echo "</pre>";

    $french_wines_array = array();
    $french_wines_title_array = array();
    $french_wines_varietal_array = array();
    $french_wines_format_array = array();
    $french_wines_price_array = array();

    // Repeat curl session until url exists.
    foreach ($urls as $url)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);

        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_VERBOSE, true);

        $output = curl_exec($curl);
        $info = curl_getinfo($curl);
        $err = curl_error($curl);
        $ern = curl_errno($curl);

        $french_wine_formats_pattern = '!<span class="wine-list-item-format">(.*)</span>!mi';
        preg_match_all($french_wine_formats_pattern, $output, $french_wines_formats_matches);

        foreach ($french_wines_formats_matches[0] as $french_wines_formats_match)
        {
            $french_wines_format_array[] = $french_wines_formats_match;
        }

        echo "<pre>";
        print_r($french_wines_format_array);
        echo "</pre>";

        curl_close($curl);
        sleep(rand(2, 5));
    }
  • [Don't use regular expressions for parsing HTML](https://stackoverflow.com/a/1732454/5407848); try [the "simple_html_dom" library](https://stackoverflow.com/a/9813422/5407848) instead. I have used it and liked it. – Accountant م Mar 04 '19 at 20:03
  • you should be more specific than `not working very well.` - what did you expect, and what did you get instead? – hanshenrik Mar 04 '19 at 23:50
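
The parser-based approach suggested in the first comment can be sketched with PHP's built-in `DOMDocument` and `DOMXPath` classes, swapped in here for the third-party simple_html_dom library so that the example has no dependencies. The markup in `$html` is a hypothetical stand-in for the real page; only the class name `wine-list-item-format` comes from the regex in the question, and the variable name `$formats` is made up for illustration.

```php
<?php
// Hypothetical markup: only the class name is taken from the question's regex.
$html = '<ul>'
      . '<li><span class="wine-list-item-format">750 ml</span></li>'
      . '<li><span class="wine-list-item-format">1.5 L</span></li>'
      . '</ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

// Select every <span> whose class attribute is exactly "wine-list-item-format".
$xpath = new DOMXPath($dom);
$formats = array();
foreach ($xpath->query('//span[@class="wine-list-item-format"]') as $node) {
    $formats[] = trim($node->textContent);   // text only, without the surrounding tags
}

print_r($formats);
```

With a real page, `$html` would be the string returned by `curl_exec()`. The XPath expression assumes the class attribute contains nothing but that one class name; if the site adds more classes to the element, the expression would need `contains()` instead.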

1 Answer

Your code and regex seem to work (I tried them), but I was unable to replicate your cURL call. Try the following instead of a bare `$output = curl_exec($curl)` and see whether it reports any cURL errors:

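    // curl_exec() returns false on failure when CURLOPT_RETURNTRANSFER is set
    if (($output = curl_exec($curl)) === false) {
        // Use the same handle as the rest of the code ($curl, not $ch)
        die(curl_error($curl));
    }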

Finally, I tried a simple file_get_contents() and that seemed to work:

    $url = "https://aabalat.com/wine/country/france";
    $output = file_get_contents($url);
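
To check the matching step independently of the network call, the pattern can be run against a hard-coded string. The sketch below is a minimal example under stated assumptions: the contents of `$output` are hypothetical (the real page markup may differ), and the pattern is a non-greedy variant of the one in the question. It shows what `preg_match_all` puts in each slot: index 0 holds the full matches including the `<span>` tags, and index 1 holds only the captured text. Whether that difference is what makes the result look wrong in the question is not confirmed.

```php
<?php
// Hypothetical sample; stands in for the HTML returned by cURL.
$output = '<span class="wine-list-item-format">750 ml</span>' . "\n"
        . '<span class="wine-list-item-format">1.5 L</span>';

// Non-greedy (.*?) so that a match stops at the first closing </span>.
$pattern = '!<span class="wine-list-item-format">(.*?)</span>!i';
preg_match_all($pattern, $output, $matches);

// $matches[0]: full matches, tags included.
// $matches[1]: only the text captured by the parentheses.
$french_wines_format_array = array();
$french_wines_format_array = array_merge($french_wines_format_array, $matches[1]);

// htmlspecialchars() stops a browser from interpreting the tags in the dump.
echo htmlspecialchars(print_r($matches[0], true));
print_r($french_wines_format_array);
```

Appending with `array_merge()` (or the `foreach` from the question) inside the URL loop accumulates results across pages, so printing the array once after the loop avoids dumping the growing array on every iteration.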
dearsina
  • The array I need to fill is `$french_wines_format_array`. When I call `var_dump($french_wines_formats_matches);` on the result of `preg_match_all`, it works. When I try to push the data into `$french_wines_format_array` like this: `foreach($french_wines_formats_matches[0] as $french_wines_formats_match) { $french_wines_format_array[] = $french_wines_formats_match; }` it isn't working very well. I don't know how to solve this problem. – Boban Mladenovic Mar 04 '19 at 22:49
  • When you say isn't working very well, what kind of result do you get? – dearsina Mar 05 '19 at 03:56