
Below is the HTML from which I want to extract some data.

<div class="NS_projects__stats">
    <div class="digits_4" id="stats">
        <div class="row">
            <div class="col col-12 mb1 stat-item">
                <div class="num h1 bold" data-backers-count="107" id="backers_count">
                    <data class="Project1135352094" data-format="number" data-value="107" itemprop="Project[backers_count]">107</data>
                </div>
                <span class="bold h5">backers</span>
            </div>
            <div class="col col-12 mb1 stat-item">
                <div class="num h1 bold nowrap" data-goal="8000.0" data-percent-raised="0.909875" data-pledged="7279.0" id="pledged">
                    <data class="Project1135352094" data-currency="EUR" data-format="shorter_money" data-precision="0" data-value="7279.0" data-without_code="true" itemprop="Project[pledged]">€7,279</data>
                    <span class="money eur project_currency_code"></span>
                </div>
                <span class="bold h5">
                    pledged of <span class="money eur no-code">€8,000</span>
                    <span class="mobile-hide">goal</span>
                </span>
            </div>
            <span data-duration="30.041666666666668" data-end_time="2015-11-27T14:32:42-05:00" data-hours-remaining="566.7967307435142" id="project_duration_data"></span>
            <div class="col col-12 stat-item">
                <div class="num h1 bold">23</div>
                <span class="text bold h5">days to go</span>
            </div>
        </div>
    </div>
</div>

From the HTML above I need to extract the following data:

  • 107 backers
  • €7,279 pledged of €8,000 goal
  • 23 days to go

I successfully scraped the first item but have not been able to fetch the second and third. Below is my PHP code (using cURL) that fetches the first one.

$html = get($url); // get() uses cURL and returns the page's HTML
$pattern = "/<div class=\"num h1 bold\"(.*?)<\/div>/s";
preg_match($pattern, $html, $match);
// Re-wrap the captured fragment so strip_tags() sees a complete tag
$match[1] = "<div".$match[1]."</div>";
return strip_tags($match[1]);
NickNo
Brainy Prb

3 Answers

$pattern = "/<div class=\"num h1 bold\"(.*?)<\/div>/s";
$pattern2 = "/<div class=\"col col-12 mb1 stat-item\"(.*?)<\/div>/s";
$pattern3 = "/<div class=\"col col-12 stat-item\"(.*?)<\/div>/s";
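In the question's HTML, the "pledged" div carries an extra `nowrap` class, so a pattern ending in `bold\"` cannot match it, and `preg_match` stops after the first hit anyway. A minimal sketch of one way around both points, assuming the markup keeps the shape shown in the question (the sample string below is trimmed from it, and the loosened pattern is an illustration, not one of the patterns above):

```php
<?php
// Markup trimmed from the question; in practice this would be the result of get($url).
$html = <<<'HTML'
<div class="num h1 bold" data-backers-count="107" id="backers_count">
    <data data-value="107">107</data>
</div>
<div class="num h1 bold nowrap" data-goal="8000.0" data-pledged="7279.0" id="pledged">
    <data data-currency="EUR" data-value="7279.0">€7,279</data>
    <span class="money eur project_currency_code"></span>
</div>
<div class="num h1 bold">23</div>
HTML;

// [^"]* tolerates extra classes such as "nowrap"; [^>]* skips the remaining attributes.
$pattern = '/<div class="num h1 bold[^"]*"[^>]*>(.*?)<\/div>/s';

// preg_match_all collects every match; preg_match would return only the first.
preg_match_all($pattern, $html, $matches);

// Strip inner tags and surrounding whitespace from each captured fragment.
$values = array_map(function ($fragment) {
    return trim(strip_tags($fragment));
}, $matches[1]);

print_r($values); // backers, pledged amount, days to go, in document order
```

This still depends on the exact class string, so it will break if the page's markup changes.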
Krishna Gupta

Try this,

function rip_tags($string) { 

    // ----- remove HTML TAGs ----- 
    $string = preg_replace ('/<[^>]*>/', ' ', $string); 

    // ----- remove control characters ----- 
    $string = str_replace("\r", '', $string);    // --- remove carriage returns
    $string = str_replace("\n", ' ', $string);   // --- replace with space
    $string = str_replace("\t", ' ', $string);   // --- replace with space

    // ----- remove multiple spaces ----- 
    $string = trim(preg_replace('/ {2,}/', ' ', $string));

    return $string; 

}

$html = get($url); //get function uses CURL and gets html data
echo rip_tags($html);

Result: 107 backers €7,279 pledged of €8,000 goal 23 days to go
It can be modified further as required. For reference, please check here

Ravneet
  • Thanks a lot, I never thought of it this way. There is still a problem, though: € is automatically being converted to $. Instead of "€7,279 pledged of €8,000 goal" I get "$8,051 pledged of $8,850 goal". Should I declare a character encoding or something like that? – Brainy Prb Nov 04 '15 at 05:45
  • Is there any other code on the page? If so, can you please share it? – Ravneet Nov 04 '15 at 06:16
  • I am fetching content from this link: https://www.kickstarter.com/projects/35540661/new-colors-59-stainless-milanaise-loop-for-apple-w – Brainy Prb Nov 04 '15 at 06:24
  • Please check whether the HTML returned from curl contains the euro symbol or the $ symbol. – Ravneet Nov 04 '15 at 06:43
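The check suggested in the last comment could be sketched as follows. `currency_markers()` is a hypothetical helper, not part of any library; it assumes the page exposes `data-currency` attributes as in the question's HTML and that both the script and the response are UTF-8.

```php
<?php
// Hypothetical helper: reports which currency markers appear in the fetched markup.
function currency_markers(string $html): array
{
    $codes = [];
    // Collect every ISO-style code found in data-currency attributes.
    if (preg_match_all('/data-currency="([A-Z]{3})"/', $html, $m)) {
        $codes = array_values(array_unique($m[1]));
    }

    return [
        'codes'             => $codes,
        // A currency symbol directly followed by a digit, e.g. "€7" or "$8".
        'has_euro_amount'   => preg_match('/€\d/u', $html) === 1,
        'has_dollar_amount' => preg_match('/\$\d/', $html) === 1,
    ];
}

// Sample taken from the question; in practice pass the string returned by get($url).
$sample = '<data data-currency="EUR" data-value="7279.0">€7,279</data>';
$report = currency_markers($sample);
print_r($report);
```

If the fetched HTML already reports dollar amounts, the difference is in what the server sent, not in the tag-stripping step.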

I'd suggest parsing the HTML string as HTML rather than matching it with regular expressions.

You can use DOMDocument::loadHTML (http://php.net/manual/en/domdocument.loadhtml.php),

or a third-party parser. (I have used http://simplehtmldom.sourceforge.net before, and it worked well.)
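A minimal sketch of the DOMDocument route, assuming the markup from the question and a UTF-8 response. The prepended `<meta>` tag is a common way to tell libxml the encoding; `node_text()` is a hypothetical helper introduced only for this example, and the XPath queries are tied to the ids and classes shown in the question.

```php
<?php
// Markup trimmed from the question; in practice this would be the result of get($url).
$html = <<<'HTML'
<div class="row">
    <div class="col col-12 mb1 stat-item">
        <div class="num h1 bold" data-backers-count="107" id="backers_count">
            <data data-value="107">107</data>
        </div>
        <span class="bold h5">backers</span>
    </div>
    <div class="col col-12 mb1 stat-item">
        <div class="num h1 bold nowrap" data-goal="8000.0" data-pledged="7279.0" id="pledged">
            <data data-currency="EUR" data-value="7279.0">€7,279</data>
            <span class="money eur project_currency_code"></span>
        </div>
        <span class="bold h5">
            pledged of <span class="money eur no-code">€8,000</span>
            <span class="mobile-hide">goal</span>
        </span>
    </div>
    <div class="col col-12 stat-item">
        <div class="num h1 bold">23</div>
        <span class="text bold h5">days to go</span>
    </div>
</div>
HTML;

// Collect parser warnings internally (e.g. for the HTML5 <data> tag) instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Assumption: the input is UTF-8; the meta tag tells the parser so.
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of the first node matched by $query.
function node_text(DOMXPath $xpath, string $query): string
{
    return trim($xpath->evaluate("string($query)"));
}

$backers = node_text($xpath, '//div[@id="backers_count"]');
$pledged = node_text($xpath, '//div[@id="pledged"]');
$goal    = node_text($xpath, '//span[contains(@class, "no-code")]');
$days    = node_text($xpath, '//span[normalize-space(.)="days to go"]/preceding-sibling::div[1]');

echo "$backers backers\n";
echo "$pledged pledged of $goal goal\n";
echo "$days days to go\n";
```

Querying each value by id or class avoids depending on how the parser treats whitespace between tags, which a plain tag-stripping approach is sensitive to.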

Tomer W