Parsing a website with regular expressions

Question

I have an assignment to parse http://www.olx.in/cars-cat-378 and extract the cars, locations, and prices using regular expressions. I have seen many posts saying that regular expressions are not suitable for parsing web pages, but I have to use them this time. I tried the approach shown below, but it does not work.

<?php

 /**
 * Initialize the cURL session
 */
 $ch = curl_init();


 /**
 * Set the URL of the page or file to download.
 */
 curl_setopt($ch, CURLOPT_URL, 'http://www.olx.in/cars-cat-378');

 /**
 * Ask cURL to return the contents in a variable instead of simply echoing them to  the browser.
 */
 curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

 /**
 * Execute the cURL session
 */
 $contents = curl_exec($ch);
 /**
 * Pattern intended to match one list item and capture its fields
 */
 $reg='/<div class="li .*?"><div class="row clearfix"><div class="c-1 table-cell"><div class="cropit">.*?<\/div><\/div><div class="second-column-container  table-cell"><h3><a .*?>(.*?)<\/a><\/h3><div class="c-4"><span>(.*?)<\/span> - <span>(.*?)<\/span> - <span>(.*?)<\/span> - <span>(.*?)<\/span><\/div><span class="itemlistinginfo clearfix"><a .*?>(.*?)<\/a><\/span><div .*?><\/div><\/div><div class="third-column-container table-cell">(.*?)<\/div><div class="fourth-column-container table-cell">(.*?)<\/div><\/div><\/div>/';

 preg_match($reg,$contents,$result);

 var_dump($result);

 /**
 * Close cURL session
 */
 curl_close ($ch);





?>

The HTML for each list item on the page is as follows:

<div class="li even">
       <div class="row clearfix">
           <div class="c-1 table-cell">
                <div class="cropit">
                    <a class="pics-lnk" href="http://newdelhi.olx.in/honda-prelude-2-door-sports-car-for-sale-iid-437128570">
                        <img src="http://images04.olx-st.com/ui/14/85/70/t_1347220402_437128570_4.jpg" width="111"
                            alt="HONDA PRELUDE,,2 DOOR ,,SPORTS CAR FOR SALE." title="HONDA PRELUDE,,2 DOOR ,,SPORTS CAR FOR SALE. - India"
                            height="83" style="margin-top:0px;" />
                    </a>
                </div>
            </div>
            <div class="second-column-container  table-cell">
        <h3>
        <a href="http://newdelhi.olx.in/honda-prelude-2-door-sports-car-for-sale-iid-437128570"  title="HONDA PRELUDE,,2 DOOR ,,SPORTS CAR FOR SALE. - India">
        HONDA PRELUDE,,2 DOOR ,,SPORTS CAR FOR SALE.</a>
        </h3>


        <div class="c-4">
        <span>Year: 1996</span> - <span>Make: Honda</span> - <span>Model: Prelude</span> - <span>66,400.00 km</span>    </div>
        <span class="itemlistinginfo clearfix">
        <a href="http://newdelhi.olx.in/cars-cat-378">Cars - Delhi</a>    </span>

        <div style="display:none;" class="fbfriends_loadme" id="fbfriends_loadme_437128570" rel="5656149"></div>

            </div>            
            <div class="third-column-container table-cell">
                                    à¤° 2,65,000.00                              </div>
            <div class="fourth-column-container table-cell">
                                    Yesterday, 15:53                            </div>            
        </div>
    </div>

The regular expression I used is:

/<div class="li .*?"><div class="row clearfix"><div class="c-1 table-cell"><div class="cropit">.*?<\/div><\/div><div class="second-column-container  table-cell"><h3><a .*?>(.*?)<\/a><\/h3><div class="c-4"><span>(.*?)<\/span> - <span>(.*?)<\/span> - <span>(.*?)<\/span> - <span>(.*?)<\/span><\/div><span class="itemlistinginfo clearfix"><a .*?>(.*?)<\/a><\/span><div .*?><\/div><\/div><div class="third-column-container table-cell">(.*?)<\/div><div class="fourth-column-container table-cell">(.*?)<\/div><\/div><\/div>/

Wouldn't it be easier, for your time and ours, if you used [DOMDocument](http://uk.php.net/DOMDocument)? — Mihai Iorga, Sep 10 '12 at 09:21
Did your assignment include using regex? If so, tell whoever gave it to you to go back to school. If not, then don't use regex. — Aleks G, Sep 10 '12 at 09:22
[You can't parse (X)HTML with regex](http://stackoverflow.com/a/1732454/569101) — j0k, Sep 10 '12 at 09:22
Telling the OP not to use Regex when the assignment is to use regex is not very constructive. @AleksG - This is of no help to the OP - would you tell your professor to go back to school (assuming you don't want to fail the course)? — Oded, Sep 10 '12 at 09:24
@j0k - Of course you can. For a _known_ structure it is _fine_. The problem is with _unknown_ or HTML from many different sources. — Oded, Sep 10 '12 at 09:25
@Oded Something tells me this is not a school assignment. Any decent professor in school wouldn't give students assignments that are against all reasonable real-life practices. — Aleks G, Sep 10 '12 at 09:25
@AleksG - Possibly. But if your lead dev demands you use regex for this and you are new? What if your job is on the line? — Oded, Sep 10 '12 at 09:27
@AleksG citing: "I have an assignment to parse ... using regular expression." There's no way around it. — John Dvorak, Sep 10 '12 at 09:28
@Oded, the page contains a list of items with a similar structure, so I think it can be matched with a regular expression. Would you please check the regular expression I tried to use? — , Sep 10 '12 at 09:30
I don't know PHP and the specific regex dialect it uses. Sorry. I was just trying to explain to the dogmatic commenters that the question is legitimate. — Oded, Sep 10 '12 at 09:33
@Oded The regex language is largely universal. The differences are in what is or is not supported. A regex construct always means the same thing in every flavor that supports it. This particular regex contains no flavor-specific syntax. — John Dvorak, Sep 10 '12 at 09:45
Go to regexpal.com or a similar site, paste your HTML in one box and your regular expression in the other, modify the regexp, and get instant feedback. Repeat until you find the right regexp. — Prasanth, Sep 10 '12 at 09:48
@JanDvorak - Most of the basic constructs are the same, yes. But some dialects don't conform to things like whitespace escapes and capturing parens. I don't know the flavor so can't comment. — Oded, Sep 10 '12 at 09:50
I am sorry if my question was not proper, which may be why it was closed. But I have solved my problem. Jan Dvorak's answer helped me a lot. I would like to post my solution to this question if it is reopened. preg_match( '#
]*>[\s\S]*?(.*?)[\s\S]*?
#i',$d, $match ); echo $match[0]; preg_match( '#\s*(.*?)\s*#i', $d, $match2 ); echo $match2[1]; — , Sep 12 '12 at 05:33

score 1 · Accepted Answer · answered Sep 10 '12 at 09:44

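The problem is that your pattern expects the tags to follow one another directly, but the source you are parsing has whitespace (newlines and indentation) between them, so the pattern never matches. You should sprinkle \s*? between the tags wherever whitespace may appear.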
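As a rough illustration of the point above, here is a minimal sketch comparing a strict pattern with a whitespace-tolerant one. The HTML fragment is a shortened, hypothetical version of the list item from the question, and the variable names are made up for this example. It uses the greedy \s* rather than \s*?; both match this input, since the whitespace has to be consumed either way.

```php
<?php
// Hypothetical, shortened fragment modelled on the list item in the question.
$html = <<<'HTML'
<div class="c-4">
    <span>Year: 1996</span> - <span>Make: Honda</span>    </div>
HTML;

// Strict pattern: assumes the tags touch each other, as the original regex does.
$strict = '/<div class="c-4"><span>(.*?)<\/span> - <span>(.*?)<\/span><\/div>/';

// Tolerant pattern: \s* allows any run of whitespace, newlines included, between tags.
$tolerant = '/<div class="c-4">\s*<span>(.*?)<\/span>\s*-\s*<span>(.*?)<\/span>\s*<\/div>/';

$strictHit   = preg_match($strict, $html, $m1);
$tolerantHit = preg_match($tolerant, $html, $m2);

var_dump($strictHit);   // no match: a newline sits between <div ...> and <span>
var_dump($tolerantHit); // match
print_r($m2);           // $m2[1] and $m2[2] hold the two span contents
```

The same idea would apply to every tag boundary in the full pattern, which is why the original regex fails on the very first `"><div` boundary.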

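The same applies to your <a .*?>(.*?)<\/a> block. The dot (.) matches a space character, but not a newline, and in your HTML the link text starts on a new line. Use <a .*?>\s*?(.*?)\s*?<\/a> instead. Likewise, whenever you need to skip a large block that spans several lines, .*? won't do. Use [\s\S]*? (any whitespace or non-whitespace character) instead.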
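The behaviour of the dot can be checked with a small sketch like the one below. The input string is a hypothetical fragment with the link text on its own line, as in the question's HTML; the URL and variable names are placeholders, not taken from the real page.

```php
<?php
// Hypothetical fragment: the link text starts on a new line after the <a ...> tag.
$html = "<h3>\n<a href=\"http://example.com/item\" title=\"t\">\n        HONDA PRELUDE</a>\n</h3>";

// The dot does not match a newline, so (.*?) cannot reach the text on the next line.
$dotOnly = preg_match('/<a .*?>(.*?)<\/a>/', $html, $a);

// The lazy form from the answer matches, but the capture may keep the indentation.
$lazy = preg_match('/<a .*?>\s*?(.*?)\s*?<\/a>/', $html, $b);

// A greedy \s* consumes the newline and indentation before the capture starts.
$greedy = preg_match('/<a .*?>\s*(.*?)\s*<\/a>/', $html, $c);

// [\s\S]*? matches any character, newlines included, so it can skip a multi-line block.
$skipBlock = preg_match('/<h3>[\s\S]*?<\/h3>/', $html, $d);

var_dump($dotOnly, $lazy, $greedy, $skipBlock);
var_dump(trim($b[1]), $c[1]);
```

If the lazy \s*? form is used, calling trim() on the captured value is a simple way to drop any leftover indentation. PCRE's s modifier (for example /.../s) is another way to let the dot match newlines, if that suits the pattern better than [\s\S].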

Third, you are using preg_match, which stops after the first match and so returns only one list item. To get every item on the page, use preg_match_all.
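A minimal sketch of the difference, assuming a hypothetical fragment with two price cells (the second price is invented for the example). PREG_SET_ORDER is used so that each element of the result is one complete match with its capture groups.

```php
<?php
// Hypothetical fragment with two list items reduced to their price cells.
$html = <<<'HTML'
<div class="third-column-container table-cell">
    2,65,000.00   </div>
<div class="third-column-container table-cell">
    1,20,000.00   </div>
HTML;

$pattern = '/<div class="third-column-container table-cell">\s*(.*?)\s*<\/div>/';

// preg_match stops after the first match.
$first = [];
preg_match($pattern, $html, $first);

// preg_match_all collects every match; it returns the number of matches found.
$all = [];
$count = preg_match_all($pattern, $html, $all, PREG_SET_ORDER);

// Pull the first capture group (the price text) out of each match.
$prices = array_map(function ($m) { return $m[1]; }, $all);

var_dump($first[1]);
var_dump($count);
print_r($prices);
```

With the default flag (PREG_PATTERN_ORDER) the result is grouped by capture group instead, so `$all[1]` would already be the list of prices; which layout is more convenient depends on how many groups the full pattern has.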
