I am trying to build a web scraping script that works like feed43.com. I have HTML like the following:

<div id="latest_header" onclick="getNews('79');">
                <img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
                <img src="home_images/arrow.gif">&nbsp;2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>

I write a pattern like the following:

<div id="latest_header"{*}getNews('{%}'){*}&nbsp;{%}<br>{*}.gif">{%}..</label>

The pattern follows these rules:

{*} - ignore everything
{%} - use this as a value for a variable

That is, the result should be every occurrence of the given pattern. In the above case:

{%1} - 79 {%2} - 2 DAY SEMINAR {%3} - NATIONAL SEMINAR

{%1} - 78 {%2} - 2 DAYS WORKSHOP {%3} - INTERNATIONAL WOR

I wasn't able to do this with regular expressions, and I have read in many places that they are not suitable for traversing HTML pages. I moved to simple_html_dom, but had no luck doing the above in such an easy way. At least, I couldn't find a way to reproduce this behavior with it.

The {*} and {%} variables are what feed43.com uses to build a pattern when you create a feed from a website.

Huzaib Shafi
  • I think the real issue here is you shouldn't be using regular expressions for this. Look into [DOMDocument](http://php.net/manual/en/class.domdocument.php) and [don't](http://stackoverflow.com/a/1732454/697370) [parse](http://blog.codinghorror.com/parsing-html-the-cthulhu-way/) [HTML](http://stackoverflow.com/a/590789/697370) [yourself](http://blogs.perl.org/users/kirk_kimmel/2012/08/q-when-not-to-use-regexp-a-html-parsing.html). – Jeff Lambert Sep 22 '14 at 13:03

2 Answers

Your regex is incorrect. Use proper quantifiers to ignore items, and use capturing groups for the parts of the match you want to extract:

/<div id="latest_header"(?>.*?getNews\(')(?>(.*?)'\))(?>.*?&nbsp;)(?>(.*?)<br>)(?>.*?\.gif">)(.*?)<\/label>/s

* Atomic groups are used to eliminate backtracking. Without them, this regex would spend a lot of time backtracking, which is one of the major caveats of parsing HTML with regex.

This will be your match:

MATCH 1: [Group 1: 79] [Group 2: 2 DAY SEMINAR ] [Group 3: NATIONAL SEMINAR..]
MATCH 2: [Group 1: 78] [Group 2: 2 DAYS WORKSHOP ] [Group 3: INTERNATIONAL WOR..]
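
As a quick way to check the matches above, here is a minimal sketch in Python rather than PHP. It assumes the sample HTML from the question, and it drops the atomic groups because Python's `re` module only accepts `(?>...)` from version 3.11 onward. On this input the plain lazy groups produce the same captures. `SAMPLE_HTML` and `ANSWER_REGEX` are names made up for the illustration.

```python
import re

# Sample markup copied from the question.
SAMPLE_HTML = '''<div id="latest_header" onclick="getNews('79');">
                <img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
                <img src="home_images/arrow.gif">&nbsp;2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>
'''

# The answer's regex with the atomic groups replaced by plain lazy matches.
# re.DOTALL plays the role of the /s flag, so "." also matches newlines.
ANSWER_REGEX = re.compile(
    r"""<div id="latest_header".*?getNews\('(.*?)'\).*?&nbsp;(.*?)<br>.*?\.gif">(.*?)</label>""",
    re.DOTALL,
)

matches = ANSWER_REGEX.findall(SAMPLE_HTML)
for number, groups in enumerate(matches, start=1):
    print(f"MATCH {number}: {groups}")
```

Group 2 keeps its trailing space and group 3 keeps the trailing `..`, exactly as in the listing above, because the regex does not trim or strip anything.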

Here is a regex demo.

Unihedron
  • Well, in this particular case this would be fine. But I want a general-purpose solution; that is, I want to replace {*} and {%} with appropriate rules for scraping a page. A user may submit a page URL and a rule such as {*}.gif">{%}.. and I should be able to translate it into an extraction rule dynamically in PHP. I hope that makes sense. – Huzaib Shafi Sep 22 '14 at 13:35
  • @user2047394 I'm not sure how you expect a practical implementation for that to be possible - What is it you actually need? This may easily be an [XY Problem](http://xyproblem.info), or you may end up blowing up your web server. Perhaps you may need to build your own parser for this over a regular expression. – Unihedron Sep 22 '14 at 13:37
  • I want to build a script like the one used by feed43.com. I may well need to develop a parser of my own, but everything available on the net (web scraping scripts, DOM, regex, etc.) is very close to what I want, and I am confused about how to use those tools to get the job done. All I want is a general-purpose replacement for the variables in {} described above. – Huzaib Shafi Sep 22 '14 at 13:53
  • @user2047394 Then ask for what you wanted, not a regular expression... You're in PHP, so there's not much I can do. Look into how other scripts work and fork them. – Unihedron Sep 22 '14 at 13:55
  • I did try using simple html dom in PHP, but it is not easy to simulate those general-purpose variables there. So I thought regex might be the solution, and I still believe so. Please consider the above example. Can you create a general-purpose regex for the two variables? – Huzaib Shafi Sep 22 '14 at 14:11

This might be irrelevant, but the following open-source project does what I wanted:

hFeeds

All I actually wanted was to be able to create RSS feeds for any web page, the way Feed43.com does. hFeeds works exactly like Feed43.com and is just as easy to use. The only difference is that it uses {h} in place of {%} and {i} in place of {*}. As far as I can see, it generates a regular expression from the pattern.
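
To illustrate the general idea of turning such a pattern into a regular expression, here is a minimal sketch in Python. It is not hFeeds' or Feed43's actual code. It assumes the `{*}` and `{%}` tokens from the question, and `pattern_to_regex` and `extract` are hypothetical helper names. Literal text is escaped, `{*}` becomes a lazy skip, and `{%}` becomes a lazy capturing group.

```python
import re

# Splits a pattern into literal text and the two tokens used in the question.
TOKEN = re.compile(r"(\{\*\}|\{%\})")


def pattern_to_regex(pattern):
    """Translate a Feed43-style pattern into a compiled regex (assumed rules)."""
    parts = []
    for piece in TOKEN.split(pattern):
        if piece == "{*}":
            parts.append(".*?")    # skip as little text as possible
        elif piece == "{%}":
            parts.append("(.*?)")  # capture as little text as possible
        else:
            parts.append(re.escape(piece))  # literal text, metacharacters escaped
    # DOTALL lets the skips and captures span line breaks in the HTML.
    return re.compile("".join(parts), re.DOTALL)


def extract(pattern, html):
    """Return one tuple of whitespace-trimmed captures per occurrence."""
    regex = pattern_to_regex(pattern)
    return [tuple(value.strip() for value in m.groups()) for m in regex.finditer(html)]


# Markup and pattern copied from the question.
PAGE = '''<div id="latest_header" onclick="getNews('79');">
                <img src="home_images/arrow.gif">&nbsp;2 DAY SEMINAR <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">NATIONAL SEMINAR..</label><label id="date_label">13th August 2014</label></div>
<div id="latest_header" onclick="getNews('78');">
                <img src="home_images/arrow.gif">&nbsp;2 DAYS WORKSHOP <br> <label id="news_pagedesp"><img src="home_images/li_desp.gif">INTERNATIONAL WOR..</label><label id="date_label">8th August 2014</label></div>
'''
RULE = '''<div id="latest_header"{*}getNews('{%}'){*}&nbsp;{%}<br>{*}.gif">{%}..</label>'''

rows = extract(RULE, PAGE)
for row in rows:
    print(row)
```

The same translation should carry over to PHP with `preg_quote` for the literal parts and `preg_match_all` for the matching. Since the generated regex relies on lazy quantifiers, a rule with many `{*}` tokens can backtrack heavily on large pages, so in a service that accepts rules from users it would be prudent to limit input size or matching time.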

But thanks, all, for your answers.

Huzaib Shafi