0

I'm using Delphi with the JCL regex class (TJclRegEx) and want to capture all the result URLs from a Google search. I looked at HackingSearch.com, which has an example regex that looks right, but I get no results when I try it.

I'm using it like this:

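var
  re: TJclRegEx;  // the class created below is TJclRegEx, not JVCLRegEx
  I: Integer;
begin
  re := TJclRegEx.Create;
  try
    re.Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]', False, False);
    if re.Match(Memo1.Lines.Text) then
    begin
      // Captures[0] is the whole match; the groups follow it
      for I := 0 to re.CaptureCount - 1 do
        Memo2.Lines.Add(re.Captures[I]);  // was Captures[1], which added the same group every time
    end;
  finally
    re.Free;  // free once; the original also called FreeAndNil afterwards (double free)
  end;
end;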

Regex is available at hackingsearch.com

I'm using the Delphi Jedi (JCL) version because every time I install TPerlRegEx, I get a conflict between the two...

Rob Kennedy
Scott Tyler
  • Instead of enclosing it in `code` and `pre` tags, select the code and press Ctrl-K to format it (or manually indent each line with 4 spaces). And by the way, don't parse HTML with regex; use an HTML parser instead. Have you seen this? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Amarghosh Jan 23 '10 at 10:13
  • That's PHP code; I need Delphi code. And all the Delphi parsers that I've come across are not so good. I can get one to do some things, and another to do others. RegEx seemed like a better solution, except that I'm not good with it. – Scott Tyler Jan 23 '10 at 10:45
  • Nobody is good enough to parse HTML with regex: not even Chuck Norris. – Amarghosh Jan 23 '10 at 11:55
  • I know there are plenty of people out there that can regex html... I know some of the better parsers use RegEx to do the work. – Scott Tyler Jan 23 '10 at 23:25
  • @Scott: At the time of writing at least 2466 people on SO disagree with your knowledge. It may be worth it to take the accepted answer to question 1732348 (see link in first comment) into consideration. – mghie Jan 24 '10 at 12:29
  • @mghie No offense, but there are a lot of people who just don't know what they are talking about... just because they don't understand regex, they say it cannot be done. It's like if someone says it's fun to jump off a bridge, then they are right? RegEx is far superior to most parsers, for the sheer power of it. And I just got this to work with a regex of: class="?r|j"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?).+?class="?gl"?><\/div> – Scott Tyler Jan 24 '10 at 13:02
  • @Scott: How would you counter http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1758162#1758162 then? – mghie Jan 24 '10 at 13:13
  • @mghie, I pasted the wrong regex code above, but if you look at the code in my answer and try it, it works with flying colors at http://regexhero.net/tester/. As for countering that other argument: lack of RegEx knowledge and experience. And you can make RegEx work like a parser and dig through the code several times if you wanted to. – Scott Tyler Jan 24 '10 at 15:48
  • Voted to close as "not a real question" because the question **has no answer**. Scott, HTML is not a regular language. It cannot be parsed with regular expressions. Given **any** regex, I can provide you with valid HTML code that your regex cannot recognize. This isn't something that can be overcome by having more regex knowledge or experience. – Rob Kennedy Jan 25 '10 at 18:38

4 Answers

1

Off-topic: You can try the Google AJAX Search API: http://code.google.com/apis/ajaxsearch/documentation/

DiGi
  • The API only gives 10 results, which is not enough result data. I use the API for everything but this part of the project. – Scott Tyler Jan 23 '10 at 23:05
1

Below is a relevant section from Google search results for the term python tuple. (I modified it to fit the screen here by adding new lines here and there, but I tested your regex on the raw string obtained from Google's source as revealed by Firebug). Your regex gave no matches for this string.

<li class="g w0">
  <h3 class="r">
    <a onmousedown="return rwt(this,'','','res','2','AFQjCNG5WXSP8xy6BkJFyA2Emg8JrFW2_g','&amp;sig2=4MpG_Ib3MrwYmIG6DbZjSg','0CBUQFjAB')" 
      class="l" href="http://www.korokithakis.net/tutorials/python">Learn <em>Python</em> in 10 minutes | Stavros's Stuff</a>
  </h3>
  <span style="display: inline-block;">
    <button class="w10">
    </button>
    <button class="w20">
    </button>
  </span>
  <span class="m">&nbsp;<span dir="ltr">- 2 visits</span>&nbsp;<span dir="ltr">- Jan 21</span></span>
  <div class="s">
  The data structures available in <em>python</em> are lists, <em>tuples</em>
   and dictionaries. Sets are available in the sets library (but are built-in in <em>
  Python</em> 2.5 and <b>...</b><br>
  <cite>
    www.korokithakis.net/tutorials/<b>
    python</b>
     - 
  </cite>
  <span class="gl">
    <a onmousedown="return rwt(this,'','','clnk','2','AFQjCNFVaSJCprC5enuMZ9Nt7OZ8VzDkMg','&amp;sig2=4qxw5AldSTW70S01iulYeA')" 
      href="http://74.125.153.132/search?q=cache:oeYpHokMeBAJ:www.korokithakis.net/tutorials/python+python+tuple&amp;cd=2&amp;hl=en&amp;ct=clnk&amp;client=firefox-a">
      Cached
    </a>
     - <button title="Comment" class="wci">
    </button>
    <button class="w4" title="Promote">
    </button>
    <button class="w5" title="Remove">
    </button>
  </span>
  </div>
  <div class="wce">
  </div>
  <!--n-->
  <!--m-->
</li>

FWIW, I guess one of the many reasons is that there is no <\/a> in this result at all. I copied the full HTML source from Firebug and tried to match it with your regex, but didn't get any match at all.

Google might change the way they display the results from time to time, and at any given time the markup can vary depending on factors like your logged-in status, web history, etc. The particular regex you came up with might be working for you for now, but in the long run it will become difficult to maintain. People suggest using an HTML parser instead of offering a regex because they know that a regex solution won't be stable.

Amarghosh
  • @Amarghosh: I'm fully with you regarding the topic of html parsing with regular expressions, however this is a rant and no answer, and it achieves absolutely nothing. Consider removing this, and adding a comment to the answer claiming to be the solution instead. – mghie Jan 25 '10 at 14:35
  • @mghie The post was inspired largely by OP's tone in his comments. Reworded to remove the ranting. – Amarghosh Jan 25 '10 at 14:54
  • Thanks, much more constructive, +1. – mghie Jan 25 '10 at 15:11
0

If you need to debug regular expressions in any language, you need to look at RegexBuddy. It's not free, but it will pay for itself in a day.

Toby Allen
0
class=r?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>

works for now.
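
Note that a single call to Match only finds the first hit, so getting all the result URLs means calling it repeatedly. Below is a minimal sketch of that loop, not a tested implementation. It assumes that JclPCRE's Match accepts a 1-based start offset as its second argument, that roDotAll is the option that lets "." cross line breaks (Memo1.Lines.Text contains CR/LF pairs), and that CaptureRanges[0].LastPos is a zero-based, inclusive position. Check all three against your JCL version. CollectFirstGroups and the sample HTML are made up for illustration.

```delphi
program CollectAllMatches;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes,
  JclPCRE; // JCL unit that declares TJclRegEx

// Hypothetical helper (not from the thread): appends capture group 1 of
// every match of Pattern in Html to Results.
procedure CollectFirstGroups(const Pattern, Html: string; Results: TStrings);
var
  Re: TJclRegEx;
  Offset: Integer;
begin
  Re := TJclRegEx.Create;
  try
    // Assumption: roDotAll maps to PCRE's DOTALL flag, so "." also
    // matches the line breaks inside the page source.
    Re.Options := Re.Options + [roDotAll];
    Re.Compile(Pattern, False, False);

    Offset := 1; // assumption: Match expects a 1-based start offset
    // The length check keeps the offset inside the subject string.
    while (Offset <= Length(Html)) and Re.Match(Html, Offset) do
    begin
      // Captures[0] is the whole match, so group 1 exists only when
      // CaptureCount is at least 2.
      if Re.CaptureCount > 1 then
        Results.Add(Re.Captures[1]);
      // Assumption: LastPos is zero-based and inclusive, so the next
      // 1-based offset is LastPos + 2.
      Offset := Re.CaptureRanges[0].LastPos + 2;
    end;
  finally
    Re.Free;
  end;
end;

var
  Urls: TStringList;
begin
  Urls := TStringList.Create;
  try
    // Illustrative input only; real Google markup is far messier.
    CollectFirstGroups('href="(.+?)"',
      '<a href="http://a.example/">A</a>'#13#10 +
      '<a href="http://b.example/">B</a>',
      Urls);
    Writeln(Urls.Text);
  finally
    Urls.Free;
  end;
end.
```

In the form from the question, the call would be something like CollectFirstGroups(ThePattern, Memo1.Lines.Text, Memo2.Lines). This does not make the pattern any more robust against changes in Google's markup.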

Rob Kennedy
Scott Tyler