What regex can I use to extract URLs from a Google search?

Question

I'm using Delphi with JCLRegEx and want to capture all the result URLs from a Google search. I looked at HackingSearch.com and they have an example regex that looks right, but I cannot get any results when I try it.

I'm using it similar to:

Var re: TJclRegEx;  // the class created below is TJclRegEx
    I: Integer;
Begin
  re := TJclRegEx.Create;

  With re do try
    Compile('class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?><a href="(.+?)"><\/div><[li|\/ol]',false,false);
    If match(memo1.lines.text) then begin
      For I := 0 to captureCount -1 do
        memo2.lines.add(captures[I]);  // index with I, not a constant 1
    end;
  finally
    free;  // release the object exactly once; a later FreeAndNil(re) would free it again
  end;
end;

Regex is available at hackingsearch.com

I'm using the Delphi Jedi version, since every time I install TPerlRegEx I get a conflict between the two...

Instead of enclosing it in `code` and `pre` tags, select the code and click ctrl-k to format it (or manually indent each line with 4 spaces). And btw, don't parse html with regex, use an html parser instead. Have you seen this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Amarghosh, Jan 23 '10 at 10:13
That's php code, need delphi code. And all the delphi parsers that I've come across are not so good. I can get one to do some things, and other to do others.. RegEx seemed like a better solution, except that I'm not good with it. — Scott Tyler, Jan 23 '10 at 10:45
Nobody is good enough to parse html with regex: not even Chuck Norris. — Amarghosh, Jan 23 '10 at 11:55
I know there are plenty of people out there that can regex html... I know some of the better parsers use RegEx to do the work. — Scott Tyler, Jan 23 '10 at 23:25
@Scott: At the time of writing at least 2466 people on SO disagree with your knowledge. It may be worth it to take the accepted answer to question 1732348 (see link in first comment) into consideration. — mghie, Jan 24 '10 at 12:29
@mghie No offense, but there are a lot of people who just don't know what they are talking about... just because they don't understand regex, they say it cannot be done. It's like if someone says it's fun to jump off a bridge, then they are right? RegEx is far superior to most parsers, for the sheer power of it. And I just got this to work with a regex of: class="?r|j"?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?).+?class="?gl"?><\/div> — Scott Tyler, Jan 24 '10 at 13:02
@Scott: How would you counter http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1758162#1758162 then? — mghie, Jan 24 '10 at 13:13
@mghie, I pasted the wrong regex code above, but if you look at the code in my answer and try it, it works with flying colors: http://regexhero.net/tester/. And to counter that other argument: lack of RegEx knowledge and experience. You can make RegEx work like a parser and dig through the code several times if you wanted to. — Scott Tyler, Jan 24 '10 at 15:48
Voted to close as "not a real question" because the question **has no answer**. Scott, HTML is not a regular language. It cannot be parsed with regular expressions. Given **any** regex, I can provide you with valid HTML code that your regex cannot recognize. This isn't something that can be overcome by having more regex knowledge or experience. — Rob Kennedy, Jan 25 '10 at 18:38

score 1 · Answer 1 · answered Jan 23 '10 at 11:39

Offtopic: You can try Google AJAX Search API: http://code.google.com/apis/ajaxsearch/documentation/

DiGi


the API only gives 10 results, not enough result data. I use the API for everything but this part of the project. – Scott Tyler Jan 23 '10 at 23:05

Amarghosh · Answer 2 · 2010-02-18T09:35:02.017

Below is a relevant section from Google search results for the term python tuple. (I modified it to fit the screen here by adding new lines here and there, but I tested your regex on the raw string obtained from Google's source as revealed by Firebug). Your regex gave no matches for this string.

<li class="g w0">
  <h3 class="r">
    <a onmousedown="return rwt(this,'','','res','2','AFQjCNG5WXSP8xy6BkJFyA2Emg8JrFW2_g','&amp;sig2=4MpG_Ib3MrwYmIG6DbZjSg','0CBUQFjAB')" 
      class="l" href="http://www.korokithakis.net/tutorials/python">Learn <em>Python</em> in 10 minutes | Stavros's Stuff</a>
  </h3>
  <span style="display: inline-block;">
    <button class="w10">
    </button>
    <button class="w20">
    </button>
  </span>
  <span class="m">&nbsp;<span dir="ltr">- 2 visits</span>&nbsp;<span dir="ltr">- Jan 21</span></span>
  <div class="s">
  The data structures available in <em>python</em> are lists, <em>tuples</em>
   and dictionaries. Sets are available in the sets library (but are built-in in <em>
  Python</em> 2.5 and <b>...</b><br>
  <cite>
    www.korokithakis.net/tutorials/<b>
    python</b>
     - 
  </cite>
  <span class="gl">
    <a onmousedown="return rwt(this,'','','clnk','2','AFQjCNFVaSJCprC5enuMZ9Nt7OZ8VzDkMg','&amp;sig2=4qxw5AldSTW70S01iulYeA')" 
      href="http://74.125.153.132/search?q=cache:oeYpHokMeBAJ:www.korokithakis.net/tutorials/python+python+tuple&amp;cd=2&amp;hl=en&amp;ct=clnk&amp;client=firefox-a">
      Cached
    </a>
     - <button title="Comment" class="wci">
    </button>
    <button class="w4" title="Promote">
    </button>
    <button class="w5" title="Remove">
    </button>
  </span>
  </div>
  <div class="wce">
  </div>
  <!--n-->
  <!--m-->
</li>

FWIW, I guess one of the many reasons is that there is no <\/a> in this result at all. I copied the full html source from Firebug and tried to match it with your regex - didn't get any match at all.
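
One way to see where the pattern from the question stops matching is to test it in pieces. The sketch below uses Python's re module rather than JCL, purely so it can be run quickly; the two engines share this basic syntax, but that the behaviour carries over unchanged to JCL is an assumption. SNIPPET is a trimmed, single-line copy of the result above (the trimming and the shortened cache URL are mine, not Google's). It suggests two problems: the pattern expects `<a href=` directly after class="gl">, while the markup has `<a onmousedown=` there, and the trailing `[li|\/ol]` is a character class that matches one character, not the alternation "li or /ol".

```python
import re

# Trimmed, single-line copy of the result snippet above (hypothetical shortening).
SNIPPET = (
    '<li class="g w0"><h3 class="r">'
    '<a onmousedown="return rwt(this)" class="l" '
    'href="http://www.korokithakis.net/tutorials/python">Learn Python</a></h3>'
    '<div class="s">The data structures available in python ...<br>'
    '<cite>www.korokithakis.net/tutorials/python - </cite>'
    '<span class="gl"><a onmousedown="return rwt(this)" '
    'href="http://74.125.153.132/search?q=cache:example">Cached</a></span>'
    '</div></li>'
)

# The question's pattern, split at the point where matching breaks down.
PREFIX = (r'class="?r"?>.+?href="(.+?)".*?>(.+?)<\/a>'
          r'.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>')
TAIL = r'<a href="(.+?)"><\/div><[li|\/ol]'

prefix_match = re.search(PREFIX, SNIPPET, re.DOTALL)
full_match = re.search(PREFIX + TAIL, SNIPPET, re.DOTALL)

print(prefix_match.group(1))  # the first result URL
print(full_match)             # None: the tail never matches this markup

# [li|\/ol] is a set of single characters (l, i, |, /, o), not "li or /ol".
print(re.fullmatch(r'<[li|\/ol]', '<|') is not None)       # True
print(re.fullmatch(r'<[li|\/ol]', '<li') is not None)      # False
print(re.fullmatch(r'<(?:li|\/ol)', '</ol') is not None)   # True
```

If alternation was intended, a non-capturing group such as `(?:li|\/ol)` expresses it; whether that alone would make the pattern match current Google output is not something this sketch can show.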

Google might change the way they display the results from time to time - at a given time, it can vary depending on factors like your logged in status, web history etc. The particular regex you came up with might be working for you for now, but in the long run it will become difficult to maintain. People suggest using html parser instead of giving a regex because they know that the solution won't be stable.
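
As a rough illustration of the parser route, here is a minimal sketch using Python's standard html.parser module (Python only because it is easy to run; a Delphi project would need an HTML parser component of its own). It assumes, based only on the snippet above, that each result link is an `<a>` inside `<h3 class="r">`. The class name ResultLinkParser, the SAMPLE markup, and the example.com URL are hypothetical.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collects href values of <a> tags found inside <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.in_result_heading = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3" and "r" in classes:
            self.in_result_heading = True
        elif tag == "a" and self.in_result_heading and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_result_heading = False


# Hypothetical markup: attribute order and quoting differ between the results.
SAMPLE = (
    '<li class="g w0"><h3 class="r">'
    '<a onmousedown="return rwt(this)" class="l" '
    'href="http://www.korokithakis.net/tutorials/python">Learn Python</a></h3>'
    '<span class="gl"><a href="http://74.125.153.132/search?q=cache:example">'
    'Cached</a></span></li>'
    '<li class=g><h3 class=r>'
    '<a href="http://example.com/tuples" class=l>Tuples</a></h3></li>'
)

parser = ResultLinkParser()
parser.feed(SAMPLE)
print(parser.links)
```

The parser does not care about attribute order, quoting, or extra attributes such as onmousedown, and the "Cached" link is skipped because it sits outside the heading. It still depends on the class name "r", so it would also need updating if Google renamed that class.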

@Amarghosh: I'm fully with you regarding the topic of html parsing with regular expressions, however this is a rant and no answer, and it achieves absolutely nothing. Consider removing this, and adding a comment to the answer claiming to be the solution instead. — mghie, Jan 25 '10 at 14:35
@mghie The post was inspired largely by OP's tone in his comments. Reworded to remove the ranting. — Amarghosh, Jan 25 '10 at 14:54

score 0 · Answer 3 · answered Jan 23 '10 at 10:47

If you need to debug regular expressions in any language, look at RegexBuddy. It's not free, but it will pay for itself in a day.

Toby Allen


I will look into it again, I looked at it a while back.. Probably worth the $40. – Scott Tyler Jan 23 '10 at 23:06
I've created http://yoy.be/re to test regexes, and put them to work on large chunks of data in all kinds of shapes and forms – Stijn Sanders Jan 26 '10 at 15:59

score 0 · Accepted Answer · edited Jan 25 '10 at 18:25

class=r?>.+?href="(.+?)".*?>(.+?)<\/a>.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>

works for now.
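
For readers who want to try the pattern, here is a small sketch with Python's re module (not JCL; that both engines treat this pattern the same way is an assumption). UNQUOTED is hypothetical markup with unquoted class attributes, which is the only form `class=r?>` can match; QUOTED is the same markup with the class attributes quoted, as in the Firebug snippet from the other answer.

```python
import re

ACCEPTED = (r'class=r?>.+?href="(.+?)".*?>(.+?)<\/a>'
            r'.+?class="?s"?>(.+?)<cite>.+?class="?gl"?>')

# Hypothetical result markup with unquoted class attributes.
UNQUOTED = (
    '<h3 class=r><a href="http://www.korokithakis.net/tutorials/python" class=l>'
    'Learn Python in 10 minutes</a></h3>'
    '<div class=s>The data structures available in python ...<br>'
    '<cite>www.korokithakis.net/tutorials/python - </cite>'
    '<span class=gl>'
)
# Same markup with every class attribute quoted.
QUOTED = re.sub(r'class=(\w+)', r'class="\1"', UNQUOTED)

hit = re.search(ACCEPTED, UNQUOTED, re.DOTALL)
miss = re.search(ACCEPTED, QUOTED, re.DOTALL)

print(hit.groups())  # (url, title, description)
print(miss)          # None: class="r"> is not matched by class=r?>
```

The three groups are the URL, the link text, and the description. The second search shows how a quoting change alone is enough to stop the pattern from matching.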

edited Jan 25 '10 at 18:25 by Rob Kennedy

answered Jan 24 '10 at 13:02 by Scott Tyler

Whoever gave me the negative: thanks, I appreciate it! – Scott Tyler Jan 24 '10 at 15:43
I didn't downvote you but I guess what "earned" you negative feedback is that you didn't explain what you have done. How does this differ from your previous solution? – johnny Jan 25 '10 at 07:30
I changed the regex code to work with the current google output. – Scott Tyler Jan 28 '10 at 21:53
