
Google has added a nice feature that shows instant info about famous people. For example, when you search for "Barack Obama" you get a bio and a photo on the results page, so you may not have to visit any of the results to get that info.

Live sample : http://goo.gl/vf1ti3

What I'm trying to do is get the URL of the image on the left side of the instant info box. I want to extract it from the HTML code using System.Text.RegularExpressions.Regex.

I can get the source of the results page with this code:

private void getInfoAboutCelebrities()
{
    try
    {
        string celebrityName = null;

        Dispatcher.Invoke((Action)delegate()
        {
            DisableUI();
            celebrityName = celebrityName_textBox.Text;
        });

        celebrityName = HttpUtility.UrlEncode(celebrityName);
        string queryURL = "http://www.google.com/search?q=" + celebrityName + "+Height&safe=active&oq=" + celebrityName + "+Height&gs_l=heirloom-serp.12...0.0.0.3140.0.0.0.0.0.0.0.0..0.0....0...1ac..24.heirloom-serp..0.0.0.hXJwfydNFhk";

        HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(queryURL);
        request.ContentType = "application/x-www-form-urlencoded";
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0";
        request.Method = "GET";
        // make request for web page
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        StreamReader htmlSource = new StreamReader(response.GetResponseStream());

        string htmlStringSource = string.Empty;
        htmlStringSource = htmlSource.ReadToEnd();
        response.Close();

        // Extracting height
        var regex = new Regex(@"<span class=""kno-a-v"">(.*?)</span>");
        var match = regex.Match(htmlStringSource);
        var result = match.Groups[1].Value;

        ///////////////////////////////////////////////////////////
        // Extracting photo (this is the part I couldn't get working)
        regex = new Regex(@"data:image/jpeg;base64(.*?)\x3d\x3d");
        match = regex.Match(htmlStringSource);
        ///////////////////////////////////////////////////////////

        result = HttpUtility.HtmlDecode(result);

        if (String.IsNullOrWhiteSpace(result))
            MessageBox.Show("Sorry, no such entry.", "Error", MessageBoxButton.OK, MessageBoxImage.Error);
        else
        {
            Dispatcher.Invoke((Action)delegate()
            {
                preloader_Image.Visibility = Visibility.Hidden;
                MessageBox.Show(result);
            });
        }
        Dispatcher.Invoke((Action)EnableUI);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message, "Error");
    }
}

Can anyone tell me what regular expression I should use? (I can't even find the URL myself when viewing the source code!)

John Saunders
Alaa Salah
  • You'll get the classic answer: don't use Regex to parse HTML (http://stackoverflow.com/a/1732454/932418); use [HtmlAgilityPack](http://htmlagilitypack.codeplex.com/) instead. – I4V Aug 09 '13 at 01:46
  • They always say the same, but I don't need to do a lot of stuff with the HTML code, so Regex would be the best choice for me. – Alaa Salah Aug 09 '13 at 01:51
  • But you can't get your best choice to work, so why not give it a try? The learning curve is not very steep. – I4V Aug 09 '13 at 02:01
  • If I could extract both (the height I currently extract using Regex, and the URL of the image) using **HtmlAgilityPack**, that would be great! But I couldn't figure out how to work with it. And that image seems impossible to get: when I view the source, I can't find its URL there. – Alaa Salah Aug 09 '13 at 02:56

1 Answer


It's quite likely that the image URL isn't even in the HTML that you get back. There's a whole lot of Javascript on that page. The page is intended to be viewed in a browser, which can run the Javascript and download images, format the page, etc. There's no guarantee that the information displayed is available in the HTML.

I suspect, however, that the image you're looking for is the embedded, base64-encoded image near the end of the file. Search for imgthumb13 and you'll find it. You can probably convert that to binary and then decode the image, provided you know the image format (I don't).
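The base64 idea above can be sketched as follows. This is a minimal sketch, not a tested solution for Google's current markup: it assumes the image is embedded as a `data:image/...;base64,` URI (the form the question's own regex looks for), and that the trailing `=` padding may appear as the literal text `\x3d` because the data sits inside a script string. The class and method names (`EmbeddedImageSketch`, `TryExtractImage`) are hypothetical. If the assumption holds, the MIME type in the URI prefix also tells you the image format.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmbeddedImageSketch
{
    // Matches a data URI such as data:image/jpeg;base64,/9j/4AAQ...
    // The payload may end with real "=" padding or with the literal text \x3d,
    // which is how "=" can appear inside a JavaScript string (an assumption
    // about the page, not something guaranteed by Google).
    private static readonly Regex DataUri = new Regex(
        @"data:image/(?<type>[a-zA-Z0-9.+-]+);base64,(?<data>(?:[A-Za-z0-9+/]|\\x3d|=)+)");

    // Returns the decoded image bytes, or null when no embedded image is found.
    // Convert.FromBase64String throws FormatException if the payload is malformed.
    public static byte[] TryExtractImage(string html, out string imageType)
    {
        imageType = null;
        Match match = DataUri.Match(html);
        if (!match.Success)
            return null;

        imageType = match.Groups["type"].Value;
        // Turn the escaped padding back into real "=" characters before decoding.
        string base64 = match.Groups["data"].Value.Replace(@"\x3d", "=");
        return Convert.FromBase64String(base64);
    }
}
```

In the question's code this would be called with `htmlStringSource`, and the returned bytes could be written out with `File.WriteAllBytes` or loaded into an image control. It only finds the first embedded image on the page, which may or may not be the one in the info box.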

Google's results pages are not at all designed to be read by bots or scrapers. And in fact Google frowns on you using a scraper to read their results pages. If they determine that you're using a scraper on their pages, they'll block you. If you want to process Google search results, then you should be using the Google Search API.

Also see "Any form of Google Search API available for C#?".
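As a rough illustration of the API route, the sketch below builds a request URL for Google's Custom Search JSON API with image search enabled. This assumes that API is the one you end up using; the answer above does not name a specific one. `apiKey` and `engineId` are placeholders for credentials you would create yourself, and the class and method names are hypothetical.

```csharp
using System;

public static class ImageSearchSketch
{
    // Builds a Custom Search JSON API request URL that asks for image results.
    // apiKey and engineId are placeholders for your own API key and search engine ID.
    public static string BuildImageQueryUrl(string apiKey, string engineId, string query)
    {
        return "https://www.googleapis.com/customsearch/v1"
            + "?key=" + Uri.EscapeDataString(apiKey)
            + "&cx=" + Uri.EscapeDataString(engineId)
            + "&q=" + Uri.EscapeDataString(query)
            + "&searchType=image";
    }
}
```

The response is JSON rather than HTML, so no regex is needed: each entry in its `items` array carries a `link` field with the image URL. Unlike the scraped results page, that format is documented, although the results are not guaranteed to match the photo shown in the info box.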

One other thing: Google is continually changing the format of their search results pages. Even when the pages look the same, the internal structure can be very different. You'll find that the code you write to scrape today's search results pages is likely to break next month. I learned that one the hard way.

Jim Mischel
  • +1 for "You'll find that the code you write to scrape today's search results pages is likely to break next month. I learned that one the hard way." – NoName May 04 '15 at 05:02