
I know how to make a VB program go to Google, and I even know how to navigate around, but I don't know how to manipulate the results.

Basically I want the program to grab search results from Google and output them to a listbox. So if the user searches for burgers, then the search results would be output to a listbox. Does anyone know how to do this?

Here's my code so far:

Public Class Form1

    ' Search term and link filter entered by the user.
    Dim look, retrieve As String

    Private Sub Search_Click(sender As Object, e As EventArgs) Handles Search.Click
        look = InputBox("What are you looking for?")
        ' Join the words with "+" so the term can go into the query string.
        look = look.Replace(" ", "+")
        Dim G1 As String = "http://www.google.co.uk/#hl=en&tbo=d&output=search&sclient=psy-ab&q="
        WebBrowser1.Navigate(G1 & look)

        retrieve = InputBox("What links do you want to retrieve?")

    End Sub

End Class

I know it is easier to use the Google API, but it is also a lot slower. I've used the API in the past and have seen performance issues. I've just seen in another thread how to download a website's source pretty quickly. I just don't know how to grab the URLs from the downloaded source. Is anyone here any good with string manipulation?

Code so far:

sourcecode = New Net.WebClient().DownloadString(G1 & look)
Gergo Erdosi
Santa

1 Answer


If you look into XPath and are not averse to using open-source third-party tools, the HTML Agility Pack (Code Examples) is supposed to be a great tool for parsing HTML.
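As a rough illustration of the parser route, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The module and function names are made up for this example, and the XPath expression assumes results are links inside `<h3 class="r">` elements, which may not match Google's current markup.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module AgilityPackSketch
    ' Hypothetical helper: returns "title - url" for every link found
    ' inside an <h3 class="r"> element of the downloaded source.
    Function ExtractResults(html As String) As List(Of String)
        Dim results As New List(Of String)

        ' Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' Assumed XPath; adjust it to whatever the page actually contains.
        Dim links As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//h3[@class='r']/a")

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        If links Is Nothing Then Return results

        For Each link As HtmlNode In links
            Dim url As String = link.GetAttributeValue("href", String.Empty)
            results.Add(HtmlEntity.DeEntitize(link.InnerText) & " - " & url)
        Next
        Return results
    End Function
End Module
```

Each returned string could then be added to the listbox with something like `ListBox1.Items.Add(item)`.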

Another option, which can be a pain, is to convert the source HTML string into a valid XML document and then parse it using VB's XML namespace. I have done this in an application I use to parse YouTube playlists. The issue with this approach is that the HTML string needs a bit of manual cleaning before you can turn it into an XML document.
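A minimal sketch of the XML route, assuming the markup has already been cleaned into a well-formed fragment (the cleaning step itself is not shown, and `XDocument.Parse` throws an `XmlException` on anything that is not well formed). The module and function names are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module XmlParseSketch
    ' Hypothetical helper: returns "text - url" for every <a href="..."> element
    ' in a fragment that is already well-formed XML.
    Function ExtractLinks(wellFormedFragment As String) As List(Of String)
        Dim doc As XDocument = XDocument.Parse(wellFormedFragment)

        Return doc.Descendants("a").
            Where(Function(a) a.Attribute("href") IsNot Nothing).
            Select(Function(a) a.Value & " - " & a.Attribute("href").Value).
            ToList()
    End Function
End Module
```

For example, `ExtractLinks("<ul><li><a href=""http://example.com"">Example</a></li></ul>")` yields a single entry, `Example - http://example.com`.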

Lastly, you could try to digest the HTML string using string methods only. However, this is going to be error-prone and will again depend heavily on the structure of the document.

No matter which method you use to parse the HTML, Google search results currently contain a div with the ID 'search'. From a purely string standpoint, you could look for it in your source string like this:

Dim searchTerm As String = "<div id=""search"""
' IndexOf returns -1 if the marker is not present in the source.
Dim searchLoc As Integer = sourceCode.IndexOf(searchTerm)

Once you know where the search results section starts, you can search first for "<li class=""g""" tokens and then for "<h3 class=""r""" tokens inside those. The result text is inside the h3. Consume up to the first </h3> and </li> respectively to get the tokens.
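The scanning described above could be sketched roughly as follows. This simplified version skips the intermediate `<li class="g"` step and collects the `<h3 class="r"` blocks directly; the module and function names are made up, and the markers are only as reliable as the page structure they assume.

```vbnet
Imports System
Imports System.Collections.Generic

Module StringScanSketch
    ' Hypothetical helper: returns the raw markup of each <h3 class="r">...</h3>
    ' block that appears after the <div id="search" marker.
    Function ExtractHeadingBlocks(sourceCode As String) As List(Of String)
        Dim blocks As New List(Of String)
        Const openTag As String = "<h3 class=""r"""
        Const closeTag As String = "</h3>"

        ' Start scanning where the results section begins.
        Dim pos As Integer = sourceCode.IndexOf("<div id=""search""", StringComparison.OrdinalIgnoreCase)
        If pos < 0 Then Return blocks

        Do
            Dim startIdx As Integer = sourceCode.IndexOf(openTag, pos, StringComparison.OrdinalIgnoreCase)
            If startIdx < 0 Then Exit Do

            Dim endIdx As Integer = sourceCode.IndexOf(closeTag, startIdx, StringComparison.OrdinalIgnoreCase)
            If endIdx < 0 Then Exit Do

            ' Keep everything from the opening tag through the closing </h3>.
            blocks.Add(sourceCode.Substring(startIdx, endIdx + closeTag.Length - startIdx))
            pos = endIdx + closeTag.Length
        Loop

        Return blocks
    End Function
End Module
```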

Once you have this text, you need to sanitize it by searching through it and removing the HTML tags. You could easily write an algorithm that consumes just the link text by looping through the indexes of key characters.
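One possible shape for that sanitizing step is a small character loop that drops everything between `<` and `>`. This is a sketch with made-up names, and it assumes the markup encodes literal angle brackets in text as entities (`&lt;`, `&gt;`).

```vbnet
Imports System.Net
Imports System.Text

Module TagStripSketch
    ' Hypothetical helper: removes tags and decodes entities such as &amp;.
    Function StripTags(markup As String) As String
        Dim sb As New StringBuilder()
        Dim insideTag As Boolean = False

        For Each c As Char In markup
            If c = "<"c Then
                insideTag = True        ' a tag starts: stop copying
            ElseIf c = ">"c Then
                insideTag = False       ' the tag ends: resume copying
            ElseIf Not insideTag Then
                sb.Append(c)
            End If
        Next

        Return WebUtility.HtmlDecode(sb.ToString())
    End Function
End Module
```

For example, `StripTags("<h3 class=""r""><a href=""http://example.com"">Best <b>burgers</b> &amp; fries</a></h3>")` returns `Best burgers & fries`.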

The whole point is to break it down into smaller pieces incrementally and then digest the smaller pieces. No matter how you approach it, you are going to be doing this. However, using a parser of some kind and the power of XPath selector expressions would make it much easier than generating the tokens manually.

The pure string way is going to be the most difficult and also the slowest way to accomplish this. I would highly recommend finding a way to do it with some form of HTML parser; otherwise you may go mad before you get a working solution.

As a final note, it looks like you are using a WebBrowser control on your form. You can use this control and its related classes to parse the HTML of the pages it retrieves. I have done this before; it is not the most efficient way of scraping the web, but it can be very easy. Look into the HtmlDocument class for methods that work with the objects this control returns.
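A minimal sketch of the WebBrowser approach, assuming the form already has the `WebBrowser1` control from the question plus a listbox named `ListBox1` (the listbox name is an assumption). The handler belongs inside the existing `Form1` class. It only sees what is in the document when `DocumentCompleted` fires, so results that the page loads by script afterwards would be missed.

```vbnet
Imports System.Windows.Forms

Public Class Form1
    ' Assumes WebBrowser1 and ListBox1 are declared by the form designer.
    Private Sub WebBrowser1_DocumentCompleted(sender As Object, e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        ListBox1.Items.Clear()

        ' Walk every <h3> on the page and list the links inside it.
        For Each heading As HtmlElement In WebBrowser1.Document.GetElementsByTagName("h3")
            For Each link As HtmlElement In heading.GetElementsByTagName("a")
                ListBox1.Items.Add(link.InnerText & " - " & link.GetAttribute("href"))
            Next
        Next
    End Sub
End Class
```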

Pow-Ian