
I'm trying to parse the data returned by a request and add the resulting links to a ListBox. Here is the HTML I'm trying to split:

<div class="rc" data-hveid="411"><h3 class="r"><a href="http://google.com/" onmousedown="return rwt
<div class="rc" data-hveid="48"><h3 class="r"><a href="http://google2.com/" onmousedown="return rwt

This is just an example; there are many more lines like these.

Here is my code. It runs, but the output is not correct.

Dim request As System.Net.HttpWebRequest = System.Net.HttpWebRequest.Create("https://www.google.ro/search?q=Google")
Dim response As System.Net.HttpWebResponse = request.GetResponse
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
Dim rssourcecode As String = sr.ReadToEnd
Dim pp As String = rssourcecode
Dim strRegex As String = "><a href="".*"""
Dim myRegex As New Regex(strRegex, RegexOptions.None)
For Each myMatch As Match In myRegex.Matches(pp)
    If myMatch.Success Then
        ListBox1.Items.Add(myMatch.Value.Split("""").GetValue(1))
    End If
Next

This is the output: http://prntscr.com/9u000g/direct

Help me, please! I just want to get the first 5-6 website links that Google shows on the first page.

Example: https://www.google.com/search?q=Google

Output:

  1. https://www.google.com/

  2. https://www.facebook.com/Google/

  3. https://www.youtube.com/user/Google

  4. https://twitter.com/google

  5. https://google.com/about/careers/

qckmini6
  • It would help if you add in your question what exactly are you trying to parse! You show your code, the data that you want to parse BUT not what you want from it. – Jorge Campos Jan 24 '16 at 00:19
  • The links from Google Search Results. - https://www.google.com/search?q=Google – qckmini6 Jan 24 '16 at 00:26
  • The href value in the search results is a tracking url within google.com which later redirects you to the actual page. You want to look for the green text in the results which actually contains the link. – svart Jan 24 '16 at 00:41
  • Why not use an HTML parser (e.g. [HTML Agility](https://www.nuget.org/packages/HtmlAgilityPack))? Also, have a look [at this](http://stackoverflow.com/a/1732454/4302070) – trashr0x Jan 25 '16 at 17:24
  • I hate HTML Agility Pack. I prefer Regex. I don't know RegEx at all, but I prefer it over almost anything. Anyway, Youssef Victor solved my problem with a short regex code. Thank you guys for your help! – qckmini6 Jan 25 '16 at 17:44

1 Answer


As I understand it, you want to get every link in the variable rssourcecode, which means anything between (href=") and (").

Try using the following code:

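' Requires Imports System.Text.RegularExpressions at the top of the file.
' Download the HTML of the search results page.
Dim request As System.Net.HttpWebRequest = DirectCast(System.Net.WebRequest.Create("https://www.google.ro/search?q=Google"), System.Net.HttpWebRequest)
Dim response As System.Net.HttpWebResponse = DirectCast(request.GetResponse(), System.Net.HttpWebResponse)
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(response.GetResponseStream())
Dim rssourcecode As String = sr.ReadToEnd()

' Lazy match (.*?): capture everything between href=" and the next closing quote.
Dim MC As MatchCollection = Regex.Matches(rssourcecode, "href=""(.*?)""")
For i As Integer = 0 To MC.Count - 1
    MsgBox(MC(i).Groups(1).Value)
Next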
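
The pattern above relies on the lazy quantifier (.*?). As a rough illustration of how it differs from the greedy (.*), here is a small sketch that runs on a made-up one-line sample rather than on real Google output; the sample string and the module name are assumptions for demonstration only.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyVersusGreedySketch
    Sub Main()
        ' Made-up one-line sample containing two links (not real search output).
        Dim sample As String = "<a href=""http://a.example/"" id=""x""><a href=""http://b.example/"" id=""y"">"

        ' Greedy: .* runs to the last quote on the line, so there is one long match.
        Console.WriteLine(Regex.Matches(sample, "href=""(.*)""").Count)

        ' Lazy: .*? stops at the first closing quote, so each href is matched separately.
        For Each m As Match In Regex.Matches(sample, "href=""(.*?)""")
            Console.WriteLine(m.Groups(1).Value)
        Next
    End Sub
End Module
```

On this sample the greedy pattern yields a single match, while the lazy pattern yields http://a.example/ and http://b.example/ as separate captures.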

Edit: You can use this pattern to get anything between (/url?q=) and (&amp)

/url\?q=(.*?)&amp

There is a \ before the ? because "?" is a special regex character. You can escape a special character by putting \ before it.
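
To tie this back to the original goal of listing only the first few results, here is a minimal sketch that applies the pattern and stops after a given number of links. The helper name ExtractResultLinks, the sample markup, and the URL-decoding step are assumptions for illustration; the real page markup may differ and can change at any time.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module ExtractLinksSketch
    ' Hypothetical helper: returns up to maxLinks values found between "/url?q=" and "&amp".
    Function ExtractResultLinks(html As String, maxLinks As Integer) As List(Of String)
        Dim links As New List(Of String)
        For Each m As Match In Regex.Matches(html, "/url\?q=(.*?)&amp")
            If links.Count >= maxLinks Then Exit For
            ' Assumption: the captured value may be percent-encoded, so decode it.
            links.Add(Uri.UnescapeDataString(m.Groups(1).Value))
        Next
        Return links
    End Function

    Sub Main()
        ' Made-up sample markup standing in for rssourcecode.
        Dim sample As String =
            "<a href=""/url?q=https://www.google.com/&amp;sa=U"">Google</a>" &
            "<a href=""/url?q=https://twitter.com/google&amp;sa=U"">Twitter</a>"
        For Each link As String In ExtractResultLinks(sample, 5)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

In the form from the question, the loop body could call ListBox1.Items.Add(link) instead of Console.WriteLine, passing rssourcecode as the first argument.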

Youssef13