I want to extract an HTML table with HtmlAgilityPack. Because the website I want to extract data from puts the name, address, postal code and city in the same string, I used

string nawhtml = cols[0].InnerHtml;

to get the HTML, and now I want to use a regex to separate the name, street, postal code and place name and put them in separate strings in C#. The HTML I am getting from HtmlAgilityPack is this:

<b>Name</b><br>
Street<br>
Postalcode Placename<br>

This is the code written already:

Regex match1 = new Regex(@"<b>\s*(.+?)\s*</b><br>");
Match naamtankstation = match1.Match(nawhtml);
Console.WriteLine("Naam         : " + naamtankstation.Groups[1].Value);


Regex match2 = new Regex(@"</b><br>\s*(.+?)\s*<br>");
Match straattankstation = match2.Match(nawhtml);
Console.WriteLine("Straat       : " + straattankstation.Groups[1].Value);

Regex match3 = new Regex(@"<br>{2,}\s*(.+?)\s*<br>");
Match postcodetankstation = match3.Match(nawhtml);
Console.WriteLine(postcodetankstation.Groups[1].Value);

But the last regex doesn't work. This is not the only thing I tried.

How can I write a regex that captures the postal code and the place name in separate strings?

For reference, this is the complete code I have written:

using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.Data;
using System.Net;
using System.Text.RegularExpressions;

namespace AutoApp_Win32Server
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("APP.\n\n");
            Console.WriteLine("APP.");

            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc1 = web.Load("http://brandstofprijzen.info/?postcode=&plaats=8801&afstand=25&brandstof=Diesel&zoeken=Zoeken");

            HtmlNodeCollection tables = doc1.DocumentNode.SelectNodes("/html/body/center/table");
            HtmlNodeCollection rows = tables[13].SelectNodes(".//tr");

            string makeSpace = " ";

            for (int i = 1; i < rows.Count; ++i)
            {
                HtmlNodeCollection cols = rows[i].SelectNodes(".//td");

                string nawhtml = cols[0].InnerHtml;
                string brandstof = cols[1].InnerText;
                string prijs = cols[2].InnerText;
                string datum = cols[3].InnerText;

                Regex match1 = new Regex(@"<b>\s*(.+?)\s*</b><br>");
                Match naamtankstation = match1.Match(nawhtml);
                Console.WriteLine("Naam         : " + naamtankstation.Groups[1].Value);


                Regex match2 = new Regex(@"</b><br>\s*(.+?)\s*<br>");
                Match straattankstation = match2.Match(nawhtml);
                Console.WriteLine("Straat       : " + straattankstation.Groups[1].Value);

                Regex match3 = new Regex(@"<br>{2,}\s*(.+?)\s*<br>");
                Match postcodetankstation = match3.Match(nawhtml);
                Console.WriteLine("Postcode     : " + postcodetankstation.Groups[1].Value);

             //   Console.WriteLine("naw          : " + nawhtml);


                Console.WriteLine("Brandstof    : " + brandstof);
                Console.WriteLine("Prijs        : " + prijs);
                Console.WriteLine("Datum        : " + datum);
                Console.WriteLine(makeSpace);

                Console.WriteLine(makeSpace);
            }

            Console.ReadKey();

        }
    }
}
djv

2 Answers

0

You can try this:

<br>([\w]+) ([\w]+)<br>
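
A rough sketch of how that pattern could be used from C#. In the sample from the question each <br> is followed by a line break, so the variant below adds \s* around the two groups; that addition is an assumption about the input and not part of the pattern above. The names PostcodeSketch, Split and the "Plaats" label are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class PostcodeSketch
{
    // Variant of the pattern above; the \s* parts are an assumption so that
    // line breaks after <br> in the sample input do not block the match.
    static readonly Regex PostcodePlace = new Regex(@"<br>\s*(\w+) (\w+)\s*<br>");

    // Returns (postal code, place name), or two empty strings when nothing matches.
    public static (string Postcode, string Place) Split(string nawhtml)
    {
        Match m = PostcodePlace.Match(nawhtml);
        return m.Success ? (m.Groups[1].Value, m.Groups[2].Value) : ("", "");
    }
}

class Program
{
    static void Main()
    {
        // Sample cell content as shown in the question.
        string nawhtml = "<b>Name</b><br>\nStreet<br>\nPostalcode Placename<br>\n";
        var (postcode, place) = PostcodeSketch.Split(nawhtml);
        Console.WriteLine("Postcode     : " + postcode);
        Console.WriteLine("Plaats       : " + place);
    }
}
```

Note that a street line consisting of exactly two words would match this pattern first, so on real data the match probably needs to be anchored to the last line of the cell.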
0

Your regex doesn't work because of the lazy quantifier (?); it forces your evaluation to skip the spaces between Postalcode and Placename.
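
Independent of the lazy quantifier, it may be worth checking how {2,} binds in the third pattern from the question. In .NET regular expressions a quantifier applies to the element directly before it, so <br>{2,} asks for "<br" followed by two or more ">" characters, not for two or more <br> tags. A small check, with QuantifierSketch as a hypothetical helper name:

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierSketch
{
    // True when the pattern matches anywhere in the input.
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }
}

class Program
{
    static void Main()
    {
        // {2,} binds to the single character before it, here the '>'.
        Console.WriteLine(QuantifierSketch.Matches("<br><br>", @"<br>{2,}"));      // False
        Console.WriteLine(QuantifierSketch.Matches("<br>>", @"<br>{2,}"));         // True

        // Grouping makes the quantifier apply to the whole tag.
        Console.WriteLine(QuantifierSketch.Matches("<br><br>", @"(?:<br>){2,}"));  // True
    }
}
```

In the sample HTML there is only a single <br> between the street and the postal code line, so even the grouped form would not match there; the sketch only illustrates what the quantifier applies to.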

Try simply using <br>\s(.+)<br>. However, this will also match Street, so you may want to tweak your code. AFAIK HtmlAgilityPack splits along line breaks, so if the format is always the same you could try to select your fields by index instead.
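
The select-by-index idea can be sketched without relying on how HtmlAgilityPack exposes the child nodes. The version below swaps that in for a plain Regex.Split on the <br> tags of the InnerHtml string, which is a different technique from walking the parsed nodes. NawSketch, Lines and SplitPostcodeLine are hypothetical names, and the code assumes the cell always has the three-line layout from the question and that the postal code itself contains no space.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NawSketch
{
    // Splits the cell HTML on <br> (also <br/> and <br />), strips any
    // remaining tags, and returns the non-empty lines in order.
    public static string[] Lines(string nawhtml)
    {
        string[] parts = Regex.Split(nawhtml, @"<br\s*/?>", RegexOptions.IgnoreCase);
        var lines = new List<string>();
        foreach (string part in parts)
        {
            string text = Regex.Replace(part, "<[^>]+>", "").Trim();
            if (text.Length > 0) lines.Add(text);
        }
        return lines.ToArray();
    }

    // Assumption: the line is "<postal code> <place name>" and the postal
    // code has no space in it, so the first space is the separator.
    public static (string Postcode, string Place) SplitPostcodeLine(string line)
    {
        int space = line.IndexOf(' ');
        return space < 0
            ? (line, "")
            : (line.Substring(0, space), line.Substring(space + 1).Trim());
    }
}

class Program
{
    static void Main()
    {
        string nawhtml = "<b>Name</b><br>\nStreet<br>\nPostalcode Placename<br>\n";
        string[] lines = NawSketch.Lines(nawhtml);

        // Index 0 = name, 1 = street, 2 = postal code and place name.
        var (postcode, place) = NawSketch.SplitPostcodeLine(lines[2]);
        Console.WriteLine("Naam         : " + lines[0]);
        Console.WriteLine("Straat       : " + lines[1]);
        Console.WriteLine("Postcode     : " + postcode);
        Console.WriteLine("Plaats       : " + place);
    }
}
```

If the site writes postal codes with a space inside them (Dutch postal codes are commonly written like "1234 AB"), the split position in SplitPostcodeLine would need adjusting, for example by splitting on the second space instead of the first.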

samy