I want to extract an HTML table with HtmlAgilityPack. Because the website I want to extract data from puts the name, address, postal code and city in a single string, I used
string nawhtml = cols[0].InnerHtml;
to get the HTML, and now I want to use a regex to split out the name, street, postal code and place name and put them in separate strings in C#. The HTML I am getting from HtmlAgilityPack is this:
<b>Name</b><br>
Street<br>
Postalcode Placename<br>
This is the code I have written so far:
Regex match1 = new Regex(@"<b>\s*(.+?)\s*</b><br>");
Match naamtankstation = match1.Match(nawhtml);
Console.WriteLine("Naam : " + naamtankstation.Groups[1].Value);
Regex match2 = new Regex(@"</b><br>\s*(.+?)\s*<br>");
Match straattankstation = match2.Match(nawhtml);
Console.WriteLine("Straat : " + straattankstation.Groups[1].Value);
Regex match3 = new Regex(@"<br>{2,}\s*(.+?)\s*<br>");
Match postcodetankstation = match3.Match(nawhtml);
Console.WriteLine(postcodetankstation.Groups[1].Value);
But the last regex doesn't work, and it is not the only pattern I have tried.
How can I write a regex that gives me the postal code and the place name in separate strings?
For reference, this is the complete program I have written:
using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.Data;
using System.Net;
using System.Text.RegularExpressions;

namespace AutoApp_Win32Server
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("APP.\n\n");
            Console.WriteLine("APP.");
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc1 = web.Load("http://brandstofprijzen.info/?postcode=&plaats=8801&afstand=25&brandstof=Diesel&zoeken=Zoeken");
            HtmlNodeCollection tables = doc1.DocumentNode.SelectNodes("/html/body/center/table");
            HtmlNodeCollection rows = tables[13].SelectNodes(".//tr");
            string makeSpace = " ";
            for (int i = 1; i < rows.Count; ++i)
            {
                HtmlNodeCollection cols = rows[i].SelectNodes(".//td");
                string nawhtml = cols[0].InnerHtml;
                string brandstof = cols[1].InnerText;
                string prijs = cols[2].InnerText;
                string datum = cols[3].InnerText;
                Regex match1 = new Regex(@"<b>\s*(.+?)\s*</b><br>");
                Match naamtankstation = match1.Match(nawhtml);
                Console.WriteLine("Naam : " + naamtankstation.Groups[1].Value);
                Regex match2 = new Regex(@"</b><br>\s*(.+?)\s*<br>");
                Match straattankstation = match2.Match(nawhtml);
                Console.WriteLine("Straat : " + straattankstation.Groups[1].Value);
                Regex match3 = new Regex(@"<br>{2,}\s*(.+?)\s*<br>");
                Match postcodetankstation = match3.Match(nawhtml);
                Console.WriteLine("Postcode : " + postcodetankstation.Groups[1].Value);
                // Console.WriteLine("naw : " + nawhtml);
                Console.WriteLine("Brandstof : " + brandstof);
                Console.WriteLine("Prijs : " + prijs);
                Console.WriteLine("Datum : " + datum);
                Console.WriteLine(makeSpace);
                Console.WriteLine(makeSpace);
            }
            Console.ReadKey();
        }
    }
}
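One likely reason the third pattern never matches: in `<br>{2,}` the quantifier `{2,}` applies only to the `>` character right before it, so the pattern looks for `<br` followed by two or more `>`, not for the second `<br>`. A possible direction, as a minimal sketch rather than a definitive fix, is to split the cell on `<br>` first and only then pull the postal code apart from the place name. The names `AddressParser` and `Parse` are hypothetical, the tuple return type needs C# 7 or later, and the postal code pattern assumes the Dutch format of four digits followed by two letters, optionally separated by a space.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AddressParser
{
    public static (string Name, string Street, string PostalCode, string Place) Parse(string html)
    {
        // Split on <br>, <br/> or <br />, strip any remaining tags such as <b>,
        // trim whitespace and drop the empty pieces.
        string[] parts = Regex.Split(html, @"<br\s*/?>", RegexOptions.IgnoreCase)
            .Select(p => Regex.Replace(p, "<[^>]+>", "").Trim())
            .Where(p => p.Length > 0)
            .ToArray();

        if (parts.Length < 3)
            throw new FormatException("Expected a name, a street and a postal code line.");

        // Assumption: Dutch postal code, e.g. "8601GA" or "8601 GA", then the place name.
        Match m = Regex.Match(parts[2], @"^(\d{4}\s?[A-Za-z]{2})\s+(.+)$");

        // If the line does not look like that, return it whole as the postal code.
        string postalCode = m.Success ? m.Groups[1].Value : parts[2];
        string place = m.Success ? m.Groups[2].Value : "";

        return (parts[0], parts[1], postalCode, place);
    }
}
```

Inside the loop this would be called as `var naw = AddressParser.Parse(nawhtml);`, after which `naw.PostalCode` and `naw.Place` hold the two separate strings. Splitting first keeps each regex short and avoids depending on how many `<br>` tags come before the last line.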
Kanaalstraat 22
8601GA SNEEK
This is going well. But the regex match is not good. – Bogdan van der Tol Jan 23 '15 at 20:19
Postalcode Placename
` with an html parser? Read the question carefully. – EZI Jan 23 '15 at 20:38