I'm using the code below to search a web page for data and load the results into a DataGridView.

When I use it on a web page that has many rows (around 100), it sometimes returns a malformed value like this: CaucaiaCE

when it should be just Caucaia.

Why does this happen in only 2 rows out of 100?

This is the HTML I'm searching: http://pastie.org/8220836

{
    int i = 0;
    Match matchLogradouro = Regex.Match(pagina, "<td width=\"268\" style=\"padding: 2px\">(.*)</td>");
    Match matchBairroCidade = Regex.Match(pagina, "<td width=\"140\" style=\"padding: 2px\">(.*)</td>");
    Match matchEstado = Regex.Match(pagina, "<td width=\"25\" style=\"padding: 2px\">([A-Z]{2})</td>");
    Match matchCep = Regex.Match(pagina, "<td width=\"65\" style=\"padding: 2px\">(.*)</td>");
    int z = Regex.Matches(pagina, "detalharCep").Count;
    while (z > i -1)
    {    
        dataGridView1.Rows.Add(matchLogradouro.Groups[1].Value);
        matchLogradouro = matchLogradouro.NextMatch();
        dataGridView1.Rows[i].Cells[1].Value = matchBairroCidade.Groups[1].Value;
        matchBairroCidade = matchBairroCidade.NextMatch();
        dataGridView1.Rows[i].Cells[2].Value = matchBairroCidade.Groups[1].Value;
        matchBairroCidade = matchBairroCidade.NextMatch();
        dataGridView1.Rows[i].Cells[3].Value = matchEstado.Groups[1].Value;
        matchEstado = matchEstado.NextMatch();

        dataGridView1.Rows[i].Cells[4].Value = matchCep.Groups[1].Value;
        matchCep = matchCep.NextMatch();
        i++;
    }
}
Sergey Berezovskiy
Aman
  • I see a lot of greedy operators... try using `(.*?)` instead, in each of your regexes. – Jerry Aug 09 '13 at 06:11
  • [Don't use Regex to parse HTML!](http://stackoverflow.com/a/1732454/1336590) – Corak Aug 09 '13 at 06:11
  • Damn @Corak, I was just about to write the same thing! ^_^ In that case, to summarize... Have you tried using an XML parser instead? :D – MBender Aug 09 '13 at 06:13
  • @Shaamaan - this. Or the [Html Agility Pack](http://htmlagilitypack.codeplex.com/) – Corak Aug 09 '13 at 06:15
  • Using `(.*?)` solved the problem, but I decided to stop being lazy and learn how to use the Html Agility Pack. Also, the example from lazyberezovsky helped a lot! – Aman Aug 09 '13 at 23:21
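To illustrate the greedy versus lazy point from the comments, here is a minimal sketch. The input string is hypothetical (two cells on one line, simplified from the patterns in the question), and `FirstCapture` is a helper name made up for this example. It assumes the malformed rows come from adjacent `<td>` elements sharing a line, which the linked page would need to confirm.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns capture group 1 of the first match,
    // or an empty string when the pattern does not match.
    public static string FirstCapture(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    static void Main()
    {
        // Hypothetical input: two cells on the same line.
        string html = "<td width=\"140\">Caucaia</td><td width=\"25\">CE</td>";

        // Greedy (.*) runs to the LAST </td> on the line, swallowing the next cell.
        string greedy = FirstCapture(html, "<td width=\"140\">(.*)</td>");

        // Lazy (.*?) stops at the FIRST </td> after the opening tag.
        string lazy = FirstCapture(html, "<td width=\"140\">(.*?)</td>");

        Console.WriteLine(greedy); // Caucaia</td><td width="25">CE
        Console.WriteLine(lazy);   // Caucaia
    }
}
```

Because `.` does not match a newline by default, the greedy pattern only misbehaves when more than one closing `</td>` appears on the same line, which would be consistent with only a few rows out of 100 being affected.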

1 Answer

Create a class like this (sorry, I don't know Portuguese, so I can't tell exactly what kind of data your class should hold):

public class Foo // I believe it should be something like Address
{
    public string Logradouro { get; set; }
    public string BairroCidade1 { get; set; }
    public string BairroCidade2 { get; set; }
    public string Estado { get; set; } // this should be State
    public string Cep { get; set; }
}

Then use HtmlAgilityPack to parse your HTML document:

// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(html_file_name); // or doc.LoadHtml(html_string)

var foos = from row in doc.DocumentNode.SelectNodes("//tr[td]")
           let cells = row.SelectNodes("td").Select(td => td.InnerText).ToArray()
           where cells.Length > 4
           select new Foo {
               Logradouro = cells[0],
               BairroCidade1 = cells[1],
               BairroCidade2 = cells[2],
               Estado = cells[3],
               Cep = cells[4]
           };
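As a self-contained variant of the query above, the following sketch parses an inline HTML string and guards against `SelectNodes` returning `null`, which HtmlAgilityPack does when the XPath matches nothing. The class name `RowParsingDemo`, the method `ParseRows`, and the sample address row are all made up for illustration; the only requirement is the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

static class RowParsingDemo
{
    // Hypothetical helper: returns the trimmed cell texts of every table row
    // that has at least five <td> cells.
    public static List<string[]> ParseRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string[]>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr[td]");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            // "//tr[td]" guarantees at least one <td> child, so this is non-null.
            string[] cells = row.SelectNodes("td")
                                .Select(td => td.InnerText.Trim())
                                .ToArray();
            if (cells.Length > 4) result.Add(cells);
        }
        return result;
    }

    static void Main()
    {
        // Made-up sample row, not taken from the page in the question.
        string html = "<table><tr><td>Rua A</td><td>Centro</td><td>Caucaia</td>"
                    + "<td>CE</td><td>00000-000</td></tr></table>";

        foreach (string[] cells in ParseRows(html))
            Console.WriteLine(string.Join(" | ", cells));
    }
}
```

Each returned array could then be passed to `dataGridView1.Rows.Add(...)`, which accepts the cell values for a new row, instead of assigning cells one by one.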
Sergey Berezovskiy