
Here is my function, which uses a regex. It works correctly, but it extracts the tags very slowly. I think it is searching the HTML character by character, which is why it is slow. Is there any way to speed it up?

    string s = Sourcecode(richTextBox6.Text);
    // Matches everything from <a ...> through </a> (tags included).
    Regex regex = new Regex("(?i)<a([^>]+)>(.+?)</a>");
    string gelen = s;
    string inside = null;
    Match match = regex.Match(gelen);
    if (match.Success)
    {
        inside = match.Value;
        richTextBox2.Text = inside;
    }
    string outputStr = "";
    foreach (Match ItemMatch in regex.Matches(gelen))
    {
        Console.WriteLine(ItemMatch);
        inside = ItemMatch.Value;
        // Append each match followed by a line break.
        outputStr += inside + "\r\n";
    }
    richTextBox2.Text = outputStr;
  • Don't use regular expressions. Use a proper HTML parsing library like Html Agility Pack. You'll see a tenfold increase in speed. – Simon Whitehead Jan 09 '14 at 00:43
  • Any other ideas why it extracts the tags so slowly? – believeitornot Jan 09 '14 at 00:43
  • This could slow it down if there are lots and lots of items to append: `outputStr += inside + "\r\n";` – Valamas Jan 09 '14 at 00:43
  • I love these questions... For HTML regex parsing, refer to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Nico Jan 09 '14 at 00:45
  • Arghhhh! Yet another "Parsing HTML with regex problem" question! Does no one search any more? A search here for "regex html parsing" finds [Using regular expressions to parse HTML: why not?](http://stackoverflow.com/q/590747) as the top result. – Ken White Jan 09 '14 at 00:46
  • I provided a sample of another way to get information from HTML in another one of your questions yesterday. http://stackoverflow.com/a/20984718/1967692 – jiverson Jan 09 '14 at 00:51

2 Answers


Change outputStr to a StringBuilder; if you are appending a large number of items, this will speed things up. As already mentioned, parsing HTML with a regex can be a problem (it depends a lot on your input).
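A minimal sketch of that change, keeping the regex from the question and only swapping the string concatenation for a StringBuilder. The class and method names (`AnchorExtractor`, `ExtractAnchors`) are hypothetical and exist only for illustration:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class AnchorExtractor
{
    // Same pattern as in the question; RegexOptions.IgnoreCase stands in for the inline (?i).
    private static readonly Regex AnchorRegex =
        new Regex("<a([^>]+)>(.+?)</a>", RegexOptions.IgnoreCase);

    // Returns every matched <a ...>...</a> element, each followed by "\r\n".
    public static string ExtractAnchors(string html)
    {
        var output = new StringBuilder();
        foreach (Match match in AnchorRegex.Matches(html))
        {
            // Append to the builder instead of creating a new string on every iteration.
            output.Append(match.Value).Append("\r\n");
        }
        return output.ToString();
    }
}
```

With this in place, the loop in the question would reduce to something like `richTextBox2.Text = AnchorExtractor.ExtractAnchors(s);`. How much it helps depends on how many matches the page produces.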

Dweeberly

The trouble with parsing HTML is that it isn't an exact science. If you were parsing XHTML, things would be a lot easier.
Because HTML isn't necessarily well-formed XML, you will run into lots of problems trying to parse it.
It almost needs to be done on a site-by-site basis.

You should not parse HTML using Regex. (You can, however, use a compiled Regex in your code above to make it a bit quicker.)
Regex is not built for parsing HTML. You can instead use a third-party library built specifically for this purpose.
List of HTML Parsing Libraries
If you don't want to use third-party libraries, you can use System.Windows.Forms.WebBrowser for this purpose.
You can also use Fizzler, which builds on HTML Agility Pack and adds jQuery-style selector support. Then there is the Majestic-12 HTML parser, which is very quick.
You can also use TidyNet.Tidy to convert the HTML to XHTML, and then use an XML parser.
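As a rough sketch of the parser-based approach, the following uses HTML Agility Pack (the third-party `HtmlAgilityPack` NuGet package) to collect the same `<a>` elements the question extracts with a regex. The class and method names (`AnchorParser`, `GetAnchors`) are hypothetical, and the sketch assumes the page source is already available as a string:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, not part of the .NET base class library

public static class AnchorParser
{
    // Returns the outer HTML (tags included) of every <a> element in the document.
    public static List<string> GetAnchors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var anchors = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                anchors.Add(node.OuterHtml);
            }
        }
        return anchors;
    }
}
```

The result could then be joined for display, for example `richTextBox2.Text = string.Join("\r\n", AnchorParser.GetAnchors(s));`. Unlike the regex, this also handles anchors whose content spans several lines.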

See the following example of how improper use of Regex can degrade performance.

Pratik Singhal