I am trying to fix a regex that searches for HTML elements with an attribute named "langtoken", as well as variants such as "langtoken_title". If I search for langtoken on its own, using a word boundary, it returns the results from the ~1,750,000-character string in around 0.2 seconds. However, if I omit the word boundary in order to capture the langtoken_title attributes, this spikes to around 95 seconds.

The regex I originally had was

<([^>\s]+) [^>]*langtoken([^>]*?(\\*)?/>|.*?<(\\*)?/\1>)

So far my attempts have changed it to

<([^>\s]+) [^>]*langtoken(?:\b|_)(?:[^>]*/>|.*?</\1>)
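For reference, here is a minimal Python sketch of what the revised pattern matches on a tiny sample. The sample markup and the name `find_elements` are made up for illustration, and the real code may run on a different regex engine, so this only shows the matching behaviour, not the timing.

```python
import re

# The revised pattern from above, unchanged.
PATTERN = re.compile(r'<([^>\s]+) [^>]*langtoken(?:\b|_)(?:[^>]*/>|.*?</\1>)')


def find_elements(html):
    """Return the full text of every element the pattern matches."""
    return [m.group(0) for m in PATTERN.finditer(html)]


# Hypothetical sample: one paired element and one self-closing element.
sample = '<span langtoken="a">Hello</span><img langtoken_title="b" src="x.png"/>'
for element in find_elements(sample):
    print(element)
```

Both the plain langtoken attribute and the langtoken_title variant are picked up here, so the alternation `(?:\b|_)` does what I intended on small input.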

I should note that in the string being searched (an html document) there are 1431 occurrences of elements with the langtoken attribute and only 5 of the langtoken_title attribute. I believe it is the near match that is causing the issue but I am not sure.
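One way to check those counts independently of the slow expression is a much simpler pattern that only looks for the attribute names. This is a sketch in Python under the assumption that the attribute name is always followed by `=`; the name `count_langtoken_attrs` is hypothetical, and the pattern would also count the same text if it appeared outside a tag.

```python
import re
from collections import Counter

# Matches "langtoken" or "langtoken_<suffix>" when followed by "=".
# The leading \b stops it matching inside a longer name such as "xlangtoken".
ATTR = re.compile(r'\b(langtoken(?:_\w+)?)\s*=')


def count_langtoken_attrs(html):
    """Count occurrences of each langtoken attribute name in the string."""
    return Counter(m.group(1) for m in ATTR.finditer(html))


# Hypothetical sample markup.
sample = (
    '<span langtoken="a">Hello</span>'
    '<img langtoken_title="b" src="x.png"/>'
    '<p langtoken="c">x</p>'
)
print(count_langtoken_attrs(sample))
```

Because this pattern has no nested quantifiers and no back-reference, it gives a baseline to compare the full expression against.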

This is my first foray into regex and any help would be appreciated in creating a more efficient expression.

Syzygy
  • It is to parse HTML before it is sent to the client to find these langtokens for internationalisation purposes. It is intended to capture the entire element which contains one of these langtoken attributes. The HTML is passed as a string and the regex runs against that to find a collection of matches. – Syzygy Nov 19 '14 at 11:32
  • You should perhaps read this answer: http://stackoverflow.com/a/1732454/1450890 – Rook Nov 19 '14 at 11:41
  • And then perhaps look at this: http://htmlagilitypack.codeplex.com/ – Rook Nov 19 '14 at 11:43
  • It's not recommended parsing html with regex, however would something without alternation like `<(\w+)\s+[^>]*langtoken.*?/\1?>` improve performance? Or split it in two regexes, one extra for the singleton tags. – Jonny 5 Nov 19 '14 at 11:45
  • Rook - What have I done... What unearthly task am I to undertake? I, sadly, am not the one who wrote the code initially; I am merely the one tasked with improving its performance. Jonny - Thanks for the advice; however, your suggestion did not return any results. I have considered using two expressions, but even searching for langtoken_ alone still takes the 95 seconds. XML parsing may be the only option. – Syzygy Nov 19 '14 at 11:58
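
As a rough illustration of the parser route raised in the comments, the sketch below uses Python's standard-library `html.parser` instead of HTML Agility Pack; that substitution is only for the sake of a self-contained example. The names `LangTokenFinder` and `find_langtokens` are hypothetical. Note that it reports the tag and attribute rather than capturing the entire element text, so it is a starting point and not a drop-in replacement for the regex.

```python
from html.parser import HTMLParser


class LangTokenFinder(HTMLParser):
    """Collects (tag, attribute name, value) for langtoken attributes."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags via the default handle_startendtag.
        for name, value in attrs:
            if name == "langtoken" or name.startswith("langtoken_"):
                self.hits.append((tag, name, value))


def find_langtokens(html):
    """Parse the string once and return every langtoken attribute found."""
    finder = LangTokenFinder()
    finder.feed(html)
    finder.close()
    return finder.hits


# Hypothetical sample markup.
sample = '<span langtoken="a">Hello</span><img langtoken_title="b" src="x.png"/>'
print(find_langtokens(sample))
```

The parser walks the input once and never backtracks over element content, which is the main reason this approach tends to be suggested over a single large expression.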

0 Answers