
I receive XML as a plain, unformatted text blob. I have to make some replacements, and I use regex find-and-replace. For example:

<MeasureValue><Text value="StartCalibration" /></MeasureValue>

has to be converted to

<MeasureValue type="Text" value="StartCalibration"/>

The regex I wrote was

<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>

And the replacement part was:

<MeasureValue type="$2" value="$3"/>

Here is a link showing the same.

The issue is that on a file with 370 such occurrences, I get an out-of-memory error. I have heard of so-called greedy regex patterns and wonder whether that could be what is plaguing me. If this is already memory-efficient, I will leave it as it is and try to increase the server memory. I have to process thousands of such documents.

EDIT: This is part of script for Logstash from Elasticsearch. As per documentation, Elasticsearch uses Apache Lucene internally to parse regular expressions. Not sure if that helps.
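To make the transformation concrete: the actual pipeline is Logstash (whose gsub is Ruby-based), but the pattern and replacement can be sanity-checked in any similar engine. A minimal illustrative sketch in Python:

```python
import re

# The pattern and replacement from the question; note Python uses \2/\3
# for backreferences in the replacement string.
pattern = r'<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>'
replacement = r'<MeasureValue type="\2" value="\3"/>'

sample = '<MeasureValue><Text value="StartCalibration" /></MeasureValue>'
print(re.sub(pattern, replacement, sample))
# <MeasureValue type="Text" value="StartCalibration"/>
```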

Toto
NotAgain

  • In what language are you implementing this regex? – CertainPerformance Apr 29 '20 at 02:28
  • This is part of script for Logstash from Elasticsearch. As per documentation, Elasticsearch uses Apache Lucene internally to parse regular expressions. Not sure if that helps. I will keep looking and add more concrete info if I find it. – NotAgain Apr 29 '20 at 02:33
  • While the pattern could be improved (you can make the pattern fail faster by using negative character sets instead of repeating lazily), I don't think it would cause catastrophic backtracking or anything like that; it's not *terribly* inefficient. But it's better to use an XML parser instead of a regex in these sorts of situations. – CertainPerformance Apr 29 '20 at 02:34
  • NotYetAgain, when asked for clarification, as @Certain has, it's best to respond by editing the question (and adding tags, if appropriate, as here), rather than elaborating in comments. Questions should be self-contained and readers should not be expected to read all comments in order to understand the question. – Cary Swoveland Apr 29 '20 at 03:34
  • You should be using an XML package to iterate through your tags and build a new tree, and minimize any regex application to small substrings within the text attributes. Also make sure you're not holding references to any objects you only need for a brief period; for instance, stacking up XML trees in a list... – Todd Apr 29 '20 at 04:36
  • Applying regex to html/xml can make some people lose their mind: https://stackoverflow.com/a/1732454/7915759 – Todd Apr 29 '20 at 04:44
  • @Todd that one is a classic. A gem in the annals of Stackoverflow. – NotAgain Apr 29 '20 at 05:00

1 Answer


As a rule of thumb, specificity is positively correlated with efficiency in regex. So, know your data and build something to surgically match it.

The more specific you make your regex, often by spelling the pattern out literally (and usually ending up with an ugly-looking expression), the fewer resources it will take, because there are fewer "possibilities" it can match in your data.

To be more precise, imagine we are trying to match a string

2014-08-26 app[web.1]: 50.0.134.125

Approaches such as

(.*) (.*) (.*)

leave it too open and prone to matching MANY different combinations, and thus take a LOT longer to process the explosion of possibilities. Check here: https://regex101.com/r/GvmPOC/1

On the other hand, you could spend a little more time building a more elaborate expression such as:

^[0-9]{4}-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]: [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$

and I agree, it is horrible but much more precise. It won't waste your precious resources finding unnecessary stuff. Check here: https://regex101.com/r/quz7fo/1

Another thing to keep in mind: operators such as * or + perform a scan, which, depending on the size of your string, can take some time. Also, whenever possible, specifying the anchors ^ and $ helps the engine avoid trying to find matches at every position within the same string.
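The difference between the loose and the specific pattern above can be observed directly. This is an illustrative benchmark sketch in Python (the engine in the original pipeline differs, so absolute numbers are not meaningful, only the comparison):

```python
import re
import timeit

line = "2014-08-26 app[web.1]: 50.0.134.125"

# The loose pattern: lots of backtracking as the three .* groups
# fight over where the two spaces fall.
loose = re.compile(r'(.*) (.*) (.*)')

# The specific pattern: each position admits only one character class,
# so the engine walks the string essentially once.
strict = re.compile(
    r'^[0-9]{4}-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]: '
    r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$')

for name, rx in [("loose", loose), ("strict", strict)]:
    t = timeit.timeit(lambda: rx.match(line), number=10_000)
    print(f"{name}: {t:.4f}s")
```

Both patterns match the sample line; only the work done per match differs.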


Bringing it to your reality...

If we have to use regex:

The million-dollar question is, how can we turn your regex into something more precise?

Since there is no limit to tag name lengths in XML... there is no way to make it utterly specific :(

  • We could specify exactly which characters to match and avoid . and \w; substituting something like [a-zA-Z] is preferable. Making use of negated classes such as [^"] would also help narrow down the range of possibilities.

  • Avoid * and ? and try to use a bounded quantifier {m,n} instead (although I don't know your data well enough to make this decision, and as stated above, XML imposes no length limit here).

  • I didn't understand precisely what the lazy ? in your pattern was for, so removing it is one less thing to process.

I ended up with something like:

<(([a-zA-Z]+) value="([^"]*)"[^<>]*)>

Not many changes, though. You can measure it to see whether there is any improvement.
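For completeness, here is the tightened inner pattern placed back into the `<MeasureValue>` context from the question, sketched in Python (the real pipeline is Logstash, so treat this only as a way to verify the pattern's behavior):

```python
import re

# The answer's tightened pattern, wrapped in the <MeasureValue> context:
# [a-zA-Z]+ instead of \w*, [^"]* instead of .*?, [^<>]* instead of .*?
pattern = r'<MeasureValue><(([a-zA-Z]+) value="([^"]*)"[^<>]*)></MeasureValue>'
replacement = r'<MeasureValue type="\2" value="\3"/>'

sample = '<MeasureValue><Text value="StartCalibration" /></MeasureValue>'
print(re.sub(pattern, replacement, sample))
# <MeasureValue type="Text" value="StartCalibration"/>
```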

But perhaps the best approach is not to use regex at all. :( I don't know the language you are working with, but if processing time is becoming a problem, I would suggest dropping regex and trying an alternative.

If there is any possibility of using an XML parser, that would be preferable.
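As a sketch of the parser route, here is one way to do it with Python's standard-library ElementTree. The `<Doc>` wrapper root is hypothetical; it exists only because a parser needs a single root element around the fragments shown in the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical single-root wrapper around the question's fragment.
xml_blob = '<Doc><MeasureValue><Text value="StartCalibration" /></MeasureValue></Doc>'
root = ET.fromstring(xml_blob)

for mv in root.iter('MeasureValue'):
    inner = mv[0]                        # the wrapped element, e.g. <Text .../>
    mv.set('type', inner.tag)            # its tag name becomes the type attribute
    mv.set('value', inner.get('value'))  # hoist the value attribute
    mv.remove(inner)                     # drop the now-redundant child

out = ET.tostring(root, encoding='unicode')
print(out)
```

Unlike the regex, this never backtracks and is indifferent to attribute order, whitespace, or self-closing-tag style.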

https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions

Sorry if this isn't as conclusive as you might have expected, but the problem space really is that open.

Caio Oliveira
  • Thanks for replying. I will try to come up with a more precise, targeted regex, and will keep the option of throwing more memory at it for last. – NotAgain Apr 29 '20 at 05:06