0

I am writing a CSV parser and in order to detect the different data types I can expect to get on the files, I have a Map for each data type, each containing the regexes that I defined as valid and recognizable. For instance, for my Integer values, here is my Map:

Map<String, String> integerFormatRegexps = new HashMap<String, String>();
integerFormatRegexps.put("^[1-9]\\d{1,9}$", "##0");
integerFormatRegexps.put("^-[1-9]\\d{1,9}$", "-##0");
integerFormatRegexps.put("^0$", "0");

Now, I've seen several examples here in SO where instead of having these regexes separated, they use Alternations, where instead of three regex, I could use just one:

Map<String, String> integerFormatRegexps = new HashMap<String, String>();
integerFormatRegexps.put("^[1-9]\\d{1,9}$|^-[1-9]\\d{1,9}$|^0$", "Integer");

My questions is which of the two approaches would be more efficient in general, when matching patterns in Java? Iterating through the separate simpler regexes to find a match, or matching against just one, more complex regex?

Community
  • 1
  • 1
carlossierra
  • 4,479
  • 4
  • 19
  • 30
  • 3
    what prevents you from benchmarking both variants by yourself? – Iłya Bursov Apr 27 '16 at 22:24
  • 1
    Here is a [similar question for Perl](http://stackoverflow.com/questions/36420517/is-it-faster-to-use-alternation-than-subsequent-replacements-in-regular-expressi). *Summary*: **Keep in mind that you're comparing apples and oranges**. – Wiktor Stribiżew Apr 27 '16 at 22:34
  • @Lashane from other posts here in SO (including the one referenced by Wiktor, this seems like a very specific topic, and one requiring a lot of experience. Indeed I could measure it myself. But what should I measure? Maybe my question seems very specific because I tried to provide an example, but it is a general question as the name implies: will using alternations hit performance in java regex matching? – carlossierra Apr 28 '16 at 00:34

1 Answers1

1

First, I have to say it's unlikely that efficiency will ever be an issue for you. Your regexes are relatively simple, and you appear to be using them to match the values in isolation, after they've been extracted.

That said, the thing to watch out for with alternations is when different branches can match the same characters. The best example of this is (.|\s), sometimes used by regex beginners who don't yet know about DOTALL/Singleline mode (or [\S\s], as in this question). Put that in the middle of an otherwise benign regex, use it to search a not-particularly-large text, and watch your computer go catatonic.

Your sample regex is fine, though, because every branch has to start with something different ([1-9], -, or 0). But as I said, I don't think the efficiency of the regexes is ever going to be a concern. Do whatever you think is more convenient; are more compact code and a smaller number of regexes a reasonable trade-off for larger, harder-to-maintain regexes?

One more thing: if you're using the regexes repeatedly in a tight loop, be sure to use cached Pattern objects; the cost of compiling them is significant. In fact, consider storing them in your Map as Patterns rather than Strings. For example:

Pattern integerRegex = Pattern.compiile("^[1-9]\\d{1,9}$|^-[1-9]\\d{1,9}$|^0$");

Map<String, String> integerFormatRegexps = new HashMap<String, String>();
integerFormatRegexps.put(integerRegex, "Integer");

Then you can use the static Pattern.matches() method to perform the check.

Community
  • 1
  • 1
Alan Moore
  • 73,866
  • 12
  • 100
  • 156