I have lines of data coming from a script which typically look like this (single line example):
1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3
I need to break each line into key-value items for a Map, with the items separated by a separator string (; in the example, but it can be a custom one too). The first 4 items are static, meaning that only the value is in the line and the keys are already known. The rest is a variable number of key-value items (0 or more key=value chunks). Please take a look at the output below first to get an idea.
I already have two working methods to accomplish that, and both produce the same output for the same line. I have set up a test class to demonstrate the two methods along with some (simple) performance analysis, just out of curiosity. Note that invalid input handling is minimal in the methods shown below.
String Splitting (using Apache Commons):
private static List<String> splitParsing(String dataLine, String separator) {
    List<String> output = new ArrayList<String>();
    long begin = System.nanoTime();
    String[] data = StringUtils.split(dataLine, separator);
    if (data.length >= STATIC_PROPERTIES.length) {
        // Static properties (always there).
        for (int i = 0; i < STATIC_PROPERTIES.length; i++) {
            output.add(STATIC_PROPERTIES[i] + " = " + data[i]);
        }
        // Dynamic properties (0 or more).
        for (int i = STATIC_PROPERTIES.length; i < data.length; i++) {
            String[] fragments = StringUtils.split(data[i], KEYVALUE_SEPARATOR);
            if (fragments.length == 2) {
                output.add(fragments[0] + " = " + fragments[1]);
            }
        }
    }
    long end = System.nanoTime();
    output.add("Execution time: " + (end - begin) + "ns");
    return output;
}
Regex (using JDK 1.6):
private static List<String> regexParsing(String dataLine, String separator) {
    List<String> output = new ArrayList<String>();
    long begin = System.nanoTime();
    Pattern linePattern = Pattern.compile(StringUtils.replace(DATA_PATTERN_TEMPLATE, SEP, separator));
    Pattern propertiesPattern = Pattern.compile(StringUtils.replace(PROPERTIES_PATTERN_TEMPLATE, SEP, separator));
    Matcher lineMatcher = linePattern.matcher(dataLine);
    if (lineMatcher.matches()) {
        // Static properties (always there).
        for (int i = 0; i < STATIC_PROPERTIES.length; i++) {
            output.add(STATIC_PROPERTIES[i] + " = " + lineMatcher.group(i + 1));
        }
        // Dynamic properties (0 or more), captured as one group by the line pattern.
        Matcher propertiesMatcher = propertiesPattern.matcher(lineMatcher.group(STATIC_PROPERTIES.length + 1));
        while (propertiesMatcher.find()) {
            output.add(propertiesMatcher.group(1) + " = " + propertiesMatcher.group(2));
        }
    }
    long end = System.nanoTime();
    output.add("Execution time: " + (end - begin) + "ns");
    return output;
}
Main method:
public static void main(String[] args) {
    String input = "1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3";
    System.out.println("Split parsing:");
    for (String line : splitParsing(input, ";")) {
        System.out.println(line);
    }
    System.out.println();
    System.out.println("Regex parsing:");
    for (String line : regexParsing(input, ";")) {
        System.out.println(line);
    }
}
Constants:
// Common constants.
private static final String TIMESTAMP_KEY = "TMST";
private static final String GROUP_KEY = "GROUP";
private static final String VARIABLE_KEY = "VARIABLE";
private static final String VALUE_KEY = "VALUE";
private static final String KEYVALUE_SEPARATOR = "=";
private static final String[] STATIC_PROPERTIES = { TIMESTAMP_KEY, GROUP_KEY, VARIABLE_KEY, VALUE_KEY };

// Regex constants.
private static final String SEP = "{sep}";
private static final String PROPERTIES_PATTERN_TEMPLATE = SEP + "(\\w+)" + KEYVALUE_SEPARATOR + "(\\w+)";
private static final String DATA_PATTERN_TEMPLATE = "(\\d+)" + SEP + "(\\w+)" + SEP + "(\\w+)" + SEP + "(\\d+\\.?\\d*)"
        + "((?:" + PROPERTIES_PATTERN_TEMPLATE + ")*)";
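A caveat about these templates: the separator is substituted into the regex as-is, so a custom separator that contains regex metacharacters (| or . for example) would change the meaning of the pattern instead of being matched literally. Below is a minimal sketch of quoting the separator first with Pattern.quote. The class name SeparatorQuoting, the method linePattern and the simplified TEMPLATE are made up for illustration and are not part of my test class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeparatorQuoting {
    private static final String SEP = "{sep}";
    // Simplified copy of the line template, with the "=" separator hard-coded.
    private static final String TEMPLATE =
            "(\\d+)" + SEP + "(\\w+)" + SEP + "(\\w+)" + SEP + "(\\d+\\.?\\d*)"
            + "((?:" + SEP + "\\w+=\\w+)*)";

    // Hypothetical helper: builds the line pattern for an arbitrary literal separator.
    static Pattern linePattern(String separator) {
        // Pattern.quote wraps the separator in \Q...\E so it is matched literally.
        // String.replace(CharSequence, CharSequence) does a plain, non-regex replacement.
        return Pattern.compile(TEMPLATE.replace(SEP, Pattern.quote(separator)));
    }

    public static void main(String[] args) {
        String line = "1234567890|group1|varname1|133333337|prop1=val1";
        Matcher m = linePattern("|").matcher(line);
        if (m.matches()) {
            System.out.println("GROUP = " + m.group(2));
            System.out.println("properties = " + m.group(5));
        }
    }
}
```

With the separator quoted, the same template should work for ; as well as for separators such as | or a multi-character string.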
Output from main method:
Split parsing:
TMST = 1234567890
GROUP = group1
VARIABLE = varname1
VALUE = 133333337
prop1 = val1
prop2 = val2
prop3 = val3
Execution time: 8695796ns
Regex parsing:
TMST = 1234567890
GROUP = group1
VARIABLE = varname1
VALUE = 133333337
prop1 = val1
prop2 = val2
prop3 = val3
Execution time: 1250787ns
Judging from the output (I ran the test multiple times), the regex method seems to be faster, even though I initially expected the splitting method to win. However, I'm not certain how representative this performance analysis is.
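Part of my doubt is that each method is timed over a single call, so the numbers probably include one-off costs such as class loading and JIT warm-up, and the method that runs first (the split one here) may pay more of that. Here is a rough sketch of what a somewhat fairer comparison could look like, under several assumptions: it uses only the JDK (Pattern.split stands in for StringUtils.split, which is not identical in behaviour), the separator is fixed to ;, the patterns are compiled once outside the timed region, and both methods are warmed up before being measured. The names ParseBench, viaSplit, viaRegex and averageNanos, as well as the iteration counts, are made up for illustration. A dedicated harness such as JMH would be more trustworthy than this hand-rolled loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseBench {
    static final String[] KEYS = { "TMST", "GROUP", "VARIABLE", "VALUE" };
    // Patterns are compiled once, outside the timed region.
    static final Pattern SPLIT = Pattern.compile(Pattern.quote(";"));
    static final Pattern LINE = Pattern.compile(
            "(\\d+);(\\w+);(\\w+);(\\d+\\.?\\d*)((?:;\\w+=\\w+)*)");
    static final Pattern PROP = Pattern.compile(";(\\w+)=(\\w+)");

    static List<String> viaSplit(String line) {
        List<String> out = new ArrayList<String>();
        String[] data = SPLIT.split(line);
        if (data.length < KEYS.length) {
            return out;
        }
        for (int i = 0; i < KEYS.length; i++) {
            out.add(KEYS[i] + " = " + data[i]);
        }
        for (int i = KEYS.length; i < data.length; i++) {
            String[] kv = data[i].split("=");
            if (kv.length == 2) {
                out.add(kv[0] + " = " + kv[1]);
            }
        }
        return out;
    }

    static List<String> viaRegex(String line) {
        List<String> out = new ArrayList<String>();
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return out;
        }
        for (int i = 0; i < KEYS.length; i++) {
            out.add(KEYS[i] + " = " + m.group(i + 1));
        }
        Matcher p = PROP.matcher(m.group(KEYS.length + 1));
        while (p.find()) {
            out.add(p.group(1) + " = " + p.group(2));
        }
        return out;
    }

    // Average time per call in nanoseconds over the given number of iterations.
    static long averageNanos(boolean useRegex, String line, int iterations) {
        long sink = 0;
        long begin = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += (useRegex ? viaRegex(line) : viaSplit(line)).size();
        }
        long elapsed = System.nanoTime() - begin;
        if (sink < 0) {
            System.out.println(sink); // never true; only keeps the results in use
        }
        return elapsed / iterations;
    }

    public static void main(String[] args) {
        String input = "1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3";
        // Warm-up runs, results discarded.
        averageNanos(false, input, 100000);
        averageNanos(true, input, 100000);
        System.out.println("split: " + averageNanos(false, input, 200000) + "ns/call");
        System.out.println("regex: " + averageNanos(true, input, 200000) + "ns/call");
    }
}
```

I have not included numbers from this sketch, since they would depend on the JVM and the machine.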
My questions are:
- Which of these two methods would be better or easier to work with for invalid input handling (e.g. a missing static item, an invalid format, etc.)?
- Which of these methods is less likely to produce unexpected behaviour?
- Why is the regex method faster? I would have assumed the opposite, since Matchers and Patterns must have somewhat more complex logic behind them. Is my performance analysis even representative?
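To make the first two questions more concrete, here is a rough sketch of what stricter validation could look like on the split side, returning a Map as described at the top. It is only an illustration under assumptions: the names LineParser and parse, and the choice to throw IllegalArgumentException, are made up. It uses String.split from the JDK with a quoted separator and a limit of -1 (so that empty fields are kept and can be rejected) rather than StringUtils.split, which, as far as I understand its documentation, treats the separator argument as a set of characters and merges adjacent separators.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class LineParser {
    private static final String[] STATIC_KEYS = { "TMST", "GROUP", "VARIABLE", "VALUE" };

    // Hypothetical strict parser: rejects the line instead of silently skipping bad parts.
    static Map<String, String> parse(String line, String separator) {
        // Quote the separator so it is matched literally; -1 keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(separator), -1);
        if (fields.length < STATIC_KEYS.length) {
            throw new IllegalArgumentException("Expected at least " + STATIC_KEYS.length
                    + " fields but got " + fields.length + ": " + line);
        }
        // LinkedHashMap keeps the items in the order they appear in the line.
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (int i = 0; i < STATIC_KEYS.length; i++) {
            if (fields[i].isEmpty()) {
                throw new IllegalArgumentException("Empty value for " + STATIC_KEYS[i] + ": " + line);
            }
            result.put(STATIC_KEYS[i], fields[i]);
        }
        for (int i = STATIC_KEYS.length; i < fields.length; i++) {
            // Split on the first '=' only, so a value may itself contain '='.
            int eq = fields[i].indexOf('=');
            if (eq <= 0 || eq == fields[i].length() - 1) {
                throw new IllegalArgumentException("Malformed property '" + fields[i] + "': " + line);
            }
            result.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "1234567890;group1;varname1;133333337;prop1=val1;prop2=val2;prop3=val3";
        for (Map.Entry<String, String> e : parse(input, ";").entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Unlike my two methods above, this sketch does not check that the timestamp and value fields are numeric, and a repeated property key would overwrite the earlier one. Those checks would have to be added to either approach.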