I'm quite new to regex and I have to split EDI files for a loader I'm developing. If you are not familiar with the format, here is an example of two segments (modified to show every case, so it's not a real example):
APD+EM2:0:16?'30::6+++++++DA'APD+EM2:0:1630::6+++++++DA'
The end of a segment is marked with ' unless it is preceded by the escape character, which is the question mark: ?' for example must not be treated as the end of a segment. + and : are the main delimiters (: is used when the data are composite, like an address).
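To make the segment rule concrete, here is a minimal sketch of the kind of split described above. The class and method names (SegmentSplitSketch, splitSegments) are hypothetical, and it assumes that a ? directly before a character always escapes it, so a sequence such as ??' is not handled correctly.

```java
import java.util.regex.Pattern;

public class SegmentSplitSketch {
    // An apostrophe that is not immediately preceded by the escape character '?'.
    // Assumption: "??'" (an escaped question mark followed by a real terminator)
    // does not occur; the lookbehind would misread it as an escaped apostrophe.
    private static final Pattern SEGMENT_TERMINATOR = Pattern.compile("(?<!\\?)'");

    static String[] splitSegments(String message) {
        // The default limit (0) drops the empty string after the last terminator.
        return SEGMENT_TERMINATOR.split(message);
    }

    public static void main(String[] args) {
        String message = "APD+EM2:0:16?'30::6+++++++DA'APD+EM2:0:1630::6+++++++DA'";
        for (String segment : splitSegments(message)) {
            System.out.println(segment);
        }
    }
}
```

With the example input this should give two segments, the first one still containing the escaped ?' sequence.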
Splitting into segments works fine, but I have issues with the other delimiters. I would like to get a String[] containing all the elements, even the empty ones, because I need to process them afterwards (insert into a DB). With the example above, I would like to get an array like this:
APD+EM2:0:16?'30::6+++++++DA
would transform into:
{"APD","EM2","0","16?'30","","6","","","","","","","DA"}
Currently, my code gives me an array like this:
{"APD","EM2","0","16?'30","6","DA"}
Can I please have some help with my regex? Making it handle consecutive delimiters such as ++ and :: is beyond my skills for now. I need to remove the escape characters as well, but I'll work on that on my own.
BTW, I need to process a lot of data - 300 GB of raw text - so if what I do is bad performance-wise, don't hesitate to tell me, for example if I should rather split on both + and : at the same time.
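For reference, a minimal sketch of what "splitting on both + and : at the same time" could look like. It is only an illustration under assumptions: the names EdiSplitSketch and splitElements are made up, the input is a single segment without its trailing ', and flattening elements and components into one array is acceptable. It uses Pattern.split with a negative limit, which is documented to keep trailing empty strings.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class EdiSplitSketch {
    // '+' or ':' that is not immediately preceded by the escape character '?'.
    private static final Pattern ANY_DELIMITER = Pattern.compile("(?<!\\?)[+:]");

    static String[] splitElements(String segment) {
        // A negative limit applies the pattern as many times as possible
        // and keeps empty strings, including trailing ones.
        return ANY_DELIMITER.split(segment, -1);
    }

    public static void main(String[] args) {
        String segment = "APD+EM2:0:16?'30::6+++++++DA";
        System.out.println(Arrays.toString(splitElements(segment)));
    }
}
```

On the example segment this should produce the 13-element array I'm after, empty strings included. Whether a single pass is actually faster than two chained splits on 300 GB of input is something I would still have to measure.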
The EDIFACT format is not something discussed a lot around here, and the few examples I found were not working for me.
Current code:
// '+' or ':' not preceded by the escape character '?'
private final String DATA_ELEMENT_DELIMITER = "(?<!\\?)\\+";
private final String DATA_COMPOSITE_ELEMENT_DELIMITER = "(?<!\\?):";

private String[] split(String segments) {
    return Stream.of(segments)
            // first split on '+', then split each piece on ':'
            .flatMap(Pattern.compile(DATA_ELEMENT_DELIMITER)::splitAsStream)
            .flatMap(Pattern.compile(DATA_COMPOSITE_ELEMENT_DELIMITER)::splitAsStream)
            .toArray(String[]::new);
}
EDIT: The code I'm running (BTW, I'm running it on Java 8; I'm not sure it makes a difference, though):
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class Split {

    public static void main(String[] args) {
        Split s = new Split();
        System.out.println(
            Arrays.toString(
                s.split("APD+EM2:0:16?'30::6+++++++DA'")
            )
        );
    }

    private static final Pattern DATA_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?)\\+");
    private static final Pattern DATA_COMPOSITE_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?):");

    private String[] split(String segments) {
        return Stream.of(segments)
                .flatMap(DATA_ELEMENT_DELIMITER::splitAsStream)
                .flatMap(DATA_COMPOSITE_ELEMENT_DELIMITER::splitAsStream)
                .toArray(String[]::new);
    }
}
Here is the output I get:
[APD, EM2, 0, 16?'30, , 6, DA']
EDIT 2:
When I run this code in an online Java 11 compiler, the output is correct, but on Java 8 it is not.
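To narrow this down, here is a small probe that could be run on both versions. It rests on an unverified guess: that splitAsStream treats an empty input string differently between Java 8 and later releases, which would explain why the empty elements produced by the first split vanish in the second flatMap. The class and method names (EmptyInputProbe, streamCount, arrayCount) are hypothetical.

```java
import java.util.regex.Pattern;

public class EmptyInputProbe {
    private static final Pattern COLON = Pattern.compile("(?<!\\?):");

    // Number of elements produced by the stream-based split.
    static long streamCount(String input) {
        return COLON.splitAsStream(input).count();
    }

    // Number of elements produced by the array-based split (empty strings kept).
    static int arrayCount(String input) {
        return COLON.split(input, -1).length;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // If these two counts differ for the empty string, the stream-based
        // split is what drops the empty elements.
        System.out.println("splitAsStream(\"\") count = " + streamCount(""));
        System.out.println("split(\"\", -1) length   = " + arrayCount(""));
    }
}
```

If the two counts differ on Java 8 but agree on Java 11, replacing splitAsStream with Arrays.stream(pattern.split(s, -1)) inside the flatMap calls would be one thing to try.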