
I'm quite new to regex and I have to split EDI files for a loader I'm developing. If you are not familiar with the format, here is an example of 2 segments (modified to show every case, so it is not real data):

APD+EM2:0:16?'30::6+++++++DA'APD+EM2:0:1630::6+++++++DA'

Segments end with ', unless the ' is preceded by the escape character, which is the question mark: ?' must not be treated as the end of a segment, for example. + and : are the main delimiters (: is used when the data is composite, like an address).

The split for the segments works fine, but I have issues with the other delimiters. I would like to have a String[] with all the elements, even if they are empty, because I need to process them afterwards (insert into a DB). With the example above, I would like to have an array like this:

APD+EM2:0:16?'30::6+++++++DA

would transform into:

{"APD","EM2","0","16?'30","","6","","","","","","","DA"}

Currently with my code, I get an array like this:

{"APD","EM2","0","16?'30","6","DA"}

Can I please have some help with my regex? Making it match ++ and :: is beyond my skills for now. I need to remove the escaping characters as well, but I'll work on that on my own.

BTW, I need to process a lot of data (300 GB of raw text), so if what I do is bad performance-wise, don't hesitate to tell me, for example whether I should split on both + and : at the same time.
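On splitting with both delimiters at the same time, here is a minimal sketch. It assumes the two delimiters can be treated alike (the flat array above suggests so) and that a single ? directly before a delimiter always escapes it. The class and method names are made up for illustration. A character class matches either delimiter in one pass, and the negative limit keeps every empty string, including trailing ones.

```java
import java.util.regex.Pattern;

class CombinedSplit {
    // Either delimiter, unless directly preceded by the release character '?'.
    // Assumption: "??+" (an escaped '?' followed by a real delimiter) does not occur.
    private static final Pattern ANY_DELIMITER = Pattern.compile("(?<!\\?)[+:]");

    static String[] split(String segment) {
        // A negative limit keeps all empty strings, trailing ones included.
        return ANY_DELIMITER.split(segment, -1);
    }
}
```

Calling `CombinedSplit.split("APD+EM2:0:16?'30::6+++++++DA")` returns 13 elements, with the empty ones preserved. Whether this is fast enough for 300 GB would need measuring on real data.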

The EDIFACT format is not something discussed a lot around here, and the few examples I found were not working for me.

Current code:

private final String DATA_ELEMENT_DELIMITER = "(?<!\\?)\\+";
private final String DATA_COMPOSITE_ELEMENT_DELIMITER = "(?<!\\?):";

private String[] split (String segments){       
    return Stream.of(segments)
            .flatMap(Pattern.compile(DATA_ELEMENT_DELIMITER)::splitAsStream)
            .flatMap(Pattern.compile(DATA_COMPOSITE_ELEMENT_DELIMITER)::splitAsStream)
            .toArray(String[]::new);
}

EDIT : The code I'm running - BTW, I'm running on Java 8, not sure it makes a difference though:

import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Stream;
public class Split {

    public static void main(String[] args) {
        Split s = new Split();
        System.out.println(
                Arrays.toString(
                    s.split("APD+EM2:0:16?'30::6+++++++DA'")
                )
            );
    }
    
    
    private static final Pattern DATA_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?)\\+");
    private static final Pattern DATA_COMPOSITE_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?):");
    
    private String[] split (String segments){       
        return Stream.of(segments)
                .flatMap(DATA_ELEMENT_DELIMITER::splitAsStream)
                .flatMap(DATA_COMPOSITE_ELEMENT_DELIMITER::splitAsStream)
                .toArray(String[]::new);
    }
}

Here is the output I get:

[APD, EM2, 0, 16?'30, , 6, DA']

EDIT EDIT

After trying to run this code in an online Java 11 compiler, the output is correct, but not on Java 8.

double-beep
  • In general, I'd recommend not using regular expressions to parse Edifact documents unless you can ensure that your character set does not use (variable-length) multi-byte encodings such as `UNOX`, `UNOY` or `KECA`. If you still want to use regular expressions, the whole expression can quickly become pretty unreadable, as [this](https://stackoverflow.com/questions/48111632/regular-expression-for-edi-file) and [that](https://mycsharp.de/forum/threads/105437/geloest-komplexes-regex-in-bezug-auf-edi?page=1#forumpost-3727017) sample showcase – Roman Vottner Mar 08 '21 at 15:51
  • Not clear what the problem is: feeding `"APD+EM2:0:16?'30::6+++++++DA"` into your code returns `[APD, EM2, 0, 16?'30, , 6, , , , , , , DA]` ([Ideone](https://ideone.com/BgsMuj)) – user15244370 Mar 08 '21 at 15:54
  • @Roman what would you recommend then? Frameworks? @user15244370 well... on my side that's not what is happening haha – Paddy Mariage Mar 08 '21 at 16:07
  • I'm currently working on a Java-native Edifact parser that might be open-sourced when it is ready (depends on my employer). It is basically just a port, with enhancements, of the [node-edifact](https://github.com/tdecaluwe/node-edifact) or [ts-edifact](https://github.com/RovoMe/ts-edifact) library. Unfortunately, neither of them really supports multi-byte encodings for now. Until then the best choices might be Smooks and X12 as recommended [here](https://stackoverflow.com/questions/2794262/how-go-i-parse-edifact-in-java) – Roman Vottner Mar 08 '21 at 16:20
  • Thanks for your answer. Does your project have a name, so I can check what becomes of it in the future and maybe integrate it into my project(s)? I have issues using libraries to read/parse Edifact because what I receive does not follow a standard format like X12; it's just raw data :/ I tried StAEDI first, but I have no UNA or UNB in my EDI, so it refuses to read it. I didn't try the others, but I guess it'll be the same for other frameworks. For this reason, regex is my way to go for now – Paddy Mariage Mar 08 '21 at 16:37

1 Answer


My first note is that for improved performance, you definitely want to compile the Patterns once and reuse the instance:

private static final Pattern DATA_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?)\\+");
private static final Pattern DATA_COMPOSITE_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?):");
// ...
.flatMap(DATA_ELEMENT_DELIMITER::splitAsStream)
.flatMap(DATA_COMPOSITE_ELEMENT_DELIMITER::splitAsStream)

Second, as @user15244370 mentioned, running your code does produce the output you are looking for. I ran it like this:

System.out.println(
    Arrays.toString(
        split("APD+EM2:0:16?'30::6+++++++DA'APD+EM2:0:1630::6+++++++DA'")
    )
);

and got the output:

[APD, EM2, 0, 16?'30, , 6, , , , , , , DA'APD, EM2, 0, 1630, , 6, , , , , , , DA']

Assuming there is some difference between what you have posted and what you are actually running, the documentation for splitAsStream mentions:

Trailing empty strings will be discarded and not encountered in the stream.
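To illustrate that note, here is a small sketch (hypothetical class and method names) comparing the default split with a negative limit. `Pattern.split(CharSequence, int)` with a negative limit keeps trailing empty strings, while the one-argument `split` drops them, as `splitAsStream` is documented to do.

```java
import java.util.regex.Pattern;

class TrailingEmpties {
    private static final Pattern PLUS = Pattern.compile("\\+");

    // Default behaviour: trailing empty strings are removed from the result.
    static String[] dropTrailing(String s) {
        return PLUS.split(s);
    }

    // Negative limit: the pattern is applied as often as possible
    // and no empty strings are discarded.
    static String[] keepAll(String s) {
        return PLUS.split(s, -1);
    }
}
```

With the input `"DA++"`, `dropTrailing` returns `[DA]` and `keepAll` returns `[DA, , ]`. Empty strings in the middle, as in `"6++DA"`, are kept by both.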


Are you doing any additional processing after the call to split? And how are you printing the array? Is it possible that the method you are using to print the String[] is removing empty strings? As far as I can tell, your implementation should function as you intend.
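One possible explanation for a Java 8 versus Java 11 difference, offered as an assumption rather than something confirmed here: on Java 8, `splitAsStream` appears to return an empty stream when its input is the empty string, so every empty element produced by the + split would vanish in the : split. Later versions return the empty string itself. The sketch below (hypothetical class and method names) sidesteps the question by using `Pattern.split(input, -1)` for both levels, since that method returns the input as a single element when nothing matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NestedSplit {
    private static final Pattern ELEMENT = Pattern.compile("(?<!\\?)\\+");
    private static final Pattern COMPONENT = Pattern.compile("(?<!\\?):");

    static String[] split(String segment) {
        List<String> out = new ArrayList<>();
        // Negative limit: keep empty elements, including trailing ones.
        for (String element : ELEMENT.split(segment, -1)) {
            // For an empty element no match is found, so split returns [""]
            // and the empty string survives the second pass.
            for (String component : COMPONENT.split(element, -1)) {
                out.add(component);
            }
        }
        return out.toArray(new String[0]);
    }
}
```

For `"APD+EM2:0:16?'30::6+++++++DA"` this yields 13 elements, with six empty strings between `6` and `DA`.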

Jeff Brower
  • Thanks for the compile tip; it will surely be better to compile each pattern once rather than every time. I tried moving the code out of the class because I wanted to be sure nothing else interferes with it. I added the code I'm using to the original post – Paddy Mariage Mar 10 '21 at 10:22