
I'm just learning how to use regexes.

I'm reading in a text file that is split into sections of two different sorts, demarcated by <:==]:> and <:==}:>. I need to know for each section whether it's a ] or a }, so I can't just do

Pattern pattern = Pattern.compile("<:==]:>|<:==}:>"); pattern.split(text);

Doing this:

Pattern pattern = Pattern.compile("<:=="); pattern.split(text);

works, and then I can just look at the first char in each substring, but this seems sloppy to me, and I think I'm only resorting to it because I'm not fully grasping something I need to grasp about regexes.

What would be the best practice here? Also, is there any way to split a string while leaving the delimiter in the resulting strings, so that each one begins with its delimiter?

EDIT: the file is laid out like this:

Old McDonald had a farm 
<:==}:> 
EIEIO. And on that farm he had a cow 
<:==]:> 
And on that farm he....
drew moore
  • My initial solution (enclosing the delimiter in a capturing group) appears not to work in Java (other languages like Python would have worked), so I need to rethink this. Could you provide a small sample file? I'm not quite sure I understand how exactly the sections are delimited. Are they surrounded by pairs of delimiters, or does a section start after one delimiter and end with the next delimiter? – Tim Pietzcker Nov 22 '13 at 11:33
  • @TimPietzcker Yeah I had the same realization. See my edit for an example of how the file's laid out. They are not pairs of delimiters, the end of each is signaled by the start of the next. Also, I should note that <:?:> signify several other types of tags – drew moore Nov 22 '13 at 11:38
  • So what exactly do you want as output? The section of text along with either a `]` or `}`? If so then what do you want for the first/last section that is not delimited? Do you need the section of text or is it enough to just have the delimiters? – OGHaza Nov 22 '13 at 11:52

1 Answer


It may be a better idea not to use split() for this. You could instead do a match:

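// Needs: java.util.ArrayList, java.util.List, java.util.regex.Matcher, java.util.regex.Pattern
// subjectString (used below) is the full text of the file.
List<String> delimList = new ArrayList<String>();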
List<String> sectionList = new ArrayList<String>();
Pattern regex = Pattern.compile(
    "(<:==[\\]}]:>)     # Match a delimiter, capture it in group 1.\n" +
    "(                  # Match and capture in group 2:\n" +
    " (?:               # the following group which matches...\n" +
    "  (?!<:==[\\]}]:>) # (unless we're at the start of another delimiter)\n" +
    "  .                # any character\n" +
    " )*                # any number of times.\n" +
    ")                  # End of group 2", 
    Pattern.COMMENTS | Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    delimList.add(regexMatcher.group(1));
    sectionList.add(regexMatcher.group(2));
} 
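As for the second part of the question (splitting while keeping the delimiter at the start of each piece), one option is to split on a zero-width lookahead, so that split() cuts just before each delimiter without consuming it. The following is only a minimal sketch under that assumption; the class name `DelimiterSplitSketch` and the helpers `splitKeepingDelimiters` and `sectionType` are hypothetical names made up for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DelimiterSplitSketch {
    // Zero-width lookahead: matches the empty position just before a delimiter,
    // so split() cuts there and the delimiter stays in the following piece.
    private static final Pattern BEFORE_DELIMITER =
        Pattern.compile("(?=<:==[\\]}]:>)");

    /** Splits text so that every piece after a cut starts with its delimiter. */
    static List<String> splitKeepingDelimiters(String text) {
        List<String> pieces = new ArrayList<>();
        for (String piece : BEFORE_DELIMITER.split(text)) {
            pieces.add(piece);
        }
        return pieces;
    }

    /** Returns ']' or '}' for a delimited piece, or '\0' if the piece has no leading delimiter. */
    static char sectionType(String piece) {
        if (piece.startsWith("<:==]:>")) return ']';
        if (piece.startsWith("<:==}:>")) return '}';
        return '\0'; // hypothetical convention for "no delimiter", e.g. text before the first tag
    }

    public static void main(String[] args) {
        String text = "Old McDonald had a farm\n<:==}:>\n"
            + "EIEIO. And on that farm he had a cow\n<:==]:>\n"
            + "And on that farm he....";
        for (String piece : splitKeepingDelimiters(text)) {
            char type = sectionType(piece);
            String label = (type == '\0') ? "(no delimiter)" : String.valueOf(type);
            System.out.println(label + " : " + piece.trim());
        }
    }
}
```

Unlike the match-based loop above, this keeps any text that precedes the first delimiter as its own piece, so the caller has to decide what to do with it.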
Tim Pietzcker
  • Looks like you grokked this completely. I think the answer to all your questions is Yes. For details, check out this [regular expressions tutorial by Jan Goyvaerts](http://www.regular-expressions.info/tutorial.html), especially the sections on [capturing groups](http://www.regular-expressions.info/brackets.html) and [lookaround assertions](http://www.regular-expressions.info/lookaround.html). As for your last question, can you be more specific? Perhaps in the form of another question since comments are not really well suited for this? – Tim Pietzcker Nov 22 '13 at 13:28
  • I like this example with the comments, but note that a regex that never changes is usually compiled once (as a static field) and reused many times. See also http://stackoverflow.com/questions/4935216/shouldnt-static-patterns-always-be-static and http://stackoverflow.com/questions/1360113/is-java-regex-thread-safe – Christophe Roussy Apr 19 '16 at 10:28
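
To illustrate the point about compiling once: a rough sketch of the answer's pattern held in a `static final` field and reused across calls. The class name `SectionParserSketch` and the method `parse` are hypothetical, and the regex is the answer's expression written without the free-spacing comments. `Pattern` objects are immutable and safe to share between threads, whereas a `Matcher` is not, so a fresh one is created per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionParserSketch {
    // Compiled once when the class is loaded, then shared by every call to parse().
    private static final Pattern SECTION = Pattern.compile(
        "(<:==[\\]}]:>)((?:(?!<:==[\\]}]:>).)*)", Pattern.DOTALL);

    /** Returns {delimiter, sectionText} pairs; text before the first delimiter is skipped. */
    static List<String[]> parse(String text) {
        List<String[]> sections = new ArrayList<>();
        Matcher matcher = SECTION.matcher(text); // per-call Matcher: Matchers are not thread-safe
        while (matcher.find()) {
            sections.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return sections;
    }

    public static void main(String[] args) {
        String text = "Old McDonald had a farm\n<:==}:>\n"
            + "EIEIO. And on that farm he had a cow\n<:==]:>\n"
            + "And on that farm he....";
        for (String[] section : parse(text)) {
            System.out.println(section[0] + " : " + section[1].trim());
        }
    }
}
```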