1

I'm trying to convert a string into a map of values using regex and known delimiters. The code I have works, but if one delimiter is a substring of another delimiter, the input is not parsed properly.

Let's cut straight to some sample input, erroneous output, expected output, and code!

Sample input: "Artist: foo bar foooo Title: bar fooo bar Dimensions: x z y Framed dimensions: y z x" (as you can see there is "Dimensions" and "Framed dimensions")

Erroneous output: {Artist:=foo bar foooo, Title:=bar fooo bar, Dimensions:=x z y, dimensions:=y z x} (Framed dimensions got caught under dimensions!)

Expected output: {Artist:=foo bar foooo, Title:=bar fooo bar, Dimensions:=x z y, Framed dimensions:=y z x}

Code example:

String DELIMITER = "[Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|[Dd]imensions:";
...
public Map<String, String> parseToMap(String str) {
    Map<String, String> itemMap = new LinkedHashMap<>();
    String[] infos = str.split("(?=" + DELIMITER + ')'); //split at delimiters
    for (String info : infos) {
        try {
            String[] tmp = info.split("(?<=" + DELIMITER + ')'); //split to key/val pair
            itemMap.put(tmp[0].trim(), tmp[1].trim());
        } catch (IndexOutOfBoundsException e) {
            //Skip if no key/val pair
        }
    }
    return itemMap;
}

I also feel like this is a bit hackish. If there is a more elegant solution, I'd be glad to hear it. Although I can always make a trip to CodeReview if we can just get this working for now :)

EDIT: I need every word from delimiter to delimiter, not just the word following a delimiter.

MeetTitan

3 Answers

3

Rather than a split operation, use this regex with two named capture groups:

(?<key>[\w\s]+:)\s*(?<value>.+?)\s*(?=(?:[Aa]rtist|[Tt]itle|(?:[Ff]ramed )?[Dd]imensions):|$)

Code:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "(?<key>[\\w\\s]+:)\\s*(?<value>.+?)\\s*(?=(?:[Aa]rtist|[Tt]itle|(?:[Ff]ramed )?[Dd]imensions):|$)";
final String string = "Artist: foo Title: bar Dimensions: x Framed dimensions: y";

final Pattern pattern = Pattern.compile(regex);
final Matcher m = pattern.matcher(string);

// Each match yields one key (including its trailing colon) and the value up to the next delimiter.
Map<String, String> itemMap = new LinkedHashMap<>();
while (m.find()) {
    itemMap.put(m.group("key"), m.group("value"));
}

System.out.println("itemMap: " + itemMap);
anubhava
2

Your regex is a non-consuming positive lookahead that tests each position inside a string, and thus, it can match overlapping strings.
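To see the overlap concretely, here is a minimal sketch that runs only the first split from the question on a shortened input. The class name OverlapDemo and the helper splitAtDelimiters are made up for illustration; the delimiter alternation is the one from the question.

```java
import java.util.Arrays;
import java.util.List;

public class OverlapDemo {
    // Same alternation as in the question.
    static final String DELIMITER =
        "[Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|[Dd]imensions:";

    // Hypothetical helper: split before every position where any delimiter starts.
    static List<String> splitAtDelimiters(String str) {
        return Arrays.asList(str.split("(?=" + DELIMITER + ")"));
    }

    public static void main(String[] args) {
        // The lookahead succeeds both at "Framed dimensions:" and at the
        // "dimensions:" inside it, so the key is cut into two pieces.
        for (String piece : splitAtDelimiters("Dimensions: x Framed dimensions: y")) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

On Java 8 or later this should print three pieces: "Dimensions: x ", "Framed " and "dimensions: y". The lone "Framed " piece has no key/value pair, which is why the original code silently drops it in the catch block.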

You may use a matching approach instead: capture the delimiter into Group 1, and then capture into Group 2 every following char that does not start any of the delimiters:

public static Map<String, String> parseToMap(String str) {
    String DESCRIPTION_DELIMITER = "[Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|[Dd]imensions:";
    Map<String, String> itemMap = new LinkedHashMap<>();
    Pattern p = Pattern.compile("(" + DESCRIPTION_DELIMITER + ")((?:(?!" + DESCRIPTION_DELIMITER + ").)*)"); // Group 1: delimiter (key), Group 2: text up to the next delimiter (value)
    Matcher m = p.matcher(str);
    while(m.find()) {
        itemMap.put(m.group(1).trim(), m.group(2).trim());
    }
    return itemMap;
}

The regex will look like

([Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|[Dd]imensions:)((?:(?![Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|[Dd]imensions:).)*)

Here,

  • ([Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|[Dd]imensions:) - Group 1 matching any of the delimiters
  • ((?:(?![Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|[Dd]imensions:).)*) - a tempered greedy token matching any char other than a line break char (.), 0+ occurrences (*), that does not start any of the delimiter character sequences.
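The tempered greedy token can also be looked at in isolation. The sketch below is only an illustration, with a hypothetical class TemperedTokenDemo and helper prefixBefore that are not part of the answer's code: it consumes chars one at a time, and before each char a negative lookahead checks that the stop word does not start at that position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedTokenDemo {
    // Hypothetical helper: return the part of input that precedes the first
    // occurrence of stopWord (or the whole input if stopWord never occurs).
    // Note that '.' does not match line breaks unless Pattern.DOTALL is set.
    static String prefixBefore(String input, String stopWord) {
        // Pattern.quote makes the stop word literal inside the lookahead.
        Pattern p = Pattern.compile("^(?:(?!" + Pattern.quote(stopWord) + ").)*");
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // The token stops right before "Framed dimensions:".
        System.out.println("[" + prefixBefore("x z y Framed dimensions: y", "Framed dimensions:") + "]");
    }
}
```

Because the lookahead is checked before every single char, the match ends as soon as a delimiter begins, which is what lets Group 2 above stop at "Framed dimensions:" rather than at the "dimensions:" inside it.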
Wiktor Stribiżew
0

If the input is always expected to be in the following format:
Artist: foo Title: bar Dimensions: x Framed dimensions: y

i.e., the "D" in the standalone "Dimensions" key is always capital and the "d" in "Framed dimensions" is always lowercase, you can use String DELIMITER = "[Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|Dimensions:"; instead of String DELIMITER = "[Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|[Dd]imensions:";
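As a rough sketch of that idea, the question's split-based approach can be kept as is with only the delimiter tightened. The class name CapitalDDemo is made up for illustration, and the try/catch is replaced by a length check. This relies entirely on the casing assumption above: an input containing "Framed Dimensions:" with a capital "D" would still be split in the wrong place.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CapitalDDemo {
    // Assumption: the standalone key is always "Dimensions:" with a capital D,
    // and the framed key is always written "Framed dimensions:".
    static final String DELIMITER =
        "[Aa]rtist:|[Tt]itle:|[Ff]ramed [Dd]imensions:|Dimensions:";

    static Map<String, String> parseToMap(String str) {
        Map<String, String> itemMap = new LinkedHashMap<>();
        // Split before each delimiter, then split each piece once after its delimiter.
        for (String info : str.split("(?=" + DELIMITER + ")")) {
            String[] tmp = info.split("(?<=" + DELIMITER + ")", 2);
            if (tmp.length == 2) { // skip pieces without a key/value pair
                itemMap.put(tmp[0].trim(), tmp[1].trim());
            }
        }
        return itemMap;
    }

    public static void main(String[] args) {
        System.out.println(parseToMap(
            "Artist: foo bar foooo Title: bar fooo bar Dimensions: x z y Framed dimensions: y z x"));
    }
}
```

With the sample input from the question, this should print the expected map, including the "Framed dimensions:" key.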

A J