This is a challenge about String manipulation.
The statement below forms the GROUP BY for a SQL statement. I would like to write a method called removeDuplicates() to remove the duplicate items.
E.g.
// Comma in quotes
String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK'";
String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US'";
String result = removeDuplicates(source);
assert result.equals(expected);
// Comma in quotes with escaped single quotes
String source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
String expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);
// Comma in parentheses
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);
// Comma in parentheses with parentheses
String source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);
// Combined
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);
I initially tried to 1) split the string on comma outside quotes (Splitting on comma outside quotes), 2) make the items unique, 3) then join them together.
However, it doesn't work when to_char(DATE, 'YYYY-MM-DD') appears in the String.
Could anyone come up with something or suggest any library that helps solving this problem? Thanks in advance.
ADDED:
If we don't worry about sub-queries, the difficult part is splitting the criteria into valid elements. Trimming and making them uniqueIgnoreCase is easy to achieve.
For the splitting, I think the combination of the below should cover all the scenarios:
- split by ,
- on each element, ignore checking comma within the first ( and the last )
- on each element, ignore checking comma within the first ' and the last '