2

This is a challenge about String manipulation.

The statement below forms the GROUP BY for a SQL statement. I would like to write a method called removeDuplicates() to remove the duplicate items.

E.g.

// Comma in quotes
String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK'";
String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US'";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in quotes with escaped single quotes
String source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
String expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in parentheses
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);

// Comma in parentheses with parentheses
String source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
String expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
String result = removeDuplicates(source);
assert result.equals(expected);

// Combined
String source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
String expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
String result = removeDuplicates(source);
assert result.equals(expected);

I initially tried to 1) split the string on comma outside quotes (Splitting on comma outside quotes), 2) make the items unique, 3) then join them together.

However, it doesn't work when to_char(DATE, 'YYYY-MM-DD') appears in the String.

Could anyone come up with something or suggest any library that helps solving this problem? Thanks in advance.

ADDED:

If we don't worry about sub-queries, the difficult part is splitting the criteria into valid elements. Trimming and making them uniqueIgnoreCase is easy to achieve.

For the splitting, I think the combination of the below should cover all the scenarios:

- split by ,
- on each element, ignore checking comma within the first ( and the last )
- on each element, ignore checking comma within the first ' and the last '
Community
  • 1
  • 1
Max
  • 1,064
  • 9
  • 23
  • If that's really as complicated as it's going to get, you can just ignore commas inside parentheses instead of inside quotes. – azurefrog Sep 02 '15 at 21:11
  • @azurefrog Thanks for the reply. Since "(NAME, CITY), Region" is invalid SQL, this is likely to work when sub-query is not used (sub-query e.g. NAME, (select count(*) from shipments ...), CITY). I'm going to give it a try but not entirely sure if it would run into other problems. – Max Sep 02 '15 at 21:28
  • @a_horse_with_no_name Sorry, I think I'll need to put the sql label back to this post because this requires valid SQL to work, and may even require advanced SQL knowledge. Please refer to my reply to azurefrog. Hope you don't mind. – Max Sep 02 '15 at 21:33
  • If it's possible to have sub-queries you can't just use matching. You'll want to use a real parser in that case. – azurefrog Sep 02 '15 at 21:47
  • @azurefrog Thanks a lot – Max Sep 02 '15 at 21:59
  • Question: Is `to_char(DATE, 'YYYY,MM,DD')` and `to_char(DATE,'YYYY,MM,DD')` (with and without space after comma) to be considered the same? Is `NAME`, `Name`, and `name` to be considered the same, given that SQL names are not case-sensitive? If any of those are "yes", then you need a SQL parser. – Andreas Sep 02 '15 at 22:30
  • @Andreas They don't need to be case sensitive. After the split, I can do a trim() and compareIgnoreCase() to solve space and case sensitive problems. – Max Sep 03 '15 at 12:55
  • @azurefrog Hi, just to let you know that I have added 2 valid scenarios. The 1st one has quotes without parentheses. I think I can ignore sub-queries for now. I'll do some more digging... – Max Sep 03 '15 at 13:02

3 Answers3

2

EDIT: Here is an update. It has been modified to ignore all sections between quotes and all sections between parentheses when searching for commas. It is not guaranteed to work for arbitrary SQL but passes all the situations you've described thus far.

EDIT: Updated code again to ignore close parentheses inside quotes

import java.util.ArrayList;
import java.util.Iterator;

public class Main
{
    private static final String GUID = "f61916a6-3859-4cda-ae2f-209ff3802831";

    public static void main(String args[])
    {
        // Comma in quotes
        String source = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', ADDRESS.CITY || ', UK', to_char(DATE, '(YYYY)MM,DD'), to_char(DATE, '(YYYY)MM,DD')";
        String expected = "ADDRESS.CITY || ', UK', ADDRESS.CITY || ', US', to_char(DATE, '(YYYY)MM,DD')";
        String result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in quotes with escaped single quotes
        source = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY', ADDRESS.CITY || ', UK''s CITY'";
        expected = "ADDRESS.CITY || ', UK''s CITY', ADDRESS.CITY || ', US''s CITY'";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in parentheses
        source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
        expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, 'YYYY-MM-DD'), CITY";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Comma in parentheses with parentheses
        source = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), NAME, CITY, to_char(DATE, 'YYYY-MM-DD')";
        expected = "NAME, to_char(DATE, ('YYYY,MM,DD')), to_char(DATE, 'YYYY-MM-DD'), CITY";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));

        // Combined
        source = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), NAME, to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY', CITY || ', UK'";
        expected = "NAME, to_char(DATE, 'YYYY,MM,DD'), to_char(DATE, ('YYYY-MM-DD')), CITY || ', UK', CITY || ', US''s CITY'";
        result = removeDuplicates(source);
        System.out.println(result.equals(expected));
    }

    private static String removeDuplicates(String source)
    {
        // Replace escaped quotes with a GUID to make it easier to parse
        source = source.replace("''", GUID);

        source = source + ','; // Hacky way to get the last part to show up

        ArrayList<String> elements = new ArrayList<String>();

        ArrayList<Character> charArray = new ArrayList<Character>();

        for (char c : source.toCharArray())
            charArray.add(c);

        Iterator<Character> itr = charArray.iterator();

        // Identify all the elements
        String thusFar = "";
        while (itr.hasNext())
        {
            char next = itr.next();

            if (next == ',')
            {
                thusFar = thusFar.trim();
                if (!elements.contains(thusFar))
                    elements.add(thusFar);
                thusFar = "";
                continue;
            }

            thusFar += next;

            // Ignore anything inside quotes
            if (next == '\'')
            {
                char c;
                while ((c = itr.next()) != '\'')
                {
                    thusFar += c;
                }
                thusFar += c;
                continue;
            }

            // Ignore anything inside parentheses
            if (next == '(')
            {
                while (true)
                {
                    char c = itr.next();
                    thusFar += c;

                    if (c == ')')
                        break;

                    // Ignore anything inside quotes inside parentheses (including a close paren)
                    if (c == '\'')
                    {
                        char c2 = itr.next();
                        while (c2 != '\'')
                        {
                            thusFar += c2;
                            c2 = itr.next();
                        }
                        thusFar += c2;
                    }
                }

                continue;
            }
        }

        // Combine all the elements back together
        String result = "";

        for (String s : elements)
            result += s + ", ";

        if (result.length() > 2)
        {
            result = result.substring(0, result.length() - 2);
        }

        // Put the escaped quotes back in
        result = result.replace(GUID, "''");

        return result;
    }
}
Daniel Centore
  • 3,220
  • 1
  • 18
  • 39
  • Thanks a lot for the reply. I have added 2 scenarios because I have identified comma in quotes without parentheses. Sorry about adding things afterwards that is not covered by your solution. Parentheses within quotes is very likely to happen in the 1st scenario. – Max Sep 03 '15 at 13:08
  • 1
    @Ming Please see my code update and let me know if I've missed any possible scenario. – Daniel Centore Sep 03 '15 at 17:12
  • Thanks for putting in so much effort. I tried to put () within () and then it failed. There could be double single quotes in a SQL string to escape single quotes. I have updated the scenarios 1, 3, 5 to make it even more challenging. The ADDED part is how I believe it should happen, but it doesn't mean it's the way we should write our solution. I'm considering 1) use regex to split , outside of '', then 2) join the ones being mistakenly split :D – Max Sep 04 '15 at 13:54
  • 1
    My current code works with all your test cases, even the updated ones. Is there a problem with it? – Daniel Centore Sep 05 '15 at 00:41
  • You are right. Thanks again and sorry about the confusion. If I try to put () with '' within (), e.g. "to_char(DATE, '(YYYY)MM,DD'), to_char(DATE, '(YYYY)MM,DD')", it doesn't work. If I want to solve my problem, it could be anything within single quotes. It would work if I could ignore everything inside the single quotes. Would you think it's possible to achieve this? – Max Sep 05 '15 at 17:41
  • This works really well. Thank you so much for all the effort you have put in. – Max Sep 06 '15 at 12:33
1

Better use a csv library, otherwise commas inside single or double quotes (which could be nested), quotes/commas after escapes, escapes which cancel escapes you need to deal with.
https://commons.apache.org/proper/commons-csv/

Regex can not deal with nested structures. Theoretically not possible.

  • Thanks. I'll do some investigation to see how I can make use of this library. – Max Sep 02 '15 at 21:54
  • CSV library wouldn't work. It will have the same split problem that you currently have. – Daniel Centore Sep 03 '15 at 16:46
  • ofcourse he should provide valid csv entries. my point is not to parse it by hand. –  Sep 03 '15 at 17:29
  • @ömeryılmaz This doesn't work for me either. I have tried the 5 CSVFormat they provided and non of them is suitable. – Max Sep 04 '15 at 13:57
  • @Ming, because entries should follow a format/convension, my answer was about parsing comma seperated fields. maybe it's not what you need. if you tell from where you get those strings, more specific answers may come. Forexample if you are trying to parse SQL, an sql parser library will do better. something like this http://zql.sourceforge.net/. it's using javacc, because job is too heavy/cumbersome for regex or hand written parsers. –  Sep 04 '15 at 14:59
  • @ömeryılmaz Thanks. Unfortunately we need something to work at run time. Initially I wouldn't worry about performance. Once we have a working solution, we can look into how to improve it. – Max Sep 04 '15 at 15:23
  • A dirty hack, use a character which can not appear inside fields to seperate them, and use split. Without knowing exact details of the format you want to parse, nothing else could be said. There are unescaped/unquoted commas in your strings which is not supposed to seperate fields, CSV will try to use them as field seperators. i don't know if it's worth to convert those strings to valid CSV entries because i don't know where they are coming and how they will be used. –  Sep 04 '15 at 16:05
0

If you dont have nested functions You can simply use regexp to tokenize string:

/([a-z_]+\([^\(\)]*?\))|([A-Z_]+)/g

and then remove duplicates. [a-z_]+ matches function name, \([^\(\)]*?\) matches function params - everything exept "(" and ")".The last part ([A-Z]+) matches uppercase field names.

For provided example it will produce something like this:

NAME
to_char(DATE, 'YYYY,MM,DD')
to_char(DATE, 'YYYY-MM-DD')
NAME
CITY
to_char(DATE, 'YYYY-MM-DD')
Maciej Stępyra
  • 310
  • 1
  • 6
  • Thanks for the reply. I have added 2 scenarios because I have identified comma in quotes without parentheses. Sorry about adding things afterwards that is not covered by your solution. – Max Sep 03 '15 at 12:52