RegEx for parsing CSV that contains JSON in a "column"

Question

I need to parse a comma separated text file where a line can contain a json as column value. The file also contains a header row.

I am trying to build a regular expression, so that I can parse the file and create a hashmap for each line and list of hashmaps for the file.

I realise that regex probably is going to be over complicated for this, but I don't know how else can I design the code to get the hashmap for each row.

I tried different ways - from examples on SO and other sites, but am unable to get a good working expression. They work on different parts but not on the whole data

    String[] values = readLine.split(",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))", -1);
    String[] values = readLine.split(",(?=(?:(?:[^'\",]*(?:'|\")){2})*[^'\",]*)", -1);
    String[] values = readLine.split(",(\"([^\"]|\"\")*\")", -1);
    String[] values = readLine.split(",(\\w+\\s)?(\"[^\"]+\"|\\w+)(\\(\\w\\d(,\\w\\d)*\\))?", -1);
    String[] values = readLine.split(",(?=(?:(?:[^'",]*(?:'|")){2})*[^'",]*$)", -1);

Below is the sample of data

id,apptoolID,apptoolUUID,accountNumber,accountName,name,description,mac,status,pGroups,oemCode,oemTagList,locationID,userCode,businessUnit,customerDescription,notification,ptdd,hdptdd,ptddSchedule,compLive,optInfo,config1,scriptDebugging,clearLocalStorage,created_at,updated_at

5703,2535,9e849f81-56c4-4415-b36e-b0fd370b3986,21126,"3 Way ChProduct",ZYX-21126TC-P1,"3 Way ChProduct - CHSH2112601DSHCL - Showroom",24-1c-04-0a-84-9c,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created\/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CH,"ChProduct,ALL,SHOP-SPECIFIC",CHSH2112601DSHCL,"{"oemCode":"222923"}",Showroom,Showroom,false,false,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":false,"OPT_IN_SP":false,"OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":false,"OPT_OUT_FA":false,"OPT_OUT_GT":false,"OPT_IN_TT":false}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
5704,4248,494af61a-54ca-4ff2-a82d-7e795fd671ac,21126,"3 Way ChProduct",ZYX-21126TC-P2,"3 Way ChProduct - CISW2112602DTHCL - Service Waiting Area",e0-d5-5e-63-51-b9,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created\/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":27,"name":"FLEX Shop TV Players","description":"FLEX Shop TV Players group","numberOfPlayers":169},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CI,"ChProduct,CADILLAC,ALL,SHOP-SPECIFIC",CISW2112602DTHCL,"{"oemCode":"222923"}",Service_Lounge,"Service Waiting Area",true,true,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":"PSA337,PSA335,PSA332,PSA338,PSA326,PSA323,PS212,PS213,PSA293,PSA289","OPT_IN_SP":"SPCH1807","OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":"044077,034160,029205,022398,019888,019881,019880,019860,017596,010857","OPT_OUT_FA":"FACA1900,FACA1903,FACA1904,FACA1902,FACA1901,FACA1704,FACA1705,FACA1702,FACA1703,FACA1700,FACA1701,FACA1625,FACA1624,FACA1623,FACA1602,FACA1601,FACA1603,FACA0025,FACA0024,FACA0023,FACA1621,FACA00020,FACA00019,FACA1622,FACH1701","OPT_OUT_GT":false,"OPT_IN_TT":"TT043,TT037,TT033,TT032,TT031,TT027,TT028,TT026,TT029,TT030"}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
5705,2537,94c4e9dc-e94e-4942-862a-1e4c98276f09,21126,"3 Way ChProduct",ZYX-21126TC-P3,"3 Way ChProduct - CHMA2112603MBHCL - Service Advisor Area & Service Waiting Area",24-1c-04-0a-84-b7,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created\/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CH,"ChProduct,ALL,SHOP-SPECIFIC",CHMA2112603MBHCL,"{"oemCode":"222923"}",Service,"Service Advisor Area & Service Waiting Area",false,false,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":false,"OPT_IN_SP":false,"OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":false,"OPT_OUT_FA":false,"OPT_OUT_GT":false,"OPT_IN_TT":false}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"

Basically, having trouble in getting this type of data ( inside [] ) as a single value. The file also has other type of data like nested quotes, commas inside quotes etc.

"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created\/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]"

Any help / suggestion on how to proceed is appreciated.

Thanks

I'm not very familiar with `JSON` formats. But if `JSON` can have commas in it this could be non-trivial. Even simple `CSV` files can be a challenge. I would probably try a third party parser like the Apache `CSVParser` class to see how that works. — WJS, May 07 '19 at 18:28
As @WJS says, the best way to approach this is using a CSV Parser. Regex is most certainly not the way to do this. — BlackPearl, May 07 '19 at 18:35
@adbdkb is it possible to have this CSV file with delimiter = | ? if yes, I can provide a nice solution(parser) — Sergey Bzhezitskiy, May 07 '19 at 19:34
@SergeyBzhezitskiy - I am getting the file from a different system, but yes, I believe I can request them to change the CSV delimiter to | or ~. Please let me know what the solution would be for one of those delimiter. Thanks — adbdkb, May 07 '19 at 20:09
@jaytea - I was looking at the [example](https://stackoverflow.com/questions/47162098/is-it-possible-to-match-nested-brackets-with-regex-without-using-recursion-or-ba) you have on SO and also your [site](http://www.drregex.com/) where you have generated regex's for seemingly impossible situations and was hoping that you may be able to solve this particular problem. Thanks — adbdkb, May 11 '19 at 19:23

wp78de · Answer 1 · 2019-05-07T23:17:11.857

If your input data would be clean, this could be simply solved easily. However, your JSON strings are not properly escaped, e.g. "{"OPT_IN_FT":false,"OPT_IN_PSA":"PSA337... is using a double quote to indicate a string in the CSV context as well as in the JSON context.

My suggestion is to use single quotes in the JSON context for your items and escape quotes in JSON string items.

Then you could easily adapt solutions like this or that one using regex or custom written parser code.

If you want to change the string delimiter in the CSV context to a single quote you could try something along these lines:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Test
{
    public static void main (String[] args) throws java.lang.Exception
    {
      String input = "...";
      Pattern re1= Pattern.compile("(?<=^|,)(?:\"(\\[.*?\\])\"|\"(\\{.*?\\})\"|\"([^\"\\{\\}\\[\\]]*)\")(?=,|$)", Pattern.DOTALL);
      Matcher matcher = re1.matcher(input);
      String singleQuoted = matcher.replaceAll("'$1$2$3'");

      Pattern re = Pattern.compile(",(?=(?:[^']*'[^']*')*[^']*$)");
      String[] parts = re.split(singleQuoted);
        for(int partsIdx = 0; partsIdx < parts.length; partsIdx++ ){
          System.out.println( "[" + partsIdx + "] = " + parts[partsIdx]);
        }
     }
}

There are obviously more elegant ways to do this.

The problem is - I am getting this file from a different system. I may be able to ask them to change the basic delimiter, but they cannot easily change the quotes in the data values in their process. — adbdkb, May 07 '19 at 20:17

score 0 · Answer 2 · answered May 08 '19 at 07:24

If the delimiter is '|' this CSV file might be parsed by

public List<Map<String, String>> parse(Path csvPath) throws IOException {
        Reader in = new FileReader(csvPath.toFile());

        Iterable<CSVRecord> records = CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withIgnoreEmptyLines(true)
                .withDelimiter('|')
                .withTrim()
                .withQuote(null)
                .parse(in);


        return StreamSupport
                        .stream(records.spliterator(), false)
                        .map(csvRecord -> csvRecord.toMap().entrySet().stream()
                                .collect(Collectors.toMap(
                                        e -> e.getKey(),
                                        e -> StringUtils.unwrap(e.getValue(), "\"")
                                )))
                        .collect(Collectors.toList());
}

you need the libs:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.6</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.9</version>
</dependency>

Please take a look at a project https://github.com/sbzDev/stackoverflow/tree/master/question56028130 with unit test

Thanks for the solution above. When I requested the delimiter to be changed from comma (,) to tilde (~) - I was able to get correct results back with existing code. So did not have to make any code changes in the current application — adbdkb, May 14 '19 at 18:37

score 0 · Answer 3 · answered Nov 19 '21 at 09:07

Though it's been 2+ years, I hope this answer can help somebody else who comes here from google.

My solution is reading the character one by one, and using the idea of brackets match which is from the introduction of "stack".

When it comes to a "{" or "[", push into stack.

When it comes to a "}" or "]", pop out from the stack.

When it comes to a "," , check the stack. If empty, then you've already read a column field, or else continue read(this comma is inside a json).

Here is the sample code by java, and I simply use a bracketCount variable instead of a real stack:

        File file = new File("D:/test.csv");
        FileReader fileReader = new FileReader(file);
        BufferedReader reader = new BufferedReader(fileReader);
        List<List<String>> results = reader.lines().map(str -> {
            List<String> lineResult = new ArrayList<>(256);
            char[] chars = str.toCharArray();
            int bracketCount = 0;
            StringBuilder fieldBuilder = new StringBuilder();
            for (int index = 0; index < chars.length; index++) {
                fieldBuilder.append(chars[index]);
                if (chars[index] == '{' || chars[index] == '[') {
                    bracketCount++;
                } else if (chars[index] == '}' || chars[index] == ']') {
                    bracketCount--;
                } else if (chars[index] == ',') {
                    if (bracketCount == 0) {
                        lineResult.add(fieldBuilder.substring(0, fieldBuilder.length() - 1));
                        fieldBuilder.setLength(0);
                    }
                } else if (index == chars.length - 1) {
                    lineResult.add(fieldBuilder.substring(0, fieldBuilder.length()));
                    fieldBuilder.setLength(0);
                }
            }
            return lineResult;
        }).collect(Collectors.toList());

PS: The performance are not considered.

And pray that the strings in the JSON don't contain any {, [, ] or } characters or your parsing code will fail — Robin, Nov 19 '21 at 09:11
@Robin I've tested it, and it works ok for me. Cause when it comes to a { or [, the bracketCount added everyTime, and when it comes to a } or ], the bracketCount substracted every time. — Kobe Lee, Nov 19 '21 at 09:19
That is because you tested it with the same number of opening and closing tags. E.g. this is perfectly valid JSON `{"foo": "bar["}` but will start confusing your code. — Robin, Nov 19 '21 at 10:37
@Robin Thanks for pointing out it. Maybe I should reconsider the solution. — Kobe Lee, Nov 24 '21 at 06:53

RegEx for parsing CSV that contains JSON in a "column"

3 Answers3