
I am creating an AWS Lambda function in Java to process a Kinesis Data Stream.

My current parsing setup involves:

  1. Convert each record to a string using UTF-8, as suggested in the AWS documentation
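            for (KinesisEvent.KinesisEventRecord rec : event.getRecords()) {
                // StandardCharsets.UTF_8 (java.nio.charset) avoids the checked UnsupportedEncodingException
                String stringRecords = new String(rec.getKinesis().getData().array(), StandardCharsets.UTF_8);

                pageEventList.add(pageEvent);
            }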
            
  2. Clean up characters using regex patterns
   a. non-ascii: "[^\\x00-\\x7F]";
   b. ascii-control-characters: "[\\p{Cntrl}&&[^\r\n\t]]";
   c. non-printable-characters: "\\p{C}";
  3. Wrap the JSON objects, which arrive without square brackets or separating commas, into a JSON array string
        int firstBeginningCurlyBracketIndex = cleanString.indexOf("{");
        if (firstBeginningCurlyBracketIndex != -1 ){
            cleanString = cleanString.substring(firstBeginningCurlyBracketIndex + 1);
            cleanString = "[{" + cleanString;
        }

        int lastIndexOfCurlyBracketIndex = cleanString.lastIndexOf("}");
        if (lastIndexOfCurlyBracketIndex != -1) {
            cleanString = cleanString.substring(0, lastIndexOfCurlyBracketIndex);
            cleanString = cleanString + "}]";
        }

        cleanString = cleanString.replaceAll("}\\{", "\\},\\{");
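
Steps 1 and 2 above can be sketched as one helper. This is a minimal sketch, not the code from the question: `RecordCleaner` and `decodeAndClean` are hypothetical names, and only the three regex patterns come from the list above. It decodes with `StandardCharsets.UTF_8.decode(...)` rather than `array()`, because `ByteBuffer.array()` returns the whole backing array regardless of the buffer's position and limit. Whether that difference matters for the buffers Lambda hands over is not something the question establishes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of any AWS library): decode one record payload and clean it.
class RecordCleaner {

    static String decodeAndClean(ByteBuffer data) {
        // duplicate() shares the bytes but has its own position, so the caller's buffer is untouched.
        // decode() reads only the bytes between position and limit.
        String text = StandardCharsets.UTF_8.decode(data.duplicate()).toString();

        return text
                .replaceAll("[^\\x00-\\x7F]", "")           // a. non-ASCII characters
                .replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "")  // b. ASCII control characters except CR, LF, TAB
                .replaceAll("\\p{C}", "");                  // c. Unicode category "Other"; this also removes CR, LF, TAB
    }

    public static void main(String[] args) {
        byte[] raw = "42{\"name\":\"banana\"}\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAndClean(ByteBuffer.wrap(raw)));
    }
}
```

Note that pattern c is a superset of pattern b once non-ASCII characters are gone, so line breaks and tabs do not survive the third pass. If they should be kept, pattern c would have to be dropped or narrowed.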

At this point, I use a regex to split the string into individual JSON objects and then parse each one. Reference: How to match string within parentheses (nested) in Java?

        String REGEX_BRACKET_PATTERN_TWO_LAYERS = "(\\{(?:[^}{]+|\\{(?:[^}{]+|\\{[^}{]*\\})*\\})*\\})";

        Pattern splitDelRegex = Pattern.compile(REGEX_BRACKET_PATTERN_TWO_LAYERS);
        Matcher regexMatcher = splitDelRegex.matcher(nonAsciiRemovedString);
        List<String> matcherList = new ArrayList<String>();
        while (regexMatcher.find()) {
            String perm = regexMatcher.group(1);
            matcherList.add(perm);
        }
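
The regex above only matches up to the nesting depth written into the pattern, and it counts braces that sit inside string values. As an alternative, here is a minimal sketch of a depth-counting scanner that pulls out every balanced top-level `{...}` span and ignores whatever lies between them. `JsonObjectSplitter` and `splitTopLevelObjects` are hypothetical names, and the approach is a swapped-in technique, not what the question uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: extract balanced top-level {...} spans from a noisy string.
class JsonObjectSplitter {

    static List<String> splitTopLevelObjects(String input) {
        List<String> objects = new ArrayList<>();
        int depth = 0;            // current brace nesting level
        int start = -1;           // index of the '{' that opened the current top-level object
        boolean inString = false; // true while inside a double-quoted string within an object
        boolean escaped = false;  // true right after a backslash inside a string

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue; // braces inside string values are not counted
            }
            if (c == '"' && depth > 0) {
                inString = true; // quotes in the junk between objects are ignored
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}' && depth > 0) {
                depth--;
                if (depth == 0) {
                    objects.add(input.substring(start, i + 1));
                }
            }
            // A stray '}' at depth 0 and any other character outside an object are skipped.
        }
        return objects;
    }

    public static void main(String[] args) {
        String noisy = "234454{\"name\":\"banana\"}GD~{\"name\":\"orange\"}]]H7";
        for (String object : splitTopLevelObjects(noisy)) {
            System.out.println(object);
        }
    }
}
```

This has limits: a stray `{` in the junk, or an unterminated string inside an object, swallows everything after it, and a bare `{}` in the noise is returned as an (empty) object. Each extracted span still needs to go through a real JSON parser.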

I have also tried using Gson and Jackson to parse the JSON array string produced by step 3 (ref: How to parse JSON in Java). Parsing works until a random invalid JSON fragment or stray string shows up in the data stream, at which point it throws an exception: java.lang.Exception: com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_ARRAY but was STRING at line 2 column 1 path $

The invalid JSON that causes this exception looks something like this:

[

 ...

  {
    "name": "banana"
    "description": "description"
  },
  {
    "name": "orange"
    "description": "description"
  }
GD~
{}
FDSE-}
]

My questions are:

  1. The trailing garbage varies from record to record, so I am having difficulty turning the whole string into a valid JSON array. Does anybody have a good idea for making sure this JSON array string is always valid?

  2. The regex-based steps described above do work for parsing the Kinesis Data Stream into JSON, although I still see the random string at the end. If anybody has experience with this parsing process, please share it with the community. I feel the AWS documentation on Lambda with Kinesis is not detailed enough to cover the whole parsing process.

In addition, I am aware that this could simply come down to the quality of the data in the stream. It would also be nice to hear about other people's experience handling their data in this situation.
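
One way to keep a single bad fragment from failing the whole batch, sketched under assumptions: parse each candidate object on its own and set aside the ones that fail, instead of parsing one big array. `TolerantParser` and `parseEach` are hypothetical names, and the parser is passed in as a function so the sketch does not depend on a particular JSON library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: apply a parser to each fragment and collect failures separately.
class TolerantParser {

    static <T> List<T> parseEach(List<String> fragments,
                                 Function<String, T> parser,
                                 List<String> rejected) {
        List<T> parsed = new ArrayList<>();
        for (String fragment : fragments) {
            try {
                T value = parser.apply(fragment);
                if (value != null) {
                    parsed.add(value);
                } else {
                    rejected.add(fragment); // a parser may signal "nothing here" with null
                }
            } catch (RuntimeException e) {
                rejected.add(fragment);     // keep going; one bad fragment should not stop the rest
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Integer parsing stands in for a JSON parser so the demo needs no external library.
        List<String> rejected = new ArrayList<>();
        List<Integer> parsed = parseEach(Arrays.asList("1", "GD~", "3"), s -> Integer.parseInt(s), rejected);
        System.out.println("parsed=" + parsed + " rejected=" + rejected);
    }
}
```

With Gson, the parser argument could be a lambda such as `s -> gson.fromJson(s, PageEvent.class)`, where `PageEvent` is an assumed class name. Gson's `JsonSyntaxException` is unchecked, so the `catch` above covers it. Jackson's `readValue` throws a checked exception, so that call would need to be wrapped inside the lambda. The rejected list can be logged or counted to judge how noisy the stream really is.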

uhdang
  • 4
    *"how to parse json..."* - not manually; use a framework and don't reinvent the wheel. Jackson, Gson, ... – Zabuzard Jul 23 '20 at 15:36
  • *"Because the string is not formatted as json when in a string form, using Gson or Jackson throws exception currently."* - What do you mean by that? Is your data valid JSON or not? If it is, the frameworks will be able to parse it. – Zabuzard Jul 23 '20 at 15:39
  • 1
    See https://stackoverflow.com/questions/2591098/how-to-parse-json-in-java – Stephen C Jul 23 '20 at 15:41
  • 2
    Curly brackets are legal in a JSON string value. Please show us the code where you use Gson or Jackson and it is failing to parse the above strings. – Stephen C Jul 23 '20 at 15:48
  • @Zabuzard So, there are non-ASCII characters and non-printable characters, and some random numbers like 234 also appear before and after the JSON object strings. So, when I ran the whole string through Gson and Jackson, it threw an exception. – uhdang Jul 23 '20 at 15:58
  • Your updated data is invalid JSON, which is why the libraries failed. It is not JSON that you are parsing here but JSON with some extra stuff. Why are you working with such weird data? Why can't you first get rid of the weird stuff and then parse the valid JSON with a library? – Zabuzard Jul 23 '20 at 16:29
  • @Zabuzard Yes, I am currently getting rid of the unnecessary part at the beginning and the end part. I will try parsing it afterwards. Thank you for the input ! – uhdang Jul 23 '20 at 16:43
  • Does this answer your question? [How to parse JSON in Java](https://stackoverflow.com/questions/2591098/how-to-parse-json-in-java) – Mickael Jul 23 '20 at 16:53
  • @Mickael thank you for the comment ! Unfortunately, the issue seems to be not about parsing valid json string, in which case Gson library or other parsing library would work, but handling data stream of json string with invalid json string. – uhdang Jul 23 '20 at 20:45
  • As this post is closed, I have opened a new one with better explanation and detail. If anybody has experience in working with Kinesis Data Stream with AWS Lambda in Java, please check this one out - https://stackoverflow.com/questions/63071209/how-to-parse-kinesis-data-stream-in-aws-lambda-java – uhdang Jul 25 '20 at 16:32

1 Answer


I tried this with the Gson library:

// Requires: com.google.gson.Gson, com.google.gson.reflect.TypeToken, java.util.Map
String jsonString = "{\"username\": \"apple2\", \"description\": \"this is an example{where problem is2}\" }";

// A TypeToken gives Gson the full generic type, so no unchecked cast from Object is needed
Map<String, String> o = new Gson().fromJson(jsonString, new TypeToken<Map<String, String>>() {}.getType());

System.out.println("Map object : " + o);
System.out.println("UserName : " + o.get("username"));
System.out.println("Description : " + o.get("description"));

Output :

Map object : {username=apple2, description=this is an example{where problem is2}}
UserName : apple2
Description : this is an example{where problem is2}
Ankit Chauhan
  • Thank you for giving it a try !! I have updated the question because the string I was dealing with also contained random numbers and non-printable characters. Should have included it in original question. Sorry for confusion – uhdang Jul 23 '20 at 16:02
  • 2
    Are the random numbers only at the start? You can easily strip the numbers and parse the remaining string. By the way, that's not valid JSON. – Ankit Chauhan Jul 23 '20 at 16:09
  • The string starts with random number and ends with random string. And there are json strings in between. i.e. 234454{ ... } { ... } { .... } { ... } { ... }]]H7 – uhdang Jul 23 '20 at 16:22
  • Currently trying to find out how to get rid of those in the front and at the end – uhdang Jul 23 '20 at 16:22
  • As I noted in my updated question, the cleaned-up portion of the JSON array string parses correctly with the Gson library, but a random string pops up and produces an invalid JSON object, which leads to the exception. I'll have to think about how to handle this exception .. – uhdang Jul 23 '20 at 20:47