I have a JSON file (.json) in Amazon S3. I need to read it and add a new field called Hash_index to each JsonObject. The file is very big, so I am using the GSON library's streaming reader to avoid an OutOfMemoryError while reading it. Below is my code.

  //Create the Hashed JSON
    public void createHash() throws IOException
    {
        System.out.println("Hash Creation Started");

        strBuffer = new StringBuffer("");


        try
        {
            //List all the Buckets
            List<Bucket>buckets = s3.listBuckets();

            for(int i=0;i<buckets.size();i++)
            {
                System.out.println("- "+(buckets.get(i)).getName());
            }


            //Downloading the Object
            System.out.println("Downloading Object");
            S3Object s3Object = s3.getObject(new GetObjectRequest(inputBucket, inputFile));
            System.out.println("Content-Type: "  + s3Object.getObjectMetadata().getContentType());



            //Read the JSON File
            /*BufferedReader reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
            while (true) {
                String line = reader.readLine();
                if (line == null) break;

               // System.out.println("    " + line);
                strBuffer.append(line);

            }*/

           // JSONTokener jTokener = new JSONTokener(new BufferedReader(new InputStreamReader(s3Object.getObjectContent())));
           // jsonArray = new JSONArray(jTokener);

            JsonReader reader = new JsonReader( new BufferedReader(new InputStreamReader(s3Object.getObjectContent())) );
            reader.beginArray();
            int gsonVal = 0;
            while (reader.hasNext()) {
                JsonParser  _parser = new JsonParser();
                JsonElement jsonElement =  _parser.parse(reader);
                JsonObject jsonObject1 = jsonElement.getAsJsonObject();
                //Do something



                StringBuffer hashIndex = new StringBuffer("");

                //Add Title and Body Together to the list
                String titleAndBodyContainer = jsonObject1.get("title")+" "+jsonObject1.get("body");


                //Remove full stops and commas
                titleAndBodyContainer = titleAndBodyContainer.replaceAll("\\.(?=\\s|$)", " ");
                titleAndBodyContainer = titleAndBodyContainer.replaceAll(",", " ");
                titleAndBodyContainer = titleAndBodyContainer.toLowerCase();


                //Create a word list without duplicated words
                StringBuilder result = new StringBuilder();

                HashSet<String> set = new HashSet<String>();
                for(String s : titleAndBodyContainer.split(" ")) {
                    if (!set.contains(s)) {
                        result.append(s);
                        result.append(" ");
                        set.add(s);
                    }
                }
                //System.out.println(result.toString());


                //Re-Arranging everything into Alphabetic Order
                String testString = "acarpous barnyard gleet diabolize acarus creosol eaten gleet absorbance";
                //String testHash = "057        1$k     983    5*1      058     52j    6!v   983     03z";

                String[]finalWordHolder = (result.toString()).split(" ");
                Arrays.sort(finalWordHolder);


                //Navigate through text and create the Hash
                for(int arrayCount=0;arrayCount<finalWordHolder.length;arrayCount++)
                {


                    if(wordMap.containsKey(finalWordHolder[arrayCount]))
                    {
                        hashIndex.append((String)wordMap.get(finalWordHolder[arrayCount]));
                    }

                }

                //System.out.println(hashIndex.toString().trim());

                jsonObject1.addProperty("hash_index", hashIndex.toString().trim()); 
                jsonObject1.addProperty("primary_key", gsonVal); 
                jsonObjectHolder.add(jsonObject1); //Add the JSON Object to the JSON collection

                jsonHashHolder.add(hashIndex.toString().trim());

                System.out.println("Primary Key: "+jsonObject1.get("primary_key"));

                //System.out.println(Arrays.toString(finalWordHolder));
                //System.out.println("- "+hashIndex.toString());

                //break;
                gsonVal++;
            }

            System.out.println("Hash Creation Completed");
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }

When I run this code, I get the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:407)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at HashCreator.createHash(HashCreator.java:252)
        at HashCreator.<init>(HashCreator.java:66)
        at Main.main(Main.java:9)
[root@ip-172-31-45-123 JarFiles]#

Line 252 is `result.append(s);`, which is inside the HashSet loop.

Previously, the OutOfMemoryError was thrown at line 254, `set.add(s);`, which is also inside the HashSet loop.

My JSON files are really big, ranging from gigabytes to terabytes. I have no idea how to avoid this issue.

Dongle
  • If the files are that big you simply **cannot** have them in memory. Think of another solution. – Boris the Spider Feb 04 '14 at 15:47
  • At the end of the loop you are calling `jsonObjectHolder.add` - as this is not a local variable I assume it is instance scoped. This means that you are holding onto all the objects you unmarshall from JSON in memory. You cannot do this - you have to stream the object back out again so the memory can be freed. – Boris the Spider Feb 04 '14 at 16:02
  • @BoristheSpider: Yes. But then how can I access the data in this `ArrayList` outside the loop? – Dongle Feb 04 '14 at 16:17
  • You simply cannot have a collection in memory holding all the data. You must look to file-based solutions. You can have _part_ of the data in memory, but it must be cleared out before you can load the next part. A database would do this for you automatically, and you could query it at will. Alternatively, you can look at things like a [B-tree](http://en.wikipedia.org/wiki/B-tree) to store structured, queryable data in a file. – Boris the Spider Feb 04 '14 at 16:30
  • @BoristheSpider: I am following your advice now. Will update you. – Dongle Feb 04 '14 at 16:32

1 Answer

Use a streaming JSON library like Jackson. Read in a few JSON objects, add the hash to each, and write them out. Then read in some more, process them, and write them out. Keep going until you have processed all the objects, so that only a small part of the file is in memory at any time.

http://wiki.fasterxml.com/JacksonInFiveMinutes#Streaming_API_Example

(See also this StackOverflow post: Is there a streaming API for JSON?)
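
The read, process, write loop can be sketched roughly as follows. This is only an illustration under some loud assumptions: to keep it dependency-free, each record is modeled as one line of text instead of a JSON object, the output is a tab-separated line, and the names `StreamingHashSketch`, `hashIndex`, and `process` are hypothetical. `wordMap` stands in for the lookup table from the question. With real JSON you would swap the line reader and writer for the streaming reader and writer of Gson or Jackson; the point is only that nothing is added to a collection such as `jsonObjectHolder`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: records are plain text lines, not JSON objects.
class StreamingHashSketch {

    // Builds the hash for one record: unique words in sorted order,
    // each looked up in wordMap; words without an entry are skipped.
    static String hashIndex(String text, Map<String, String> wordMap) {
        String cleaned = text.replaceAll("\\.(?=\\s|$)", " ")
                             .replace(',', ' ')
                             .toLowerCase();
        StringBuilder hash = new StringBuilder();
        // A TreeSet removes duplicates and sorts in one step.
        for (String word : new TreeSet<>(Arrays.asList(cleaned.trim().split("\\s+")))) {
            String code = wordMap.get(word);
            if (code != null) {
                hash.append(code);
            }
        }
        return hash.toString();
    }

    // Reads one record at a time and writes it straight back out as
    // primaryKey <TAB> hash <TAB> original text. No record is retained
    // after it has been written, so memory use stays flat.
    static long process(Reader in, Writer out, Map<String, String> wordMap)
            throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        long primaryKey = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            writer.write(primaryKey + "\t" + hashIndex(line, wordMap) + "\t" + line);
            writer.write('\n');
            primaryKey++;
        }
        writer.flush(); // the caller owns the streams, so flush but do not close
        return primaryKey;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> wordMap = new HashMap<>();
        wordMap.put("apple", "A1");
        wordMap.put("banana", "B2");

        StringWriter out = new StringWriter();
        long count = process(
                new StringReader("Banana, apple. banana\nno match here\n"), out, wordMap);
        System.out.print(out);
        System.out.println("Processed " + count + " records");
    }
}
```

In the real job the `Writer` would point at a file (or an upload back to S3) rather than a `StringWriter`, and anything you need to query afterwards would be read back from that output or from a database, as suggested in the comments.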

ahoffer