1

I have a json file that contains 500k Objects and this is its format:

"WORKORDER": [
            { 
                "Attributes": { 
                    "SITEID": {
                        "content": "BEDFORD"
                    },
                    "WONUM": {
                        "content": "1000"
                    },
                    "WOPRIORITY": {
                        "content": 2
                    },
                    "WORKTYPE": {
                        "content": "CM"
                    }
                }
            },
            { 
                "Attributes": { 
                    "SITEID": {
                        "content": "BEDFORD"
                    },
                    "WONUM": {
                        "content": "1000-10"
                    },
                    "WORKTYPE": {
                        "content": "CM"
                    }
                }
            }

Im geting the distinct values like this :

for (int i = 0; i < WORKORDER.length(); i++) {  
                         JSONObject obj = WORKORDER.getJSONObject(i); 

                         JSONObject att = obj.getJSONObject("Attributes"); 

                         if( att.has(col)){ // getting col from params in the servlet

                             JSONObject column = att.getJSONObject(col);
                           Object  colval =   column.get("content");

                           if(!(list.contains(colval))) {
                              out.println( colval); 
                               list.add(colval);

                             }

But it takes long time for only 5000 objects !

Is there any way to get the distinct values of any column without parsing the whole Json file, otherwise parsing only the column needed.

Anas Elbenney
  • 33
  • 1
  • 2
  • 11
  • 1
    You can use a `Set` instead of a `List`. (if `list` stand for a `List`). That would be a "simpler" code. Now to filter the JSON directly... unless you can change the JSON generation, you still need to read the JSON (`JSONObject` is just a parser). – AxelH Apr 17 '18 at 11:41
  • you can try to convert it into a csv, and then fetch the needed column – Pankaj Singhal Apr 17 '18 at 11:41
  • @AxelH will it accelerate the parse when using the set ??? – Anas Elbenney Apr 17 '18 at 12:45
  • @PankajSinghal but it will take time also for converting into csv than fetching !! – Anas Elbenney Apr 17 '18 at 12:46
  • Yes and no. It depends on the implementation you take. Using an `HashSet` would do it since it use buckets. For the moment, this is the `contains` method that kill you, it need to iterate the list for each element of the JSON. At first it's okay, but when you already have 1000 items, those will be iterated 400K times mores. – AxelH Apr 17 '18 at 12:47

1 Answers1

1

You are iterating on a JSON with 500k elements. For each element you check if it was previously added in a List. That means your logic will iterate the list 500k times.

Instead, you should use a HashSet, first, the Set prevent duplicated value. So you just need to set.add(value) but the most interesting is the fact that the instance have a constant complexity value. Since it used buckets to organize the value, it doesn't have to iterate the Set fully.

You can read more about that in amit answer's about How can a HashSet offer constant time add operation?

Note that HashSet gives amortized and average time performance of O(1), not worst case. This means, we can suffer an O(n) operation from time to time. So, when the bins are too packed up, we just create a new, bigger array, and copy the elements to it.

Note that to use any Hash#### implementation, you need to make sure the instance you store implements hashCode and equals correctly. You can find out more about this in the community post about What issues should be considered when overriding equals and hashCode in Java?.

Now for the solution :

Set<Object> sets = new HashSet<>();
for (int i = 0; i < WORKORDER.length(); i++) {  
    // ...
    Object  colval =   column.get("content");
    if(sets.add(colval)){ //`add` return true if it wasn't present already.
        out.println( colval); 
    }
}

I kept the Object type but this should be correctly typed, at least to be sure that those instance are implementing those methods as needed.

colval being an Object, it is possible it doesn't implements correctly the methods needed so I suggest you parse it correctly. You should use column.getString("content) instead or check the instance type.


To validate this, I have used a method to create a fake JSON:

public static JSONObject createDummyJson(int items) {
    JSONObject json = new JSONObject();
    JSONArray orders = new JSONArray();
    json.put("order", orders);

    JSONObject attributes;
    JSONObject item;
    JSONObject order;

    Random rand = new Random();
    String[] columns = {"columnA", "columnB", "columnC", "columnD"};

    for(int i = 0; i < items; ++i) {
        order = new JSONObject();
        attributes = new JSONObject();
        order.put("Attributes", attributes);
        orders.put(order);

        for(int j = 0; j < rand.nextInt(1000) % columns.length; ++j) {
            item= new JSONObject();
            long rValue = rand.nextLong();
            item.put("content",  j%3 == 0 ? ("" + rValue ) : rValue );
            attributes.put(columns[j], item);
        }
    }
    return json;
}

Then ran a basic benchmark for both method and had the following results :

public static void main(String[] args) throws Exception {
    final int jsonLength = 500_000;
    JSONObject json = createDummyJson(jsonLength);
    long start = System.currentTimeMillis();
    List<Object> list = parseJson(json);
    long end = System.currentTimeMillis();
    System.out.format("List - Run in %d ms for %d items and output %d lines%n", end-start, jsonLength, list.size());

    start = System.currentTimeMillis();
    Set<Object> set = parseJsonSet(json);
    end = System.currentTimeMillis();
    System.out.format("Set - Run in %d ms for %d items and output %d lines%n", end-start, jsonLength, set.size());
}

public static List<Object> parseJson(JSONObject json) {
    String col = "columnC";

    JSONArray array = json.getJSONArray("order");

    List<Object> list = new ArrayList<>();
    for (int i = 0; i < array.length(); i++) {
        JSONObject obj = array.getJSONObject(i);
        JSONObject att = obj.getJSONObject("Attributes");
        if (att.has(col)) { // getting col from params in the servlet
            JSONObject column = att.getJSONObject(col);
            Object colval = column.get("content");

            if (!(list.contains(colval))) {
                //System.out.println(colval);
                list.add(colval);
            }
        }
    }

    return list;
}

public static Set<Object> parseJsonSet(JSONObject json) {
    String col = "columnC";

    JSONArray array = json.getJSONArray("order");

    Set<Object> set = new HashSet<>();
    for (int i = 0; i < array.length(); i++) {
        JSONObject obj = array.getJSONObject(i);
        JSONObject att = obj.getJSONObject("Attributes");
        if (att.has(col)) { // getting col from params in the servlet
            JSONObject column = att.getJSONObject(col);
            Object colval = column.get("content");

            if (set.add(colval)) {
                //System.out.println(colval);
            }
        }
    }

    return set;
}

List - Run in 5993 ms for 500000 items and output 46971 lines
Set - Run in 62 ms for 500000 items and output 46971 lines

I even went to a JSON with 5M row (removed the List that would never end)

Set - Run in 6436 ms for 5000000 items and output 468895 lines

Important, remove the line that print in console every new insertion, it will reduce the execution time a bit.

AxelH
  • 14,325
  • 2
  • 25
  • 55
  • @AnasElbenney Please note the last part, use `column.get` isn't safe with this logic, use `getString("content")` instead. (Forgot you've shown the JSON...) – AxelH Apr 17 '18 at 13:14
  • but I dont know if the user choose a string column or number ! – Anas Elbenney Apr 17 '18 at 13:19
  • sorry it is doing it too long too, the same time for both (set and list) – Anas Elbenney Apr 17 '18 at 13:24
  • Ok @AnasElbenney I will generate a dummy JSON to do some testing. I will get back to you when this is done (this evening GMT probably). – AxelH Apr 17 '18 at 13:58
  • @AnasElbenney I have ran with a "similar" json and have validated the difference. If it didn't worked, then this is not the `Set` that takes time... – AxelH Apr 18 '18 at 08:30
  • but im calling from ajax this servlet to give me back the distinct values ! how i will doing it without printing? – Anas Elbenney Apr 18 '18 at 09:06
  • @AnasElbenney You have a request in Ajax sendind you a JSON and you directly print in the response output a line by line result (not a JSON) ? Why don't you generate a JSON from the `Set` ? Just check if this is not the print that is slow, at least we can confirm that this solve the first problem. – AxelH Apr 18 '18 at 09:08
  • but Im printing them as list in html like this :out.println( "
  • " +colval +"
  • "); – Anas Elbenney Apr 18 '18 at 09:15
  • when i do it with sets and just for 5k jsonobject it takes 17000ms !! its weird – Anas Elbenney Apr 18 '18 at 09:19
  • i also tried without printing anything and its lasts the same duration – Anas Elbenney Apr 18 '18 at 10:31
  • @AnasElbenney Can you try with the code I have used for the test ? It generate a "random" JSON. I would guess the problem is mostly to parse the `String` into the `JSONObject`. On what are you running this, a Raspberry ;) ? – AxelH Apr 18 '18 at 11:19
  • im receiving `JSONObjects` from Link , and doing `HttpURlconnection` and `BufferedReader` to read the json file! it seams that this method is solwing down the fetch ? – Anas Elbenney Apr 18 '18 at 11:39
  • That indeed a possibilities @AnasElbenney . Hard to tell. All I can say is that the example provided in the answer show a drastic improvement to read a JSON and store value in a `Collection`. For the rest. This required more information, you need to check what is taking so long ... all I can do is guess. A basic way would be to print `System.currentTimeInMillis()` on specific point to see how long it take. I can't tell you more than that. – AxelH Apr 18 '18 at 11:49