I have a huge list of JSON objects (3.23 million of them). I want to normalize this list and convert it to a DataFrame; after normalizing I end up with 400 fields. I am able to do the above steps (normalize, convert to DataFrame) for a few thousand JSON objects, but not for the entire list.

Here is how I built the list: I go through all the .json files in the folder and append every JSON object to an initially empty list, `data_full`:

```python
import os
import json

data_full = []
path = "a/b/c"
for file in os.listdir(path):
    full_path = os.path.join(path, file)

    # each .json file is newline-delimited: one JSON object per line
    with open(full_path) as f:
        for line in f:
            data_full.append(json.loads(line))
```

Given the size of the list, I want to divide it into 35 equal parts and create a new DataFrame for each part (df_1, df_2, ..., df_35). After a lot of searching, I could find how to split a huge list into a single list of lists (chunks), and how to convert a list to a DataFrame, but I could not find a way to split a huge list into multiple new lists *and convert each one of those lists into a new DataFrame*. The last bit is in italics because I think once I have the 35 new lists, converting each of them to a DataFrame should be easy.
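For reference, here is a minimal sketch of what I have in mind, combining the evenly-sized-chunks recipe linked in the comments with pandas normalization. `split_into` is just a name I made up for the helper, and I'm assuming a pandas version recent enough to expose `pd.json_normalize`:

```python
import pandas as pd

def split_into(lst, n):
    """Yield n contiguous chunks of lst whose sizes differ by at most 1."""
    k, m = divmod(len(lst), n)
    for i in range(n):
        start = i * k + min(i, m)
        yield lst[start : start + k + (i < m)]

# keep the DataFrames in one container instead of 35 separate variables
dfs = [pd.json_normalize(chunk) for chunk in split_into(data_full, 35)]
# dfs[0] plays the role of df_1, dfs[1] of df_2, and so on
```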

So the question is: how do I split this huge list into 35 new lists? If you have any other approach/suggestions for processing 3.23 million JSON objects to perform some NLP techniques on them, I would appreciate that too.
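For what it's worth, one alternative I'm considering (sketched below, with `BATCH_SIZE` as an arbitrary placeholder) is to never materialize the full 3.23-million-element list and instead normalize fixed-size batches while reading the files:

```python
import os
import json
import pandas as pd

path = "a/b/c"
BATCH_SIZE = 100_000  # arbitrary; tune to available memory

dfs, batch = [], []
for file in os.listdir(path):
    with open(os.path.join(path, file)) as f:
        for line in f:
            batch.append(json.loads(line))
            if len(batch) >= BATCH_SIZE:
                dfs.append(pd.json_normalize(batch))
                batch = []
if batch:  # whatever is left over after the last full batch
    dfs.append(pd.json_normalize(batch))
```

That would bound the memory used per `json_normalize` call, at the cost of never holding the single `data_full` list.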

Thanks in advance

Amy123
  • What do you mean by "jsons", do you mean python `dict`/`list` objects? Anyway, are you just asking how to chunk your list? Would this help: https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks ? – juanpa.arrivillaga Apr 02 '19 at 18:46
  • @juanpa.arrivillaga By jsons I mean many JSON objects, e.g. data = [huge_json_1, huge_json_2, ..., huge_json_3.2m], where each huge_json is a big nested JSON (for example, a Twitter data JSON). I do not want to chunk the list; I want to divide it into 35 new lists. Example: a = [1,2,3,4,5,6,7,8,9,10], and I want 5 new lists like this - a1=[1,2], a2=[3,4], a3=[5,6], a4=[7,8], a5=[9,10] – Amy123 Apr 03 '19 at 19:37
  • What you are describing *is* dividing the list into chunks. You should use a container, not 35 variables though. And again, by *json*, you don't mean `str` objects that are in the text-based serialization format, you mean *actual materialized data structures that were at one point represented in json but no longer are*, i.e. `list` and `dict` objects, correct? – juanpa.arrivillaga Apr 03 '19 at 19:44
  • @juanpa.arrivillaga Can you explain "a container, not 35 variables"? I used json.loads() in the previous step to read all the JSON in all the .json files I have in a folder, and appended each one to an empty list with this line of code - data_full.append(json.loads(line)). Now this data_full list is huge (3.23 million items). I am looking for a way to divide that into 35 new lists. – Amy123 Apr 05 '19 at 18:36
  • A container means something like a `list` or a `dict`. You don't make 35 variables `a0, a1, ..., a34`, just *make a list* `a` and use `a[i]` when you need it. In any case, you need to provide a [mcve] *in the question itself as formatted text*. Stop putting code in comments. – juanpa.arrivillaga Apr 05 '19 at 18:38
  • Ok, Thanks. I updated the question with code. – Amy123 Apr 05 '19 at 18:54
