How to split a MongoDB collection that has been exported to a .json file?

Question

I have a .json file that was obtained by exporting a MongoDB database collection. To perform machine learning training and testing on the data, I want to split the JSON file into two separate files according to a particular train-test ratio, which I have been unable to do in my Python notebook or in the MongoDB console.

I tried splitting the records manually in Notepad, but that doesn't give me the accuracy I need for the split.

I also tried converting the JSON file to a pandas DataFrame, but then I lose the format in which the data is stored: to_json() saves all the values of the first column first, then the second column, and so on, which I don't want.

My json file is available here!

skymon · Answer 1 · 2019-09-08T07:34:48.613

The problem with your file is that it is not valid JSON as a whole: it appears to contain one JSON document per line (the default mongoexport output) rather than a single array. If it were valid JSON, you could use json.loads() to get your entries as a Python list and split that list. For your file, one option is to convert it to valid JSON first. But if you want to keep the current form, you can simply split the file by lines. This should work:

import math

fname_all = 'reddit_india_using_mongoexport.json'
fname_train = 'reddit_india_using_mongoexport_train.json'
fname_test = 'reddit_india_using_mongoexport_test.json'

def file_len(fname):
    # Count lines; utf-8 is given explicitly so the platform default codec is not used
    with open(fname, encoding='utf-8') as f:
        return sum(1 for _ in f)

n_lines = file_len(fname_all)  # avoid shadowing the built-in len()

split_ratio = 0.8
n_train = math.floor(n_lines * split_ratio)  # the first n_train lines go to the train file

with open(fname_all, encoding='utf-8') as f, \
     open(fname_train, 'w', encoding='utf-8') as f_train, \
     open(fname_test, 'w', encoding='utf-8') as f_test:
    for i, line in enumerate(f):
        if i < n_train:
            f_train.write(line)
        else:
            f_test.write(line)

print('Original file:' + str(file_len(fname_all)))
print('Train file:' + str(file_len(fname_train)))
print('Test file:' + str(file_len(fname_test)))

It gives you:

Original file:8076
Train file:6460
Test file:1616
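Note that a split by line position keeps the records in export order, which may not be what you want for train/test data. A minimal sketch of a shuffled split follows, assuming the file holds one JSON document per line and is UTF-8 encoded. The function name split_json_lines, its parameters, and the seed value are illustrative, not part of any library.

```python
import json
import random


def split_json_lines(src, train_path, test_path, ratio=0.8, seed=42):
    """Shuffle the records of a line-delimited JSON file and split them."""
    # Assumption: the export is UTF-8 and has one JSON document per line
    with open(src, encoding="utf-8") as f:
        # Parsing each non-empty line also verifies that it is valid JSON
        records = [json.loads(line) for line in f if line.strip()]

    random.Random(seed).shuffle(records)  # fixed seed gives a reproducible split
    cut = int(len(records) * ratio)

    for path, part in ((train_path, records[:cut]), (test_path, records[cut:])):
        with open(path, "w", encoding="utf-8") as out:
            for rec in part:
                # Write one document per line, the same layout as the export
                out.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return cut, len(records) - cut
```

Calling split_json_lines('reddit_india_using_mongoexport.json', 'train.json', 'test.json') would return the two record counts. Because the records are re-serialized with json.dumps, whitespace inside each line may differ from the original export, but the content of each document is preserved.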

Why is it not a valid json? And how do you convert to a valid json? — user36160, Sep 08 '19 at 11:00
I'm getting the following error: `UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1436: character maps to <undefined>` — user36160, Sep 08 '19 at 18:18
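
On both comments: a file with one document per line is not a single JSON value, so parsing the whole file at once fails even though each line parses on its own. The 'charmap' error typically appears when open() falls back to the Windows default codec instead of UTF-8. A hedged sketch of a conversion to a JSON array, assuming the export is UTF-8 with one document per line (the function name jsonl_to_json_array is illustrative):

```python
import json


def jsonl_to_json_array(src, dst):
    """Rewrite a one-document-per-line export as a single JSON array."""
    # Explicit utf-8 avoids the platform default codec (e.g. cp1252 on Windows)
    with open(src, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    with open(dst, "w", encoding="utf-8") as out:
        # ensure_ascii=False keeps non-ASCII characters readable in the output
        json.dump(records, out, ensure_ascii=False)
    return len(records)
```

The resulting file can then be read in one go with json.load(). Alternatively, exporting again with mongoexport's --jsonArray option should produce an array directly.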
