I have a .json file containing data exported from a MongoDB collection. To train and test a machine learning model on the data, I want to split the JSON file into two separate files according to a given train-test ratio, but I have not been able to do this in my Python notebook or in the MongoDB console.

I tried splitting the records manually in Notepad, but that doesn't give me the accuracy I need for the split.

I also tried converting the JSON file to a pandas DataFrame, but then I lose the format in which the data is stored: to_json() writes all the values of the first column first, then those of the second column, and so on, which is not what I want.

My json file is available here!

user36160

1 Answer

The problem with your file is that it is not valid JSON. If it were, you could just use json.loads() to get your entries as a Python list and then split that list. For your file, one option would be to convert it to valid JSON first. But if you want to keep the current format, you can simply split the file by lines. This should work:

import math

fname_all = 'reddit_india_using_mongoexport.json'
fname_train = 'reddit_india_using_mongoexport_train.json'
fname_test = 'reddit_india_using_mongoexport_test.json'

def file_len(fname):
    """Return the number of lines in a file (0 if the file is empty)."""
    with open(fname) as f:
        return sum(1 for _ in f)

# Avoid naming this variable `len`, which would shadow the built-in.
n_lines = file_len(fname_all)

split_ratio = 0.8  # fraction of lines that go to the train file
n_train = math.floor(n_lines * split_ratio)

# The first n_train lines go to the train file, the rest to the test file.
# The `with` block closes all three files, even if an error occurs.
with open(fname_all) as f, \
        open(fname_train, "w") as f_train, \
        open(fname_test, "w") as f_test:
    for i, line in enumerate(f):
        if i < n_train:
            f_train.write(line)
        else:
            f_test.write(line)

print('Original file:' + str(file_len(fname_all)))
print('Train file:' + str(file_len(fname_train)))
print('Test file:' + str(file_len(fname_test)))

It gives you:

Original file:8076
Train file:6460
Test file:1616
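Note that the split above is sequential: the first 80% of the lines become the train set. If the records are ordered in some way (for example by date), a shuffled split may be preferable. Here is a minimal sketch of that variant, assuming each line of the file is a standalone JSON document. The helper names split_json_lines and write_json_lines, the seed parameter, and the in-memory sample are made up for illustration.

```python
import json
import random


def split_json_lines(lines, train_ratio=0.8, seed=0):
    """Parse line-delimited JSON, shuffle the records, and split them."""
    # Assumption: every non-empty line is one complete JSON document.
    records = [json.loads(line) for line in lines if line.strip()]
    # A fixed seed keeps the shuffle reproducible between runs.
    random.Random(seed).shuffle(records)
    n_train = int(len(records) * train_ratio)
    return records[:n_train], records[n_train:]


def write_json_lines(records, fname):
    """Write the records back out, one JSON document per line."""
    with open(fname, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Tiny in-memory sample standing in for the exported file.
sample = ['{"_id": %d, "title": "post %d"}' % (i, i) for i in range(10)]
train, test = split_json_lines(sample)
print(len(train), len(test))
```

For the real file you would pass the open file object as lines and then call write_json_lines for each part. Keep in mind that json.dumps() may change whitespace compared to the original lines, although the content of each record stays the same.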

skymon