How to open .ndjson file in Python?

Question

I have a 20 GB .ndjson file that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool. This is the tool: https://pinetools.com/split-files

Now I get a file with the extension .ndjson.000 (and I do not know what that is).

I'm trying to open it as JSON or as a CSV file to read it into pandas, but it does not work. Do you have any idea how to solve this?

import json
import pandas as pd

First approach:

df = pd.read_json('dump.ndjson.000', lines=True)

Error: ValueError: Unmatched ''"' when when decoding 'string'

Second approach:

with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()

print(my_data)

Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)

I think the problem is that my file contains some emojis, and I do not know how to handle their encoding.

score 10 · Answer 1 · answered Nov 23 '21 at 11:45

NDJSON is now supported out of the box by pandas through the lines=True argument:

import pandas as pd

df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)  # lines=True requires orient='records'
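
For a file as large as the 20 GB one in the question, loading everything in one call may not fit in memory. A minimal sketch of reading in chunks follows, assuming a pandas version whose read_json accepts chunksize together with lines=True. The helper name iter_ndjson_frames and the default chunk size are made up for illustration.

```python
import pandas as pd


def iter_ndjson_frames(path, rows_per_chunk=100_000):
    """Yield DataFrames holding at most rows_per_chunk records each."""
    # With lines=True and chunksize set, read_json returns an iterator
    # of DataFrames instead of parsing the whole file at once.
    reader = pd.read_json(path, lines=True, chunksize=rows_per_chunk)
    for frame in reader:
        yield frame
```

Each yielded frame can be filtered or aggregated and then discarded, so only one chunk needs to be held in memory at a time.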

Banane

score 5 · Answer 2 · answered Aug 20 '20 at 08:37

I think pandas.read_json cannot handle NDJSON correctly.

According to this issue, you can do something like this to read it:

import ujson as json
import pandas as pd

# Parse one JSON record per line; the file is closed once the DataFrame is built.
with open('/path/to/records.ndjson', encoding='utf-8') as f:
    df = pd.DataFrame.from_records(map(json.loads, f))

P.S.: All credit for this code goes to KristianHolsheimer from the GitHub issue.
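
If some lines fail to parse, a tolerant reader can show which ones. The sketch below uses only the standard json module and records the line numbers it could not decode. It rests on an assumption that the thread does not confirm: that the tool split the file by byte count, so a chunk can end (or begin) in the middle of a record. The function name read_ndjson_lenient is hypothetical.

```python
import json


def read_ndjson_lenient(path):
    """Return (records, bad_line_numbers) for a newline-delimited JSON file."""
    records, bad_lines = [], []
    # UTF-8 covers emoji; errors="replace" keeps reading even if a
    # multi-byte character was cut in half at a chunk boundary.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # For example, a record truncated where the file was split.
                bad_lines.append(line_number)
    return records, bad_lines
```

If only the first or last line of a chunk shows up in bad_line_numbers, that would be consistent with the splitting assumption rather than with an encoding problem.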

Shogoki

In this line: df = pd.DataFrame.from_records(records), I'm getting this error: ValueError: Unmatched ''"' when when decoding 'string' – taga Aug 20 '20 at 09:00
Is there any encoding that I need to add? Maybe I have some emojis or special characters in my file. – taga Aug 20 '20 at 09:00
One more question about this: is there a way to add some kind of progress bar while I'm loading this file? It is very big, and I want to know how much time is left or how much of the file has been loaded. – taga Aug 21 '20 at 09:18
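
One way to track progress with only the standard library is to compare the bytes consumed so far with the file size. This is a rough sketch, not a full progress bar; the generator name iter_records_with_progress is invented for illustration, and it assumes every non-blank line is a complete JSON record.

```python
import json
import os


def iter_records_with_progress(path):
    """Yield (fraction_of_file_read, record) for each non-blank line."""
    total_bytes = os.path.getsize(path)
    bytes_read = 0
    # Binary mode, so len(raw) counts bytes and matches the file size.
    with open(path, "rb") as f:
        for raw in f:
            bytes_read += len(raw)
            if raw.strip():
                # json.loads accepts UTF-8 encoded bytes directly.
                yield bytes_read / total_bytes, json.loads(raw)
```

A caller could print the fraction every few thousand records, for example with print(f"{fraction:.1%}"), and estimate the remaining time from the elapsed time.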

Ofer Rahat · Answer 3 · 2022-10-19T10:50:22.433

2

NDJSON (newline-delimited JSON) is a JSON Lines format: each line is a complete JSON document. It is well suited to datasets that lack a rigid structure ('non-SQL') and are large enough to warrant splitting across multiple files.

You can use pandas:

import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)

As long as each record sits on a single line (valid JSON escapes a newline inside a string as \n, so a raw line break cannot occur within a string), you can alternatively use:

import json
with open("dump.ndjson.000") as f:
    data = [json.loads(line) for line in f]  # iterate the file lazily, line by line

edited Oct 19 '22 at 10:50

answered Sep 12 '22 at 09:21

Ofer Rahat


What if a JSON string contains a newline? This will break the data. – Fusion Oct 05 '22 at 12:41
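
On the newline concern: a standard JSON encoder writes a line break inside a string as the two characters \ and n, so the record still occupies one physical line. The small round trip below illustrates this with Python's json module. The helper names to_ndjson and from_ndjson are hypothetical, and the sketch assumes the file was written one compact record per line, not pretty-printed.

```python
import json


def to_ndjson(records):
    """Serialize records to NDJSON text, one compact JSON document per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_ndjson(text):
    """Parse NDJSON text back into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = [{"text": "first line\nsecond line"}, {"text": "plain"}]
encoded = to_ndjson(records)

# Each record stays on one physical line because the embedded newline
# is escaped by json.dumps.
assert len(encoded.splitlines()) == len(records)
assert from_ndjson(encoded) == records
```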
