2

I am trying to load my JSONL file in Python. I am using the following code and getting the error shown below.

with open("mli_train_v1.jsonl", 'r', encoding='utf-8') as f:
    data = json.loads(f)

It raises this error:

TypeError: the JSON object must be str, bytes or bytearray, not 'TextIOWrapper'

So I tried this instead:

with open("mli_train_v1.jsonl", 'r') as f:
    data = json.load(f)

and now I am getting this error:

JSONDecodeError: Extra data: line 2 column 1 (char 835)

My JSONL file looks like this:

{"sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.", "pairID": "23eb94b8-66c7-11e7-a8dc-f45c89b91419", "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))", "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )", "sentence2": " Patient has elevated Cr", "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ elevated) (NN Cr)))))", "sentence2_binary_parse": "( Patient ( has ( elevated Cr ) ) )", "gold_label": "entailment"}
{"sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.", "pairID": "23eb979c-66c7-11e7-b76c-f45c89b91419", "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))", "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )", "sentence2": " Patient has normal Cr", "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ normal) (NN Cr)))))", "sentence2_binary_parse": "( Patient ( has ( normal Cr ) ) )", "gold_label": "contradiction"}
{"sentence1": "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4.", "pairID": "23eb9986-66c7-11e7-9ef9-f45c89b91419", "sentence1_parse": "(ROOT (S (NP (NNPS Labs)) (VP (VBD were) (ADJP (JJ notable) (PP (IN for) (NP (NP (NP (NN Cr) (CD 1.7)) (PRN (-LRB- -LRB-) (NP (NP (NN baseline) (CD 0.5)) (PP (IN per) (NP (JJ old) (NNS records)))) (-RRB- -RRB-))) (CC and) (NP (NN lactate) (CD 2.4)))))) (. .)))", "sentence1_binary_parse": "( Labs ( ( were ( notable ( for ( ( ( ( Cr 1.7 ) ( -LRB- ( ( ( baseline 0.5 ) ( per ( old records ) ) ) -RRB- ) ) ) and ) ( lactate 2.4 ) ) ) ) ) . ) )", "sentence2": " Patient has elevated BUN", "sentence2_parse": "(ROOT (S (NP (NN Patient)) (VP (VBZ has) (NP (JJ elevated) (NN BUN)))))", "sentence2_binary_parse": "( Patient ( has ( elevated BUN ) ) )", "gold_label": "neutral"}
Aakash aggarwal
  • Your file doesn't contain a single root JSON object, which is what `json.load` is designed to read. – jonrsharpe Feb 03 '19 at 09:16
  • 3
    Possible duplicate of [multiple Json objects in one file extract by python](https://stackoverflow.com/questions/27907633/multiple-json-objects-in-one-file-extract-by-python) – jonrsharpe Feb 03 '19 at 09:17
  • 1
    Aren't you meant to use `json.load(f)` in your first example? `loads()` requires a string, not a file handle, so the second thing you tried makes sense. The problem there is that your file contains multiple JSON objects, so you either need to do `for line in f: json.loads(line)` or split those lines into multiple files and load them one by one. – Torxed Feb 03 '19 at 09:23

2 Answers

6

A JSONL file holds one JSON document per line, so to read it you have to go through the file line by line and parse each line separately.

import json

data = []
with open("mli_train_v1.jsonl", 'r', encoding='utf-8') as f:
    # Each line is a separate JSON document, so parse line by line.
    for line in f:
        data.append(json.loads(line))
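
The loop above assumes every line holds valid JSON. As a minimal sketch (the helper name `read_jsonl` and the blank-line handling are assumptions, not part of the original answer), the same idea can be wrapped in a generator that skips empty lines and reports which line failed to parse:

```python
import json


def read_jsonl(path, encoding="utf-8"):
    """Yield one parsed object per non-empty line of a JSONL file.

    Hypothetical helper for illustration; the name is not from any library.
    """
    with open(path, "r", encoding=encoding) as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                # Skip blank lines, e.g. an empty line at the end of the file.
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Re-raise with the file's line number so bad records are easy to find.
                raise ValueError(f"{path}, line {line_number}: {exc}") from exc
```

It would be used as `data = list(read_jsonl("mli_train_v1.jsonl"))`, or iterated directly if the file is too large to hold in memory.
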
Dan D.
  • Could be written as `data = [json.loads(line) for line in open("mli_train_v1.jsonl", 'r', encoding='utf-8')]`. – RemcoGerlich Feb 03 '19 at 13:03
  • @RemcoGerlich That loses the file closing that `with` gives you and would need to be rewritten if any statements had to be added. – Dan D. Feb 03 '19 at 13:08
  • `data = [json.loads(line) for line in f]` then, or `data = map(json.loads, f)` if you want an iterator. Main point was that initializing a list and appending to it from a loop is better written as a list comprehension. – RemcoGerlich Feb 03 '19 at 13:18
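
One caveat with the `map` variant from the comments: `map` is lazy, so it has to be consumed while the file is still open. A small sketch of the difference, using a throwaway temporary file (its name and contents are made up for the demo):

```python
import json
import os
import tempfile

# Write a tiny two-line JSONL file to demonstrate with (hypothetical sample data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"gold_label": "entailment"}\n{"gold_label": "neutral"}\n')
    path = tmp.name

# List comprehension: every line is parsed before the with block closes the file.
with open(path, "r", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

# map() is lazy: nothing is read until the iterator is consumed.
with open(path, "r", encoding="utf-8") as f:
    lazy = map(json.loads, f)

try:
    list(lazy)  # the file is already closed at this point
    lazy_failed = False
except ValueError:
    # Reading from a closed file raises ValueError.
    lazy_failed = True

os.remove(path)
```

So with `map`, either wrap it in `list(...)` inside the `with` block or do all the processing before the block ends.
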
0

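The following probably solves your issue: it rewrites the file as a single JSON array, which `json.load` can then read. Note that it overwrites the original file.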

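import json

path = 'path/to/your/file'

# Read the JSONL file and keep only the non-empty lines (one JSON object each).
with open(path, encoding='utf-8') as f:
    lines = [line.strip() for line in f if line.strip()]

# Join the objects with commas and wrap them in brackets to form a JSON array.
# (Replacing every '}' with '},' would also hit nested objects and braces inside strings.)
contents = '[' + ','.join(lines) + ']'

# This overwrites the original file with the JSON array.
with open(path, 'w', encoding='utf-8') as f:
    f.write(contents)

with open(path, encoding='utf-8') as f:
    json_contents = json.load(f)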
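
If overwriting the input file is not desirable, the same convert-to-array idea can be done in memory. This is only a sketch, and it assumes each record sits on its own line (the function name `jsonl_text_to_list` is made up for illustration):

```python
import json


def jsonl_text_to_list(text):
    """Parse JSONL text by turning it into one JSON array string.

    Hypothetical helper; assumes one JSON object per line.
    """
    # Keep non-empty lines only, so a trailing newline does not add an empty element.
    records = [line.strip() for line in text.split("\n") if line.strip()]
    # Commas between records plus surrounding brackets give a valid JSON array.
    return json.loads("[" + ",".join(records) + "]")


sample = '{"gold_label": "entailment"}\n{"gold_label": "contradiction"}\n'
labels = [row["gold_label"] for row in jsonl_text_to_list(sample)]
```

For a real file, the text would come from `f.read()` inside a `with open(...)` block, and the file on disk stays untouched.
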
abdullah.cu