
I am building an ML classifier. My dataset is split across 6 .jsonl files, each of which is more than 1.6 GB. At first I tried the following code:

import pandas as pd
data = pd.read_json("train_features_0.jsonl")

This gave me the error "ValueError: Trailing data".

So I used the "chunksize" and "lines" arguments of "read_json":

import pandas as pd
data = pd.read_json("train_features_0.jsonl", chunksize=100, lines=True)

This just gives me "pandas.io.json.json.JsonReader at 0x136bce302b0" instead of a DataFrame.
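
From the docs, I gather that passing "chunksize" makes "read_json" return an iterator of DataFrames rather than a single DataFrame, so it has to be looped over, roughly like this (a minimal sketch based on my understanding):

import pandas as pd

reader = pd.read_json("train_features_0.jsonl", lines=True, chunksize=100)
for chunk in reader:
    # each chunk is a DataFrame with up to 100 rows
    print(chunk.shape)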

The dataset consists of: train_features_0.jsonl, train_features_1.jsonl, train_features_2.jsonl, train_features_3.jsonl, train_features_4.jsonl, train_features_5.jsonl.

So my question is: how can I use all of those .jsonl files to train my classifier?
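
One idea I had (just a rough sketch; whether this is the right approach is exactly my question) is to loop over all six files chunk by chunk, so that nothing has to fit in memory at once:

import pandas as pd

files = [f"train_features_{i}.jsonl" for i in range(6)]

for path in files:
    for chunk in pd.read_json(path, lines=True, chunksize=10000):
        # train the classifier incrementally on each chunk here,
        # e.g. with a model that supports partial fitting
        ...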

Another question: how can I use only specific "name:value" pairs while training my classifier? I mean, can I drop some name:value pairs to speed up the training process?
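
For example, would something like this be a reasonable way to keep only certain fields? (The column names "feature_a", "feature_b" and "label" are just placeholders, not my real field names.)

import pandas as pd

keep = ["feature_a", "feature_b", "label"]  # placeholder column names

for chunk in pd.read_json("train_features_0.jsonl", lines=True, chunksize=10000):
    chunk = chunk[keep]  # keep only the fields needed for training
    # alternatively: chunk = chunk.drop(columns=["unwanted_field"])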

Please pardon me, I am new to ML.

  • You can try the solution mentioned in the answer: https://stackoverflow.com/questions/46790390/how-to-read-a-large-json-in-pandas – stark Dec 08 '19 at 12:20
  • @stark9190 I have taken a look at the link you provided. Yes, each ".jsonl" file contains multiple JSON objects, which is why I used "lines=True". If I pass only "file_path" and "lines", without "chunksize", as pd.read_json("train_features_0.jsonl", lines=True), then my system goes down. – Shamindra Parui Dec 08 '19 at 13:41
  • this library might be helpful https://github.com/openlegaldata/legal-ner – BBK Feb 13 '22 at 22:06

0 Answers