
I have a file, file1.json, whose contents look like this (one dict per line):

{"a":1,"b":2}
{"c":3,"d":4}
{"e":9,"f":6}
.
.
.
{"u":31,"v":23}
{"w":87,"x":46}
{"y":98,"z":68}

I want to load this file into a pandas DataFrame, so this is what I did:

df = pd.read_json('../Dataset/file1.json', orient='columns', lines=True, chunksize=10)

But instead of returning a DataFrame, this returns a JsonReader.

[IN]: df
[OUT]: <pandas.io.json.json.JsonReader at 0x7f873465bd30>

Is this normal, or am I doing something wrong? And if this is how read_json() is supposed to behave when a single JSON file contains multiple dictionaries, one per line and not comma-separated, how can I best fit them into a DataFrame?

EDIT: If I remove the chunksize parameter from read_json(), this is what I get:

[IN]: df = pd.read_json('../Dataset/file1.json', orient='columns', lines=True)
[OUT]: ValueError: Expected object or value
Aman Singh
  • that's what `chunksize` does. see the doc: http://pandas.pydata.org/pandas-docs/stable/io.html#io-jsonl – njzk2 May 17 '18 at 05:33
  • The thing is, if I don't add the chunksize parameter, it gives an error, `ValueError: Expected object or value`. It also doesn't recognize the file as a valid JSON object, since each dictionary is separated by a newline character. – Aman Singh May 17 '18 at 05:37
  • @AmanSingh It sounds like the problem with your other attempt is that you didn't use `lines=True`, so you were telling it that you had a single JSON text rather than a file full of line-delimited JSON texts, which isn't true, so it gives you an error. But if that's not it, create a new question. – abarnert May 17 '18 at 05:47
  • @AmanSingh - is the data confidential? – jezrael May 17 '18 at 05:53
  • The problem does not happen with your sample input. If it happens with your real input, you have to figure out how to give us sample input that causes the same error, or we can't help you. But as I already told you, create a new question for a new problem, don't try to edit all of your problems into one question. – abarnert May 17 '18 at 06:00
  • Actually, the problem _does_ happen with your sample input if I leave those `.` lines in. Are those actually in your real file? – abarnert May 17 '18 at 06:18
  • @abarnert Yes, the data is confidential. I'll try creating a new sample input dataset for a new question. And no, the `...` lines aren't present in the actual dataset; I added them to indicate that there are many more such records in between. – Aman Singh May 17 '18 at 07:37

1 Answer

As the docs explain, this is exactly the point of the chunksize parameter:

chunksize: integer, default None

Return JsonReader object for iteration. See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.

The linked docs say:

For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can be useful for large files or to read from a stream.

… and then give an example of how to use it.

If you don't want that, why are you passing chunksize? Just leave it out.
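To make the difference concrete, here is a minimal sketch of both call styles. The `sample` string is a hypothetical stand-in for file1.json, and `io.StringIO` is used only so the example runs without a file on disk; with a real file you would pass the path instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for file1.json: one JSON object per line.
sample = '{"a":1,"b":2}\n{"c":3,"d":4}\n{"e":9,"f":6}\n'

# Without chunksize: the whole input is parsed at once into a DataFrame.
df_all = pd.read_json(io.StringIO(sample), lines=True)

# With chunksize: read_json returns a JsonReader, and each iteration
# yields a DataFrame holding up to `chunksize` lines.
reader = pd.read_json(io.StringIO(sample), lines=True, chunksize=2)
chunks = [chunk for chunk in reader]

# One way to reassemble the pieces into a single DataFrame.
df_chunked = pd.concat(chunks, ignore_index=True)

print(type(df_all).__name__)
print(len(chunks))
print(df_chunked.shape)
```

Because the sample lines use different keys, both frames contain NaN wherever a record lacks a column. For a file that fits in memory, dropping `chunksize` is the simpler of the two options.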

abarnert
  • The thing is, if I don't add the chunksize parameter, it gives an error, `ValueError: Expected object or value`. – Aman Singh May 17 '18 at 05:34
  • @AmanSingh Then you have another error, and the `chunksize` was just masking it—you don't actually read anything, and therefore don't see the other error, until you `for chunk in reader:` or similar. – abarnert May 17 '18 at 05:35
  • @AmanSingh `lines=True` makes it read JSON lines instead of a single JSON text. `chunksize=10` makes it _also_ give you a reader object that reads chunks of 10 lines at a time instead of the whole file. Just throwing random arguments at it until it seems to work isn't going to get you anywhere; read the docs. – abarnert May 17 '18 at 05:43
  • @AmanSingh Meanwhile, if you need help debugging the other problem this one was masking, create a new question with a [mcve] for that question—the code that uses `lines` but not `chunksize`, sample input (ideally something we can copy and paste without removing the `...` lines in the middle), and the traceback—and you should get an answer to that one as well. – abarnert May 17 '18 at 05:46
  • @abarnert I have added the output I get when I try to read the whole file into memory. You could well be right that the issue is something else, but I'm not able to identify it. Please check the edit. – Aman Singh May 17 '18 at 05:49
  • My answer is OK; `df = pd.concat(df)` works nicely on my sample. Did you test it? – jezrael May 17 '18 at 05:51
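
The point made in the comments, that `chunksize` defers parsing and so can hide a malformed line until you iterate, can be sketched roughly as follows. The `bad_sample` string is a hypothetical input that keeps one of the `.` placeholder lines from the question; the exact error message may differ between pandas versions, so only the exception type is checked.

```python
import io

import pandas as pd

# Hypothetical input with a "." placeholder line left between two records.
bad_sample = '{"a":1,"b":2}\n.\n{"c":3,"d":4}\n'

# Creating the reader parses nothing yet, so no error is raised here.
reader = pd.read_json(io.StringIO(bad_sample), lines=True, chunksize=10)
created_without_error = True

# The malformed line only surfaces once a chunk is actually read.
try:
    for chunk in reader:
        pass
    parse_failed = False
except ValueError as exc:
    parse_failed = True
    print("ValueError while iterating:", exc)
```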