
I've pulled data from Twitter. The data is currently spread across multiple files, and I have not been able to merge it into a single file.

Note: all files are in JSON format.

The code I have used is here and here.

It has been suggested that I use glob to collect the JSON files.

I wrote this code based on some tutorials about merging JSON files with Python:

from glob import glob
import json
import pandas as pd

with open('Desktop/json/finalmerge.json', 'w') as f:
    for fname in glob('Desktop/json/*.json'):  # Reads every .json file in Desktop/json
        with open(fname) as j:
            f.write(j.read())
            f.write('\n')

The files merged successfully, and the result is finalmerge.json.

Then I used this, as suggested in several threads:

df_lines = pd.read_json('finalmerge.json', lines=True)
df_lines
1000000 rows × 23 columns

What should I do next to put each feature in a separate column?

I'm not sure what is wrong with the JSON files. I checked the merged file and found that it is not valid JSON. What should I do to load it as a data frame?

The reason I am asking is that I have very basic Python knowledge, and all the answers to similar questions that I have found are far more complicated than I can understand. Please help this new Python user convert multiple JSON files into one JSON file.

ggorlen
ML Moh

1 Answer


I think the problem is that your files are not really JSON (or rather, they are structured as JSONL, with one JSON object per line). You have two ways of proceeding:

  1. you could read every file as a text file and merge them line by line
  2. you could convert them to JSON (add an opening square bracket at the beginning of the file, a comma after every JSON element except the last, and a closing square bracket at the end).
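
As a rough illustration of the second option, here is a minimal sketch that rewrites a JSONL file as one JSON array. It assumes every non-empty line of the input is a complete JSON element; the function name `jsonl_to_json_array` and both path arguments are made up for this example.

```python
import json

def jsonl_to_json_array(jsonl_path, json_path):
    """Rewrite a JSON Lines file as one valid JSON array, streaming line by line."""
    with open(jsonl_path, encoding="utf-8") as src, \
         open(json_path, "w", encoding="utf-8") as dst:
        dst.write("[\n")
        first = True
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            json.loads(line)  # raises json.JSONDecodeError if the line is not valid JSON
            if not first:
                dst.write(",\n")  # comma between elements, none after the last
            dst.write(line)
            first = False
        dst.write("\n]\n")
```

Only one line is held in memory at a time, so this should cope with a large merged file, although loading the resulting array with `json.load` would still read everything at once.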

Try following this question and let me know if it solves your problem: Loading JSONL file as JSON objects
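
To give an idea of what loading JSONL as objects can look like, and how nested fields could end up in separate columns, here is a small sketch using only the standard library. It assumes each non-empty line is a JSON object (a dict); the helper names `flatten` and `load_jsonl_flat` and the dotted key style are choices made for this example, not something taken from your files.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Turn nested dicts into one flat dict with dotted keys, e.g. user.name."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists and scalars are kept as they are
    return flat

def load_jsonl_flat(path):
    """Read a JSON Lines file and return a list of flattened records."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # ignore blank lines
                records.append(flatten(json.loads(line)))
    return records
```

Passing the returned list to `pd.DataFrame(...)` should give one column per flattened key. pandas also provides `pd.json_normalize`, which does a similar flattening on a list of dicts.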

You can also try to edit your code this way:

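from glob import glob

# Write the merged file outside Desktop/json so the glob below does not pick it up
with open('finalmerge.json', 'w') as f:
    for fname in glob('Desktop/json/*.json'):
        with open(fname) as j:
            f.write(j.read())  # copy the file's text unchanged
            f.write('\n')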

If each input file holds one JSON element per line, every line of the merged file will be a separate JSON element.
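
If you are not sure whether each input file is a JSON array, a single JSON object, or JSONL, a slightly more defensive merge could parse every file first and write exactly one record per line. This is only a sketch under that assumption; `iter_records` and `merge_json_files` are hypothetical names, and each input file is read into memory whole, so it suits many moderate-sized files rather than one huge one.

```python
import json
import os
from glob import glob

def iter_records(path):
    """Yield the JSON values stored in one file (array, single value, or JSON Lines)."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        data = json.loads(text)  # the whole file is one valid JSON document
    except json.JSONDecodeError:
        # Otherwise assume JSON Lines: one JSON value per non-empty line
        for line in text.split("\n"):
            if line.strip():
                yield json.loads(line)
        return
    if isinstance(data, list):
        yield from data  # an array contributes one record per element
    else:
        yield data

def merge_json_files(pattern, out_path):
    """Write every record from the files matching pattern to out_path, one per line."""
    out_abs = os.path.abspath(out_path)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for fname in sorted(glob(pattern)):
            if os.path.abspath(fname) == out_abs:
                continue  # never read the output file back into itself
            for record in iter_records(fname):
                out.write(json.dumps(record) + "\n")
                count += 1
    return count
```

For example, `merge_json_files('Desktop/json/*.json', 'finalmerge.json')` would return the number of records written. A file that is neither valid JSON nor valid JSONL raises `json.JSONDecodeError`, which at least tells you that a file needs attention.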

  • It works, but only with one file. How can I make it work for multiple files at the same time? – ML Moh Oct 19 '20 at 17:31
  • I think the easiest way would be something like this: from glob import glob with open('finalmerge.json', 'w') as f: for fname in glob('Desktop/json/*.json'): # Reads all json from the current directory with open(fname) as j: f.write(str(j.read())) f.write('\n') I will write this in my main answer – emanuele_maruzzi Oct 19 '20 at 17:35
  • Thank you, the merged file was generated, but it is really large. If I want to load it into pandas as a data frame, is there a specific way to do it correctly, since I don't know this JSONL format? Second, how do I validate the JSON file? Given its size, is there a way to double-check that pandas will not reject the file? – ML Moh Oct 19 '20 at 17:53
  • In the pandas documentation you can find the lines option for reading a JSON file: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html The easiest way to check whether the generated file is OK is to scroll through it quickly and look for any odd lines – emanuele_maruzzi Oct 19 '20 at 18:24
  • I realized that the merged file is not a valid JSON file. I will add this to my main question. – ML Moh Oct 20 '20 at 00:14
  • Can't you use this line of code to load your data into a data frame? df = pd.read_json('finalmerge.json', lines=True) – emanuele_maruzzi Oct 20 '20 at 05:38
  • Thank you for your comment. I don't think I understand the code you added. What does lines=True do? – ML Moh Oct 20 '20 at 18:48
  • lines=True is a parameter passed to the pandas read_json function to let it know that the file is a JSONL file. I haven't checked whether json_normalize has the same parameter available – emanuele_maruzzi Oct 21 '20 at 06:13
  • Thank you. It put all the data in one line; is that correct? I added the result to the main question – ML Moh Oct 21 '20 at 16:57
  • Can you show me the file? Just to be sure how the files were merged. I will try to help you as much as I can :) – emanuele_maruzzi Oct 21 '20 at 19:30
  • df_lines = pd.read_json('finalmerge.json', lines=True) df_lines gives 1000000 rows × 23 columns. What should I do next to put each feature in a separate column? I would like to share the file, but because of its size I am not able to open it (out of memory) – ML Moh Oct 22 '20 at 13:33
  • Can you just share the first 30 lines of it? If you managed to create the DataFrame, then you should be able to use every column as a feature – emanuele_maruzzi Oct 23 '20 at 17:35
  • I created the data frame, but it was full of NaN. Then I checked several files and found that the problem is that the JSON files need to be valid before I merge them; please correct me if I'm wrong. Is setting lines=True not the reason for the NaN values? My concern is that my files are not actually JSON, which makes me feel hopeless. When you said 30 lines, do you mean the results or the file after the merge? – ML Moh Oct 23 '20 at 19:06
  • Yes, just so I can try it on my pc. Thanks – emanuele_maruzzi Oct 24 '20 at 07:45
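
Two points from the comments above, checking a merged file that is too large to open and extracting its first 30 lines to share, can both be done by reading the file one line at a time. The following is a minimal sketch, assuming the merged file is meant to hold one JSON element per line; `head` and `find_invalid_lines` are names invented for this example.

```python
import json
from itertools import islice

def head(path, n=30):
    """Return the first n lines of a file without reading the rest."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

def find_invalid_lines(path, max_errors=10):
    """Return (line_number, error_message) for lines that are not valid JSON."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are ignored
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((number, exc.msg))
                if len(errors) >= max_errors:
                    break  # stop early so a badly broken file does not flood the report
    return errors
```

An empty list from `find_invalid_lines('finalmerge.json')` would suggest that every line parses on its own, which is what `pd.read_json(..., lines=True)` expects. It does not guarantee that all records share the same keys, so NaN values can still appear for fields that some records lack.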