
I would like to know how to read several json files from a single folder (without specifying the files names, just that they are json files).

Also, is it possible to turn them into a pandas DataFrame?

Can you give me a basic example?

Ami Tavory
donpresente

9 Answers


One option is listing all files in a directory with os.listdir and then finding only those that end in '.json':

import os, json
import pandas as pd

path_to_json = 'somedir/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files)  # for me this prints ['foo.json']

Now you can use pandas DataFrame.from_dict to read the json (a Python dictionary at this point) into a pandas DataFrame:

montreal_json = pd.DataFrame.from_dict(many_jsons[0])
print(montreal_json['features'][0]['geometry'])

Prints:

{'type': 'Point', 'coordinates': [-73.6051013, 45.5115944]}

In this case I had appended several parsed JSON files to a list many_jsons. The first JSON in my list is actually a GeoJSON with some geo data on Montreal. Since I'm familiar with the content already, I print out the 'geometry', which gives me the lon/lat of Montreal.
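The answer doesn't show how many_jsons was filled. A minimal sketch, assuming every .json file in the directory parses cleanly (load_jsons is a hypothetical helper name, not part of the original answer):

```python
import json
import os

def load_jsons(path_to_json):
    """Parse every .json file in a directory and return the results as a list."""
    many_jsons = []
    for file_name in sorted(os.listdir(path_to_json)):
        if file_name.endswith('.json'):
            with open(os.path.join(path_to_json, file_name)) as f:
                many_jsons.append(json.load(f))
    return many_jsons
```

Sorting the filenames makes the list order deterministic, which matters if you later index into it as `many_jsons[0]` does above.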

The following code sums up everything above:

import os, json
import pandas as pd

# this finds our json files
path_to_json = 'json/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]

# here I define my pandas DataFrame with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['country', 'city', 'long/lat'])

# we need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)

        # here you need to know the layout of your json and each json has to have
        # the same structure (obviously not the structure I have here)
        country = json_text['features'][0]['properties']['country']
        city = json_text['features'][0]['properties']['name']
        lonlat = json_text['features'][0]['geometry']['coordinates']
        # here I push a list of data into a pandas DataFrame at row given by 'index'
        jsons_data.loc[index] = [country, city, lonlat]

# now that we have the pertinent json data in our DataFrame let's look at it
print(jsons_data)

for me this prints:

  country           city                   long/lat
0  Canada  Montreal city  [-73.6051013, 45.5115944]
1  Canada        Toronto  [-79.3849008, 43.6529206]

It may be helpful to know that for this code I had two geojsons in a directory named 'json'. Each json had the following structure:

{
  "features": [
    {
      "properties": {
        "osm_key": "boundary",
        "extent": [-73.9729016, 45.7047897, -73.4734865, 45.4100756],
        "name": "Montreal city",
        "state": "Quebec",
        "osm_id": 1634158,
        "osm_type": "R",
        "osm_value": "administrative",
        "country": "Canada"
      },
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [-73.6051013, 45.5115944]
      }
    }
  ],
  "type": "FeatureCollection"
}
Scott
  • Really helpful! Instead of printing, my idea was to save all of them into one pandas DataFrame. What would be the correct code? Create an empty DataFrame and begin adding rows to it? Thanks @Scott for this detailed answer! – donpresente May 30 '15 at 08:43
  • @donpresente Good question. I'll post an edit to address how to get some desired data from a json and then push this data into a pandas DataFrame, row by row. – Scott May 30 '15 at 16:41
  • `import os, json` does not work on Colab. – Julien Aug 05 '22 at 13:41
  • How to read it into a list instead of a DataFrame? – Julien Aug 05 '22 at 13:49

Iterating over a (flat) directory is easy with the glob module:

from glob import glob

for f_name in glob('foo/*.json'):
    ...

As for reading JSON directly into pandas, see here.
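Combining the two ideas, a minimal sketch that globs a directory and concatenates each file into one DataFrame (assuming every file is something `pd.read_json` accepts directly, e.g. a list of records; `read_json_dir` is a hypothetical helper name):

```python
import pandas as pd
from glob import glob

def read_json_dir(pattern):
    """Read every JSON file matching the glob pattern into one DataFrame."""
    frames = [pd.read_json(f_name) for f_name in sorted(glob(pattern))]
    return pd.concat(frames, ignore_index=True)
```

`ignore_index=True` renumbers the rows 0..n-1 instead of repeating each file's own index.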

Ami Tavory

Loads all files that end with *.json from a specific directory into a dict:

import os, json

path_to_json = '/lala/'

for file_name in [file for file in os.listdir(path_to_json) if file.endswith('.json')]:
    with open(os.path.join(path_to_json, file_name)) as json_file:
        data = json.load(json_file)
        print(data)

Try it yourself: https://repl.it/@SmaMa/loadjsonfilesfromfolderintodict
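Note that the loop above rebinds `data` on every iteration, so after the loop only the last file's contents remain. If you want one dict covering the whole folder, one option is keying by filename (an arbitrary choice for this sketch; `load_json_dir_as_dict` is a hypothetical helper name):

```python
import json
import os

def load_json_dir_as_dict(path_to_json):
    """Map each .json filename in a directory to its parsed contents."""
    all_data = {}
    for file_name in os.listdir(path_to_json):
        if file_name.endswith('.json'):
            with open(os.path.join(path_to_json, file_name)) as json_file:
                all_data[file_name] = json.load(json_file)
    return all_data
```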

Sma Ma

To read the json files,

import os
import glob
import json

contents = []
json_dir_name = '/path/to/json/dir'

json_pattern = os.path.join(json_dir_name, '*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
    with open(file) as json_file:
        contents.append(json.load(json_file))
Max Naude
Saravana Kumar

I am using glob with pandas. Check out the code below. Note that lines=True expects JSON Lines files (one JSON object per line); drop it when each file holds a single regular JSON document.

import pandas as pd
from glob import glob

df = pd.concat([pd.read_json(f_name, lines=True) for f_name in glob('foo/*.json')])
Anand Tripathi

If you are turning them into a pandas DataFrame, use the pandas API.

More generally, you can use a generator. Note this assumes each file is in JSON Lines format, i.e. one JSON object per line:

import glob
import json

def data_generator(my_path_regex):
    for filename in glob.glob(my_path_regex):
        for json_line in open(filename, 'r'):
            yield json.loads(json_line)


my_arr = [_json for _json in data_generator('foo/*.json')]
Union find

I feel a solution using pathlib is missing :)

from pathlib import Path

file_list = list(Path("/path/to/json/dir").glob("*.json"))
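To continue the pathlib route into pandas, a sketch assuming each file holds a single flat JSON object (`json_dir_to_df` is a hypothetical helper name):

```python
import json
from pathlib import Path

import pandas as pd

def json_dir_to_df(dir_path):
    """Parse each .json file found by Path.glob and stack the records into a DataFrame."""
    records = [json.loads(p.read_text()) for p in sorted(Path(dir_path).glob("*.json"))]
    return pd.DataFrame(records)
```

Path.read_text saves the explicit open/close, and sorting keeps the row order deterministic across platforms.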
Original BBQ Sauce

A simple and very easy-to-understand answer.

import os
import glob
import pandas as pd

path_to_json = r'\path\here'
# import all files from the folder which end with .json
json_files = glob.glob(os.path.join(path_to_json, '*.json'))

# convert all files to a dataframe
df = pd.concat((pd.read_json(f) for f in json_files))
print(df.head())

One more option is to read it as a PySpark DataFrame and then convert it to a pandas DataFrame (if really necessary; depending on the operation, I'd suggest keeping it as a PySpark DF). Spark natively handles a directory of JSON files as the input path, without the need for libraries to read or iterate over each file:

# pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.json('/some_dir_with_json/*.json')

Next, in order to convert it into a pandas DataFrame, you can do:

df = spark_df.toPandas()
Fernando Wittmann