
I would like to read several YAML files from a directory into pandas DataFrames and concatenate them into one big DataFrame. The directory contains 7470 files.

%%time
import pandas as pd
import glob
import yaml

path = r'../input/cricsheet-a-retrosheet-for-cricket/all' # use your path
all_files = glob.glob(path + "/*.yaml")

li = []

for filename in all_files:
    with open(filename,'r') as fh:
        df = pd.json_normalize(yaml.load(fh.read()))
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
CPU times: user 1h 15min 38s, sys: 8.8 s, total: 1h 15min 47s
Wall time: 1h 16min 44s

The code takes more than an hour to run.

Is there a way to read bulk yaml files into a pandas dataframe more efficiently?

Sample Dataset Zipped

Sample Dataset

Ailurophile
  • you could look into multithreading/processing it – M Z Dec 29 '20 at 05:54
  • are there any Python packages for multithreading? – Ailurophile Dec 29 '20 at 06:01
  • Have you done any profiling to see where the execution time is being used? The breakdown between `pd.json_normalize`, `yaml.load`, `fh.read` and `pd.concat` would be interesting to know. – Bill Dec 29 '20 at 06:06
  • @Pluviophile `threading` and `thread`. You can also look into `multiprocessing`. These are all modules. – M Z Dec 29 '20 at 06:07
  • @Bill No, I'm a beginner and I don't really know how to profile runtime – Ailurophile Dec 29 '20 at 06:16
  • Have you tried concatenating the YAML files before loading? It should be possible with [yq](https://github.com/mikefarah/yq) – Alex Yu Dec 29 '20 at 06:48
  • @AlexYu No, thank you for pointing that out, I will try it – Ailurophile Dec 29 '20 at 07:04
  • @Pluviophile B.t.w. note that `yaml.load` is [no-longer recommended](https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation). Use `yaml.safe_load`. – Bill Dec 29 '20 at 19:31
  • @Pluviophile There are [many options for profiling Python scripts](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) but using the built-in [cProfile](https://docs.python.org/3/library/profile.html) is probably the simplest option. I ran it on your original code and found that 93% of execution time was spent in `yaml.safe_load`. This confirms that you are not going to get anything faster using pyyaml and so looking into multi-threading is a good suggestion. – Bill Dec 29 '20 at 19:32
  • @Bill could you please share the code you ran for profiling – Ailurophile Dec 30 '20 at 05:12
  • You should check out the documentation for cProfile, but I used it from the command line this way: `python -m cProfile -s cumtime myscript.py` where myscript.py is the script you want to profile. The option `-s cumtime` just sorts the results by cumulative time spent in each function. – Bill Dec 30 '20 at 19:18
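
Following up on the profiling suggestion in the comments, cProfile can also be driven from inside a script rather than from the command line; a minimal sketch using only the standard library (the block being profiled is just a placeholder):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... run the yaml-loading loop here ...
profiler.disable()

# print the 20 most expensive calls, sorted by cumulative time
stats = pstats.Stats(profiler).sort_stats('cumulative')
stats.print_stats(20)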

2 Answers


Step 1: convert each YAML file to a JSON file, using multiple processes

import os
import json
import yaml
import pandas as pd
from pandas import json_normalize

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)

# yaml file path
os.chdir(os.path.expanduser('~/Downloads/all'))
yaml_file_list = os.listdir('.')
yaml_file_list = [i for i in yaml_file_list if i.endswith('yaml')]

if not os.path.exists('output'):
    os.mkdir('output')
    
def yaml2json(file, cnt = 1):
    file_json = f'output/{file}.json'
    if os.path.exists(file_json):
        return
    # read yaml and convert to dict
    with open(file,'r') as fh:
        data = yaml.load(fh.read(), Loader=yaml.BaseLoader)
    
    # convert to json file
    data_str = json.dumps(data, ensure_ascii=False) + '\n'
    with open(file_json, 'w') as fw:
        fw.write(data_str)
    logging.info(f'[{cnt}] {file_json}')

file = yaml_file_list[0]
yaml2json(file)

# multi-process conversion of the yaml files to json
from concurrent.futures import ProcessPoolExecutor
####################
workers = 8
pool_list = yaml_file_list
pool_func = yaml2json
####################
total_count = len(pool_list)
with ProcessPoolExecutor(max_workers=workers) as executor:
    futures = [executor.submit(pool_func, param, total_count - n)
               for n, param in enumerate(pool_list)]
  
# 2020-12-29 14:29:19,648 - INFO - [7468] output/1163055.yaml.json   
# 2020-12-29 14:32:07,597 - INFO - [6466] output/640941.yaml.json

# macbook 15' 2015
# 2.2 GHz Intel Core i7
# 16 GB 1600 MHz DDR3
# ~1000 files: 14:29:19,648 -> 14:32:07,597, about 3 min
# ~7400 files: about 25 min
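
A further speed-up may come from PyYAML itself: if it was built with the libyaml C bindings, the C-backed loaders parse the same documents considerably faster than the pure-Python ones. A minimal sketch of swapping the loader (the load_yaml helper is only for illustration):

import yaml

# prefer the libyaml-backed loader when PyYAML was compiled against libyaml
try:
    from yaml import CBaseLoader as FastLoader
except ImportError:
    from yaml import BaseLoader as FastLoader  # pure-Python fallback

def load_yaml(filename):
    with open(filename, 'r') as fh:
        return yaml.load(fh.read(), Loader=FastLoader)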

Step 2: merge the JSON files into one file

os.chdir(os.path.expanduser('~/Downloads/all/output/'))

# merge the files with the bash command cat
# in a notebook cell: !cat *.json > yaml-all-json
# or from IPython:
pycmd = lambda cmd: get_ipython().system(cmd)
cmd = 'cat *.json > yaml-all-json'
# pycmd(cmd)
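
If cat is not available (for example on Windows), the same merge can be done in plain Python; a small sketch, assuming each per-file JSON written in step 1 already ends with a newline:

import glob

# portable alternative to cat: concatenate the per-file JSON lines
with open('yaml-all-json', 'w') as out:
    for json_file in sorted(glob.glob('*.json')):
        with open(json_file, 'r') as f:
            out.write(f.read())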

Step 3: read the merged JSON file into a DataFrame

# read file
# 1478 lines ->  4.37s
file = 'yaml-all-json'
df = pd.read_csv(file, sep='\n', header=None)[0]
obj = df.map(json.loads)
data_list = obj.tolist()
df_data = pd.DataFrame(data_list) # or use json_normalize to parse json data

df_data
#   meta    info    innings
# 0 {'data_version': '0.9', 'created': '2016-12-05...   {'dates': ['2016-11-24', '2016-11-25', '2016-1...   [{'1st innings': {'team': 'South Africa', 'dec...
# 1 {'data_version': '0.9', 'created': '2016-12-21...   {'city': 'Brisbane', 'dates': ['2016-12-15', '...   [{'1st innings': {'team': 'Australia', 'delive...
# 2 {'data_version': '0.9', 'created': '2016-10-21...   {'city': 'Port Moresby', 'dates': ['2016-10-16...   [{'1st innings': {'team': 'Papua New Guinea', ...
# 3 {'data_version': '0.9', 'created': '2016-09-14...   {'city': 'Edinburgh', 'dates': ['2016-09-10'],...   [{'1st innings': {'team': 'Scotland', 'deliver...
# 4 {'data_version': '0.9', 'created': '2016-09-12...   {'city': 'Londonderry', 'dates': ['2016-09-05'...   [{'1st innings': {'team': 'Hong Kong', 'delive..
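
Since yaml-all-json is newline-delimited JSON, pandas can also read it in a single call instead of going through read_csv and json.loads; a sketch of that alternative, which should produce the same nested meta/info/innings columns:

# read newline-delimited JSON directly
df_data = pd.read_json('yaml-all-json', lines=True)
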
Ferris
  • I thought of something similar, although I would try [yq](https://github.com/mikefarah/yq) to merge the yamls. It should be similar to `jq --slurp` – Alex Yu Dec 29 '20 at 06:57

Dask is a great package if you want to avoid getting into the details of parallel computing. It's really designed for distributed computing on machines with many CPUs, but I find the syntax convenient even if you're just using it for multi-threading or multiple processes on one machine.

Here is some code that loads 100 of the YAML files into memory, first without using Dask:

import glob
import yaml

path = r'all'  # local folder
all_files = glob.glob(path + "/*.yaml")

def load_yaml_file(filename):
    with open(filename, 'r') as fh:
        d = yaml.safe_load(fh.read())
    return d

n = 100
results = []
for filename in all_files[:n]:
    d = load_yaml_file(filename)
    results.append(d)
assert len(results) == n

Then, using Dask:

import dask

n = 100
lazy_results = []
for filename in all_files[:n]:
    d = dask.delayed(load_yaml_file)(filename)
    lazy_results.append(d)

results = dask.compute(*lazy_results, scheduler='processes')
assert len(results) == n

I timed both of the above on my machine, which has a quad-core processor, and found that with Dask it took about 19 s (wall time) compared to just over 1 min without (about a 3.1x speed-up).
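
To get from the parsed dicts back to the single DataFrame asked for in the question, the results can then be normalized and concatenated the same way as in the original code; a sketch reusing the results tuple from the Dask run:

import pandas as pd

# flatten each parsed match and stack everything into one DataFrame
frames = [pd.json_normalize(d) for d in results]
frame = pd.concat(frames, axis=0, ignore_index=True)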

Bill