
I realize this has already been addressed here (e.g., Reading csv zipped files in python, How can I parse a YAML file in Python, Retrieving data from a yaml file based on a Python list). Nevertheless, I hope this question is different.

I know how to load a single YAML file into a pandas DataFrame:

import yaml
import pandas as pd

with open(r'1000851.yaml') as file:
    df = pd.json_normalize(yaml.safe_load(file))

df.head()

I would like to read several YAML files from a directory into pandas DataFrames and concatenate them into one big DataFrame, but I have not been able to figure it out.

import yaml
import pandas as pd
import glob

path = r'../input/cricsheet-a-retrosheet-for-cricket/all' # use your path
all_files = glob.glob(path + "/*.yaml")

li = []

for filename in all_files:
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

Error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    268 
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in <genexpr>(.0)
    268 
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

AttributeError: 'str' object has no attribute 'values'

Sample Dataset Zipped

Sample Dataset

Is there a way to do this and read files efficiently?

Ailurophile
    The easiest method will be to define the dataframe first and then simply concat the new yaml file. For that you'd need to loop through your files, read them, convert them to df & concat. This is based on assumption all the files are sharing the same structure. Can you share what error are you getting? What's the problem – Danail Petrov Dec 28 '20 at 16:22
  • @DanailPetrov Shared & Updated the code I used – Ailurophile Dec 28 '20 at 16:29
    It seems like some of the yaml files don't have all the values. You have a few options, I will post as answer for better readability – Danail Petrov Dec 28 '20 at 16:34
  • That would be very helpful – Ailurophile Dec 28 '20 at 16:38
  • I think there is another problem there. Just posted as answer. Check & let me know. – Danail Petrov Dec 28 '20 at 16:43
  • the runtime exceeds more than an hour for 7470 YAML files. Is there a way to load it efficiently?? – Ailurophile Dec 29 '20 at 06:41
  • The way you're loading files is alright, it's just that whether or not you need to load all these files at the same time, in the same data frame, do you need all the information from the yaml, etc, etc.. Which I suppose you'd admit is a whole new story and not quite related to your original question :-) Hope that makes sense. – Danail Petrov Dec 29 '20 at 10:32

1 Answer


It seems the first part of your code and the second part you added are different.

The first part correctly reads the YAML files, but the second part is broken:

for filename in all_files:
    # `filename` here is just a string containing the name of the file. 
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)

The problem is that you are passing the filename string to `yaml.load`, not the file's contents, so PyYAML parses the name itself and returns a plain string, which `json_normalize` then fails on (hence the `'str' object has no attribute 'values'`). Read the files first:

li=[]
# Only loading 3 files:
for filename in all_files[:3]:
    with open(filename,'r') as fh:
        df = pd.json_normalize(yaml.safe_load(fh.read()))
    li.append(df)

len(li)
3

pd.concat(li)

output:
  
                                             innings  meta.data_version meta.created  meta.revision info.city info.competition  ... info.player_of_match                         info.teams info.toss.decision info.toss.winner              info.umpires                           info.venue
0  [{'1st innings': {'team': 'Glamorgan', 'delive...                0.9   2020-09-01              1   Bristol   Vitality Blast  ...          [AG Salter]       [Glamorgan, Gloucestershire]              field  Gloucestershire  [JH Evans, ID Blackwell]                        County Ground
0  [{'1st innings': {'team': 'Pune Warriors', 'de...                0.9   2013-05-19              1      Pune              IPL  ...          [LJ Wright]  [Pune Warriors, Delhi Daredevils]                bat    Pune Warriors    [NJ Llong, SJA Taufel]           Subrata Roy Sahara Stadium
0  [{'1st innings': {'team': 'Botswana', 'deliver...                0.9   2020-08-29              1  Gaborone              NaN  ...       [A Rangaswamy]              [Botswana, St Helena]                bat         Botswana   [R D'Mello, C Thorburn]  Botswana Cricket Association Oval 1

[3 rows x 18 columns]
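
One cosmetic note on the output above: every row keeps index `0` because each per-file frame has a single row. Passing `ignore_index=True` (as the question's own final `pd.concat` call already does) renumbers the result. A tiny sketch with hypothetical stand-in frames:

```python
import pandas as pd

# Three single-row frames standing in for per-file DataFrames
# (hypothetical data, just to show the index behaviour).
li = [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2]}), pd.DataFrame({"a": [3]})]

frame = pd.concat(li, axis=0, ignore_index=True)
# frame.index is now 0, 1, 2 instead of 0, 0, 0
```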
Danail Petrov