I'm very new to Python, and I would appreciate help creating a simple script that reads through a set of .yaml files (about 300 files in the same directory), extracts one section (electives only) from each file, and converts it to CSV.

An example of what is in a .yaml file:

code: 9313
degrees:
- name: Design
  coreCourses:
  - ABCD1
  - ABCD2
  - ABCD3
  electiveGroups: #this is the section i need to extract
    - label: Electives
      options:
        - Studio1
        - Studio2
        - Studio3
    - label: OtherElectives
      options:
        - Class1
        - Development2
        - lateclass1
   specialisations:
    - label: Honours

How I would like to see the output in csv:

.yaml file name | Electives   | Studio1
.yaml file name | Electives   | Studio2
.yaml file name | Electives   | Studio3
.yaml file name | OtherElectives   | Class1
.yaml file name | OtherElectives   | Development2
.yaml file name | OtherElectives   | lateclass1

I'm assuming this will be a relatively simple script to write, but I'm looking for some help in writing it. I'm very new at this, so please be patient. I have written a few VBA macros, so I'm hoping I can catch on relatively quickly.

The best would be a complete solution with some guidance as to how the code is working.

Thanks in advance for all your help. I hope my problem is clear.

This is my first attempt (I didn't spend too long on it):

import yaml
with open ('program_4803','r') as f:
    doc = yaml.load(f)
    txt=doc["electiveGroups"]["options"]
    file = open(“test.txt”,”w”) 
        file.write(“txt”) 
        file.close()

This is very incomplete at the moment, as you can probably tell, but I'm trying my hardest!

Bob Sha

2 Answers

This might help:

import yaml
import csv

yaml_file_names = ['data.yaml', 'data2.yaml']


rows_to_write = []

for idx, each_yaml_file in enumerate(yaml_file_names):
    print("Processing file ", idx+1, "of", len(yaml_file_names), "file name:", each_yaml_file)
    with open(each_yaml_file) as f:
        # safe_load needs no Loader argument and does not construct arbitrary objects
        data = yaml.safe_load(f)

        for each_dict in data['degrees']:
            for each_nested_dict in each_dict['electiveGroups']:
                for each_option in each_nested_dict['options']:
                    # write to csv yaml_file_name, each_nested_dict['label'], each_option
                    rows_to_write.append([each_yaml_file, each_nested_dict['label'], each_option])



with open('output_csv_file.csv', 'w', newline='') as out:
    csv_writer = csv.writer(out, delimiter='|')
    csv_writer.writerows(rows_to_write)
    print("Output file output_csv_file.csv created")

I tested this code with two mock input files, data.yaml and data2.yaml, with the following contents:

data.yaml:

code: 9313
degrees:
- name: Design
  coreCourses:
  - ABCD1
  - ABCD2
  - ABCD3
  electiveGroups: #this is the section i need to extract
    - label: Electives
      options:
        - Studio1
        - Studio2
        - Studio3
    - label: OtherElectives
      options:
        - Class1
        - Development2
        - lateclass1
  specialisations:
  - label: Honours

and data2.yaml:

code: 9313
degrees:
- name: Design
  coreCourses:
  - ABCD1
  - ABCD2
  - ABCD3
  electiveGroups: #this is the section i need to extract
    - label: Electives
      options:
        - Studio1
    - label: E2
      options:
        - Class1
  specialisations:
  - label: Honours

and the output csv file generated was this:

data.yaml|Electives|Studio1
data.yaml|Electives|Studio2
data.yaml|Electives|Studio3
data.yaml|OtherElectives|Class1
data.yaml|OtherElectives|Development2
data.yaml|OtherElectives|lateclass1
data2.yaml|Electives|Studio1
data2.yaml|E2|Class1

By the way, the last two lines of the YAML sample in your question (specialisations and its label) were not indented correctly.

Since you need to parse about 300 YAML files in a directory, you can use Python's glob module to collect the file names, like this:

import yaml
import csv
import glob


yaml_file_names = glob.glob('./*.yaml')
# yaml_file_names = ['data.yaml', 'data2.yaml']

rows_to_write = []

for idx, each_yaml_file in enumerate(yaml_file_names):
    print("Processing file ", idx+1, "of", len(yaml_file_names), "file name:", each_yaml_file)
    with open(each_yaml_file) as f:
        # safe_load needs no Loader argument and does not construct arbitrary objects
        data = yaml.safe_load(f)

        for each_dict in data['degrees']:
            for each_nested_dict in each_dict['electiveGroups']:
                for each_option in each_nested_dict['options']:
                    # write to csv yaml_file_name, each_nested_dict['label'], each_option
                    rows_to_write.append([each_yaml_file, each_nested_dict['label'], each_option])



with open('output_csv_file.csv', 'w', newline='') as out:
    # quotechar=' ' dropped: it would wrap any field containing a space in extra spaces
    csv_writer = csv.writer(out, delimiter='|')
    csv_writer.writerows(rows_to_write)
    print("Output file output_csv_file.csv created")
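
One side effect of the glob version worth noting: glob.glob('./*.yaml') returns paths such as ./data.yaml, so the first CSV column carries that ./ prefix. If only the bare file name is wanted, something like the sketch below might help. The helper name list_yaml_files is made up for illustration, and the sorted() call is my own addition so that rows come out in a predictable order.

```python
import glob
import os


def list_yaml_files(directory='.'):
    """Return (full_path, bare_name) pairs for every .yaml file in directory."""
    pattern = os.path.join(directory, '*.yaml')
    # sorted() is an addition here; glob itself makes no ordering promise
    return [(path, os.path.basename(path)) for path in sorted(glob.glob(pattern))]


if __name__ == '__main__':
    for full_path, bare_name in list_yaml_files('.'):
        # open full_path to read the file, write bare_name to the CSV
        print(full_path, '->', bare_name)
```

The loop would then open full_path and put bare_name in the first column of each row.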

Edit: as you asked in the comments about skipping YAML files that have no electiveGroups section, here is the updated program:

import yaml
import csv
import glob


yaml_file_names = glob.glob('./*.yaml')
# yaml_file_names = ['data.yaml', 'data2.yaml']

rows_to_write = []

for idx, each_yaml_file in enumerate(yaml_file_names):
    print("Processing file ", idx+1, "of", len(yaml_file_names), "file name:", each_yaml_file)
    with open(each_yaml_file) as f:
        # safe_load needs no Loader argument and does not construct arbitrary objects
        data = yaml.safe_load(f)

        for each_dict in data['degrees']:
            try:
                for each_nested_dict in each_dict['electiveGroups']:
                    for each_option in each_nested_dict['options']:
                        # write to csv yaml_file_name, each_nested_dict['label'], each_option
                        rows_to_write.append([each_yaml_file, each_nested_dict['label'], each_option])
            except KeyError:
                print("No electiveGroups or options key found in", each_yaml_file)


with open('output_csv_file.csv', 'w', newline='') as out:
    # quotechar=' ' dropped: it would wrap any field containing a space in extra spaces
    csv_writer = csv.writer(out, delimiter='|')
    csv_writer.writerows(rows_to_write)
    print("Output file output_csv_file.csv created")
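
A narrower alternative to the try/except, sketched on already-parsed data: dict.get with a fallback lets a degree without electiveGroups (or a group without options) simply contribute no rows, without also swallowing a KeyError from somewhere unrelated. The function name extract_rows and the `or []` handling of empty keys are my own assumptions, not something from the question.

```python
def extract_rows(file_name, data):
    """Build [file_name, label, option] rows from one parsed YAML document."""
    rows = []
    # `or []` also covers keys that are present but empty (parsed as None)
    for degree in data.get('degrees') or []:
        for group in degree.get('electiveGroups') or []:
            for option in group.get('options') or []:
                rows.append([file_name, group.get('label', ''), option])
    return rows


if __name__ == '__main__':
    sample = {
        'code': 9313,
        'degrees': [
            {'name': 'Design',
             'electiveGroups': [
                 {'label': 'Electives', 'options': ['Studio1', 'Studio2']},
             ]},
            # no electiveGroups key here, so this degree adds no rows
            {'name': 'NoElectives', 'coreCourses': ['ABCD1']},
        ],
    }
    for row in extract_rows('data.yaml', sample):
        print(row)
```

Unlike the try/except version, this prints no message for skipped files, so add one if you want to know which files had nothing to extract.
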
tkhurana96
  • Wow, this is fantastic! I found it very easy to navigate and learn from. Thank you so much! Is there any way to reward you for this answer? – Bob Sha Oct 11 '17 at 05:44
  • Is there a way to skip those YAML files with no options or elective groups? I've checked the web and found the suggestion to add "except: pass". Is this appropriate? – Bob Sha Oct 11 '17 at 05:53
  • I've tried adding "try: [code] except: exception pass" within the for loop, but that didn't work; it just produced an empty .csv – Bob Sha Oct 11 '17 at 06:00
  • @BobSha, I updated my answer for the case where there is no electiveGroups section in the YAML file. If my answer helped you, then upvote it and accept it (this you have already done) :) – tkhurana96 Oct 11 '17 at 06:32
For parsing YAML files, use the Python yaml library (PyYAML).

Example here: Parsing a YAML file in Python, and accessing the data?

For writing to a file, you do not need the csv library:

# the with block closes the file automatically
with open("testfile.txt", "w") as file:
    file.write("Hello World")

The above code will write to a file and you can just iterate the result of yaml parsing and write the output to the file accordingly.
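
To make that a bit more concrete, here is a rough sketch of writing the rows with plain file.write instead of the csv module. It assumes the rows have already been collected as [file name, label, option] lists, and that no field contains the | character, since plain string joining does no quoting or escaping. The name write_rows is hypothetical.

```python
def write_rows(rows, out_path, delimiter='|'):
    """Write each row as one delimiter-separated line of text."""
    with open(out_path, 'w') as out:
        for row in rows:
            # str() guards against non-string values such as numeric codes
            out.write(delimiter.join(str(field) for field in row) + '\n')


if __name__ == '__main__':
    rows = [
        ['data.yaml', 'Electives', 'Studio1'],
        ['data.yaml', 'OtherElectives', 'Class1'],
    ]
    write_rows(rows, 'output.txt')
```

If a field could ever contain the delimiter or a line break, the csv module is the safer choice because it handles quoting for you.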

omuthu
  • Thanks for this. I had a first attempt at this and it didn't quite work either. Will keep trying! – Bob Sha Oct 11 '17 at 05:44