I have a text file to convert to YAML format. Here are some notes to describe the problem a little better:

  • Each section in the file can contain a different number of subheadings.
  • The values of the subheadings can be any data type (e.g. string, bool, int, double, datetime).
  • The file is approximately 2,000 lines long.

An example of the format is below:

file_content = '''
    Section section_1
        section_1_subheading1 = text
        section_1_subheading2 = bool
    end
    Section section_2
       section_2_subheading3 = int
       section_2_subheading4 = double
       section_2_subheading5 = bool
       section_2_subheading6 = text
       section_2_subheading7 = datetime
    end
    Section section_3
       section_3_subheading8 = numeric
       section_3_subheading9 = int
    end
'''

I have tried to convert the text to YAML format by:

  1. Replacing the equals signs with colons.
  2. Replacing "Section section_name" with "section_name :" using regex.
  3. Removing the "end" line that closes each section.

However, I am having difficulty with #2 and #3. This is the text-to-YAML function I have created so far:

import yaml
import re

def convert_txt_to_yaml(file_content):
    """Converts a text file to a YAML file"""

    # Replace "=" with ":"
    file_content2 = file_content.replace("=", ":")

    # Split the lines 
    lines = file_content2.splitlines()

    # Define section headings to find and replace
    section_names = "Section "
    section_headings = r"(?<=Section )(.*)$"
    section_colons = r"\1 : "
    end_names = "end"

    # Convert to YAML format, line-by-line
    for line in lines:
        add_colon = re.sub(section_headings, section_colons, line) # Add colon to end of section name
        remove_section_word = re.sub(section_names, "", add_colon) # Remove "Section " in section header
        line = re.sub(end_names, "", remove_section_word)          # Remove "end" between sections

    # Join lines back together
    converted_file = "\n".join(lines)
    return converted_file

I believe the problem is within the for loop: I can't figure out why the section headers and endings aren't changing. Each converted line prints correctly if I test it inside the loop, but the changes aren't saved back to lines.

The output format I am looking for is the following:

file_content = '''
    section_1 :
        section_1_subheading1 : text
        section_1_subheading2 : bool
    section_2 :
        section_2_subheading3 : int
        section_2_subheading4 : double
        section_2_subheading5 : bool
        section_2_subheading6 : text
        section_2_subheading7 : datetime
    section_3 :
        section_3_subheading8 : numeric
        section_3_subheading9 : int
'''
mj_whales
  • What is the output format you’re looking for? – Abhijit Sarkar Aug 03 '20 at 09:50
  • Hi @AbhijitSarkar, I have just added the output format. Thank you for the reminder. – mj_whales Aug 04 '20 at 04:38
  • How do you know whether `13` in the source file is a string, or a number in the output file? You need to clarify your question, basic facts are missing. – Abhijit Sarkar Aug 04 '20 at 05:34
  • I don't know if that is a valid "basic fact", considering YAML does not require quotes around strings unless special characters are present. There are no single or double quotes in the source file. – mj_whales Aug 04 '20 at 05:52
  • In that case, the mention of all the data types in your input file is misleading, since you intend to merely copy over the values. The only thing that’s invalid here is your depiction of the problem, which now seems to be a trivial one. – Abhijit Sarkar Aug 04 '20 at 05:58

1 Answer

I would rather parse the text into a dict and then serialize that to YAML using the yaml package (PyYAML) in Python, as below:

import re
import yaml

def convert_txt_to_yaml(file_content):
    """Converts the Section/end text format to a YAML string"""
    config_dict = {}

    # Split the lines
    lines = file_content.splitlines()
    section_title = None
    for line in lines:
        if not line.strip():
            # Blank line (splitlines() has already removed the "\n")
            continue
        elif re.match(r'\s*end\s*$', line):
            # End of section
            section_title = None
        elif re.match(r'\s*Section\s+\S', line):
            # Start of section
            match_obj = re.match(r'\s*Section\s+(.*\S)', line)
            section_title = match_obj.group(1)
            config_dict[section_title] = {}
        elif section_title and '=' in line:
            # "key = value" line: split on the first "=" and strip whitespace.
            # The full key is kept, as in the requested output.
            match_obj = re.match(r'\s*(.*?)\s*=\s*(.*?)\s*$', line)
            config_dict[section_title][match_obj.group(1)] = match_obj.group(2)
    # sort_keys=False (PyYAML 5.1+) keeps sections and keys in file order
    return yaml.dump(config_dict, sort_keys=False)
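
As for why the loop in the question leaves lines unchanged: assigning to line inside the for loop only rebinds the loop variable and never writes back into the list. Below is a minimal sketch of the original line-by-line approach with that fixed by collecting the converted lines in a new list. The name convert_lines is hypothetical, and the sketch assumes that section names contain no spaces and that "end" always sits on its own line.

```python
import re

def convert_lines(file_content):
    """Rewrite 'Section x ... end' text as YAML-style text, line by line."""
    converted = []
    for line in file_content.splitlines():
        stripped = line.strip()
        if stripped == "end":
            # Drop the line that closes a section
            continue
        if stripped.startswith("Section "):
            # "Section name" -> "name :", keeping the original indentation
            line = re.sub(r"Section\s+(\S+)\s*$", r"\1 :", line)
        else:
            # Replace only the first "=" so values containing "=" survive
            line = line.replace("=", ":", 1)
        # Append the result; assigning to `line` alone would be lost
        converted.append(line)
    return "\n".join(converted)

sample = """Section section_1
    section_1_subheading1 = text
    section_1_subheading2 = bool
end"""
print(convert_lines(sample))
```

This keeps the text-substitution idea from the question, but unlike the dict-based version it does not validate the structure, so a stray line outside a section is passed through unchanged.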
  • Thank you very much for this, Arun. Conceptually it solves my problem with the section beginnings and endings; however, the output is an error and the values are no longer printed. I will work on this independently, but any suggestions would be very much appreciated. – mj_whales Aug 04 '20 at 05:57
  • Can you tell me what error you are facing? When I tried it, I didn't see any error and proper YAML was dumped. – Arun Kaliraja Baskaran Aug 04 '20 at 14:14
  • I get this error: https://stackoverflow.com/questions/25584124/oserror-errno-22-invalid-argument-when-use-open-in-python which is clearly fixable. So you have solved my problem. – mj_whales Aug 12 '20 at 11:34