
What is the best way to parse the file below? The blocks repeat multiple times.

The expected result is output to a CSV file as:

{Place: REGION-1, Host: ABCD, Area: 44...}

I tried the code below, but it only iterates over the first block and then finishes.

import re

with open('/tmp/t2.txt', 'r') as input_data:
    for line in input_data:

        if re.findall('(.*_RV)\n', line):
            myDict = {}
            myDict['HOST'] = line[6:]
            continue

        elif re.findall('Interface(.*)\n', line):
            myDict['INTF'] = line[6:]
        elif len(line.strip()) == 0:
            print(myDict)

The text file is below.

Instance REGION-1:
  ABCD_RV
    Interface: fastethernet01/01
    Last state change: 0h54m44s ago
    Sysid: 01441
    Speaks: IPv4
    Topologies:
      ipv4-unicast     
    SAPA: point-to-point
    Area Address(es):
      441
    IPv4 Address(es):
      1.1.1.1    

  EFGH_RV
    Interface: fastethernet01/01
    Last state change: 0h54m44s ago
    Sysid: 01442
    Speaks: IPv4
    Topologies:
      ipv4-unicast     
    SAPA: point-to-point
    Area Address(es):
      442
    IPv4 Address(es):
      1.1.1.2   

Instance REGION-2:
  IJKL_RV
    Interface: fastethernet01/01
    Last state change: 0h54m44s ago
    Sysid: 01443
    Speaks: IPv4
    Topologies:
      ipv4-unicast     
    SAPA: point-to-point
    Area Address(es):
      443
    IPv4 Address(es):
      1.1.1.3   
    Welcome to SO! Can you clarify your output format? It looks like fields are changing or being omitted, but the snippet is too brief to really get a sense of what you're going for. Please post the full version. – ggorlen Apr 11 '19 at 22:05
  • yes, not all are in key:value layout... some have the key first and the value on the next line, e.g. IPv4 Address: with its value 1.1.1.3 on the next line – MCMCNJ CAM Apr 12 '19 at 02:01

3 Answers


This worked for me but it's not pretty:

import pandas as pd

#read the whole file into one string
with open('/tmp/t2.txt', 'r') as input_data:
    text = input_data.read()
text = text.rstrip(' ').rstrip('\n').strip('\n')
#first I get ready to create a csv by replacing the headers for the data
text=text.replace('Instance REGION-1:',',')
text=text.replace('Instance REGION-2:',',')
text=text.replace('Interface:',',')
text=text.replace('Last state change:',',')
text=text.replace('Sysid:',',')
text=text.replace('Speaks:',',')
text=text.replace('Topologies:',',')
text=text.replace('SAPA:',',')
text=text.replace('Area Address(es):',',')
text=text.replace('IPv4 Address(es):',',')

#now I strip out the leading whitespace, cuz it messes up the split on '\n\n'
lines=[x.lstrip(' ') for x in text.split('\n')]


clean_text=''

#now that the leading whitespace is gone I recreate the text file
for line in lines:
    clean_text+=line+'\n'

#Now split the data into groups based on single entries
entries=clean_text.split('\n\n')
#create one liners out of the entries so they can be split like csv
entry_lines=[x.replace('\n',' ') for x in entries]

#create a dataframe to hold the data for each line
df=pd.DataFrame(columns=['Instance REGION','Interface',
                         'Last state change','Sysid','Speaks',
                         'Topologies','SAPA','Area Address(es)',
                         'IPv4 Address(es)']).T

#now the meat and potatoes
count=0
for line in entry_lines:   
    data=line[1:].split(',')        #split like a csv on commas
    data=[x.lstrip(' ').rstrip(' ') for x in data]     #get rid of extra leading/trailing whitespace
    df[count]=data    #create an entry for each split
    count+=1          #increment the count

df=df.T               #transpose back to normal so it doesn't look weird

Output looks like this for me

[screenshot of the resulting DataFrame]
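
Since the goal is a CSV file, the DataFrame built above could presumably be written out with pandas' to_csv; the output path here is just an example:

#write the parsed entries to a CSV file (path is hypothetical)
df.to_csv('/tmp/parsed.csv', index=False)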

Edit: Also, since you have various answers here, I tested the performance of mine. It is mildly exponential, as described by the fitted equation y = 100.97e^(0.0003x).

Here are my timeit results; a sketch for reproducing the measurement follows the table.

Entries Milliseconds
18      49
270     106
1620    394
178420  28400
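
For anyone who wants to reproduce the timing, a rough timeit sketch could look like the following; parse_entries is an assumed wrapper that takes the raw text, runs the parsing code above, and returns the DataFrame, and the repetition factor just scales the number of entries:

import timeit

#read the sample file and scale it up to get more entries (factor is arbitrary)
with open('/tmp/t2.txt') as f:
    sample = f.read()
big_text = (sample + '\n\n') * 100

#parse_entries is a hypothetical function wrapping the parsing code above
elapsed = timeit.timeit(lambda: parse_entries(big_text), number=1)
print('%.0f ms' % (elapsed * 1000))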

Or if you prefer an ugly regex route:

import re

region_re = re.compile(r"^Instance\s+([^:]+):.*")
host_re = re.compile(r"^\s+(.*?)_RV.*")
interface_re = re.compile(r"^\s+Interface:\s+(.*?)\s+")
other_re = re.compile(r"^\s+([^\s]+).*?:\s+([^\s]*){0,1}")

myDict = {}
extra = None
with open('/tmp/t2.txt', 'r') as input_data:
   for line in input_data:
        if extra: # value on next line from key
            myDict[extra] = line.strip()
            extra = None
            continue

        region = region_re.match(line)
        if region:
            if len(myDict) > 1:
                print(myDict)
            myDict = {'Place': region.group(1)}
            continue

        host = host_re.match(line)
        if host:
            if len(myDict) > 1:
                print(myDict)
            myDict = {'Place': myDict['Place'], 'Host': host.group(1)}
            continue

        interface = interface_re.match(line)
        if interface:
            myDict['INTF'] = interface.group(1)
            continue

        other = other_re.match(line)
        if other:
            groups = other.groups()
            if groups[1]:
                myDict[groups[0]] = groups[1]
            else:
                extra = groups[0]

# dump out final one
if len(myDict) > 1:
    print(myDict)

output:

{'Place': 'REGION-1', 'Host': 'ABCD', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01441', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '441', 'IPv4': '1.1.1.1'}
{'Place': 'REGION-1', 'Host': 'EFGH', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01442', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '442', 'IPv4': '1.1.1.2'}
{'Place': 'REGION-2', 'Host': 'IJKL', 'INTF': 'fastethernet01/01', 'Last': '0h54m44s', 'Sysid': '01443', 'Speaks': 'IPv4', 'Topologies': 'ipv4-unicast', 'SAPA': 'point-to-point', 'Area': '443', 'IPv4': '1.1.1.3'}
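
To get the CSV file the question asks for instead of printed dicts, one option (not part of the original answer) is to append each completed myDict to a list instead of printing it, then write that list with csv.DictWriter; the field names are taken from the output above, and the output path is just an example:

import csv

#records would be built by appending myDict wherever print(myDict) is called above
records = []

fieldnames = ['Place', 'Host', 'INTF', 'Last', 'Sysid', 'Speaks',
              'Topologies', 'SAPA', 'Area', 'IPv4']
with open('/tmp/t2.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)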

This doesn't use much regex and could be optimized further. Hope it helps!

import re
import pandas as pd
from collections import defaultdict

_level_1 = re.compile(r'instance region.*', re.IGNORECASE)
with open('stack_formatting.txt') as f:
    data = f.readlines()

"""
Format data so that it could be split easily
"""
data_blocks = defaultdict(lambda: defaultdict(str))
header = None
instance = None
for line in data:
    line = line.strip()
    if _level_1.match(line):
        header = line
    else:
        if "_RV" in line:
            instance = line
        elif not line.endswith(":"):
            data_blocks[header][instance] += line + ";"
        else:
            data_blocks[header][instance] += line


def parse_text(data_blocks):
    """
    Generate a dict which could be converted easily to a pandas dataframe
    :param data_blocks: splittable data
    :return: dict with row values for every column
    """
    final_data = defaultdict(list)
    for key1 in data_blocks.keys():
        for key2 in data_blocks.get(key1):
            final_data['instance'].append(key1)
            final_data['sub_instance'].append(key2)
            for items in data_blocks[key1][key2].split(";"):
                print(items)
                if items.isspace() or len(items) == 0:
                    continue
                a,b = re.split(r':\s*', items)
                final_data[a].append(b)
    return final_data


print(pd.DataFrame(parse_text(data_blocks)))
rhn89
  • 362
  • 3
  • 11
  • thanks, always like without regx !...i'll give it a try – MCMCNJ CAM Apr 12 '19 at 02:02
  • Did this work? I get an error as I start to increase the number of entries. I was going ot compare speeds and I can't get it to work. – bart cubrich Apr 12 '19 at 16:03
  • What is the error that you're getting? it a basic code which runs if all fields are present in all the instances. you can customize it to make it work for variations. – rhn89 Apr 12 '19 at 18:54
  • could you please explain whtat this means: data_blocks = defaultdict(lambda: defaultdict(str)) – MCMCNJ CAM Apr 15 '19 at 03:28
  • I wanted a dictionary of the instance and sub-instance to store the related data. You can read more about it here. https://stackoverflow.com/questions/19189274/nested-defaultdict-of-defaultdict – rhn89 Apr 15 '19 at 15:01
  • how will i add one more level to the 2 dimensional dict ? say if the above text file had another level where host_name would be the root, followed by Instance Region, host_DV and so on... – MCMCNJ CAM Apr 15 '19 at 17:45
  • `def dd(): return defaultdict(dd) data_blocks = dd() data_blocks['a']['b']['c']['d'] = "This is the value" ` – rhn89 Apr 15 '19 at 19:56
  • sorry not clear to me...but if i were to modify your above code...would these be the main changes: data_blocks = defaultdict(lambda: defaultdict(lambda: defaultdict(str))) and conditinal lines add one more elif ...and to the parase def...add one more key loop...final_data['sub_instance2'].append(key3).....only problem i am having is that the key3 seems to be empty even though it is populated above in elif – MCMCNJ CAM Apr 15 '19 at 20:25
  • ` in the last line you have to do data[key1][key2][key3] = "whatever" ` I have created a gist from my code - check it out if you have any problems - https://gist.github.com/ramanujam/fe249d1b39f90a509021661e2b679399 – rhn89 Apr 15 '19 at 21:56
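
For reference, a minimal sketch of the nested defaultdict structure discussed in the comments above; the keys are placeholders based on the sample file:

from collections import defaultdict

#two levels, as used in the answer: instance -> sub-instance -> accumulated string
data_blocks = defaultdict(lambda: defaultdict(str))
data_blocks['Instance REGION-1:']['ABCD_RV'] += 'Interface: fastethernet01/01;'

#arbitrarily deep nesting via a self-referencing factory, as in the last comment
def dd():
    return defaultdict(dd)

tree = dd()
tree['host_name']['Instance REGION-1:']['ABCD_RV']['Interface'] = 'fastethernet01/01'
print(tree['host_name']['Instance REGION-1:']['ABCD_RV']['Interface'])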