Reading data in python

Question

I have log data in this format:

TIMESTAMP="Jun  7 2010 15:03:49 NZST" ACCESS-TYPE="ABC" TYPE="XYZ" PACKET-
TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-
STATION-ID="LKP" SUB-ID="JIK"

How can I read this into a proper data frame (rows and columns) using Python, where the column names would be TIMESTAMP, ACCESS-TYPE and so on?

This is just one sample row from the data.

"Proper data frame"? Would you like a dictionary? An ordered dictionary? A matrix? Something else? — obskyr, Jun 22 '17 at 07:43
No. Proper dataframe, rows and columns... like a spreadsheet. — , Jun 22 '17 at 13:36

Maarten Fabré · Answer 1 · 2017-06-22T12:46:56.527

You can use re to split each line into a list of tuples or a dict, and then use that to populate a DataFrame.

import re

def parse_logfile(log_file_handle):
    # each KEY="value" pair on a line becomes one (key, value) tuple
    p = re.compile(r'\s*(.*?)="(.*?)"')
    for line in log_file_handle:
        yield p.findall(line)

For the line you posted, this yields

[('TIMESTAMP', 'Jun  7 2010 15:03:49 NZST'),
 ('ACCESS-TYPE', 'ABC'),
 ('TYPE', 'XYZ'),
 ('PACKET-TYPE', 'St'),
 ('REASON', 'bkz'),
 ('CIRCUIT-ID', 'UIX eth 1/1/11/20'),
 ('REMOTE-ID', 'NBC'),
 ('CALLING-STATION-ID', 'LKP'),
 ('SUB-ID', 'JIK')]

So in another part of the code you can do something like:

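import pandas as pd

with open(log_filename, 'r') as log_file_handle:
    log_lines = parse_logfile(log_file_handle)

    # DataFrame.append was removed in pandas 2.0, so convert each line
    # to a dict and build the frame once from all of them
    df = pd.DataFrame([dict(line) for line in log_lines])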

test_data

TIMESTAMP="Jun  7 2010 15:03:49 NZST" ACCESS-TYPE="ABC" TYPE="XYZ" PACKET-TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-STATION-ID="LKP" SUB-ID="JIK"
TIMESTAMP="Jun  7 2010 15:03:50 NZST" ACCESS-TYPE1="ABC1" TYPE="XYZ" PACKET-TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-STATION-ID="LKP" SUB-ID="JIK"
TIMESTAMP="Jun  7 2010 15:03:51 NZST" ACCESS-TYPE="ABC2" TYPE="XYZ" PACKET-TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-STATION-ID="LKP" SUB-ID="JIK"

Here I changed the timestamps and access types, and the second entry has ACCESS-TYPE1 instead of ACCESS-TYPE.

result

    ACCESS-TYPE  CALLING-STATION-ID  CIRCUIT-ID         PACKET-TYPE  REASON  REMOTE-ID  SUB-ID  TIMESTAMP                 TYPE  ACCESS-TYPE1
0   ABC          LKP                 UIX eth 1/1/11/20  St           bkz     NBC        JIK     Jun 7 2010 15:03:49 NZST  XYZ   NaN
1   NaN          LKP                 UIX eth 1/1/11/20  St           bkz     NBC        JIK     Jun 7 2010 15:03:50 NZST  XYZ   ABC1
2   ABC2         LKP                 UIX eth 1/1/11/20  St           bkz     NBC        JIK     Jun 7 2010 15:03:51 NZST  XYZ   NaN
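The column order in a result like this depends on the pandas version, and it rarely matches the order in the log. A minimal sketch of putting the columns in a chosen order with `reindex` and then writing CSV follows. The `rows` list is a shortened stand-in for the parsed lines, and the `wanted` order is only an assumption about what you might prefer.

```python
import io

import pandas as pd

# Shortened stand-in for the dicts produced from the parsed log lines.
rows = [
    {"TIMESTAMP": "Jun  7 2010 15:03:49 NZST", "ACCESS-TYPE": "ABC", "TYPE": "XYZ"},
    {"TIMESTAMP": "Jun  7 2010 15:03:50 NZST", "ACCESS-TYPE1": "ABC1", "TYPE": "XYZ"},
]

df = pd.DataFrame(rows)

# Hypothetical preferred column order; keys missing from a row stay NaN.
wanted = ["TIMESTAMP", "ACCESS-TYPE", "ACCESS-TYPE1", "TYPE"]
df = df.reindex(columns=wanted)

# Write to an in-memory buffer here; pass a filename to to_csv for a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Missing values are written as empty fields by default, so the CSV stays rectangular even when a line lacks a key.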

If all the lines have the same keys in the same order, the appending should be easy. If this changes throughout the file, this might become more difficult. Can you post more lines?
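One rough way to find out whether the keys change throughout the file is to count the distinct key layouts before building the frame. This is only a sketch: `key_layouts` is a hypothetical helper, and the sample lines are shortened, made-up stand-ins for real log lines.

```python
import re
from collections import Counter

PAIR = re.compile(r'\s*(.*?)="(.*?)"')

def key_layouts(lines):
    """Count how often each ordered tuple of keys occurs in the lines."""
    return Counter(
        tuple(key for key, _value in PAIR.findall(line)) for line in lines
    )

# Made-up, shortened lines; the second one uses ACCESS-TYPE1.
lines = [
    'TIMESTAMP="a" ACCESS-TYPE="ABC" TYPE="XYZ"',
    'TIMESTAMP="b" ACCESS-TYPE1="ABC1" TYPE="XYZ"',
    'TIMESTAMP="c" ACCESS-TYPE="ABC2" TYPE="XYZ"',
]

layouts = key_layouts(lines)
for layout, count in layouts.most_common():
    print(count, layout)
```

A single layout means every line has the same keys in the same order; more than one points at the lines that will produce NaN cells.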

When I call the function as "parse_logefile("D:/PC1/Data_folder/sample.log")" I am not getting any output. It's showing "<generator object parse_logefile at 0x000000FEC1D8DAF0>" — , Jun 22 '17 at 09:17
That is because it is one. You can read about generators [here](http://www.oreilly.com/programming/free/files/a-whirlwind-tour-of-python.pdf), [here](https://wiki.python.org/moin/Generators) or [here](https://stackoverflow.com/a/1756156/1562285) — Maarten Fabré, Jun 22 '17 at 09:20
Got it. Now how do I bring this output data into a CSV file in dataframe format? — , Jun 22 '17 at 10:01
I have appended the data to a list. How do I convert it into a dataframe (rows and columns) now? — , Jun 22 '17 at 11:36
I added code to get it into a pandas DataFrame. The columns are sorted differently, but you can fix that with `reindex`. — Maarten Fabré, Jun 22 '17 at 12:48
Thank you. It's working absolutely fine. I have a question though: why did you use the "dict" keyword? Does it change the tuple to a dictionary? — , Jun 22 '17 at 13:50
You can try with and without. It indeed changes the line from a list of tuples to a dict — Maarten Fabré, Jun 22 '17 at 13:51
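Two points from these comments can be seen in a small sketch: calling the function only creates a generator, which does nothing until it is iterated, and `dict()` turns each list of (key, value) tuples into a dictionary. The `io.StringIO` object is an assumption standing in for an opened log file. Note that the function expects an open file handle, not a path string.

```python
import io
import re

PAIR = re.compile(r'\s*(.*?)="(.*?)"')

def parse_logfile(log_file_handle):
    """Yield one list of (key, value) tuples per line of the handle."""
    for line in log_file_handle:
        yield PAIR.findall(line)

# Stand-in for open(log_filename); shortened to two keys per line.
fake_file = io.StringIO(
    'TIMESTAMP="Jun  7 2010 15:03:49 NZST" ACCESS-TYPE="ABC"\n'
    'TIMESTAMP="Jun  7 2010 15:03:50 NZST" ACCESS-TYPE="ABC1"\n'
)

gen = parse_logfile(fake_file)          # no parsing has happened yet
first = next(gen)                       # a list of tuples for line 1
rows = [dict(first)] + [dict(pairs) for pairs in gen]  # list of dicts

print(first)
print(rows)
```

A list of dicts like `rows` is a shape that `pd.DataFrame` accepts directly.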

score 1 · Answer 2 · answered Jun 22 '17 at 08:59

This is a nice simple example to use to create a small parser using pyparsing:

import pyparsing as pp

key = pp.Word(pp.alphas, pp.alphas+'-')
EQ = pp.Literal('=').suppress()
value = pp.QuotedString('"')
parser = pp.Dict(pp.OneOrMore(pp.Group(key + EQ + value)))

Use parser to parse your input data (joining the separate lines into one, since your sample input breaks some lines in the middle of a key):

sample = """\
TIMESTAMP="Jun  7 2010 15:03:49 NZST" ACCESS-TYPE="ABC" TYPE="XYZ" PACKET-
TYPE="St" REASON="bkz" CIRCUIT-ID="UIX eth 1/1/11/20" REMOTE-ID="NBC" CALLING-
STATION-ID="LKP" SUB-ID="JIK" """
sample = ''.join(sample.splitlines())

# parse the input string
result = parser.parseString(sample)

Access the results using dict or attribute notation, or call dump() to view the keys and structure:

print(result['PACKET-TYPE'])
print(list(result.keys()))
print(result.TYPE)
print("{TIMESTAMP}/{ACCESS-TYPE}/{CALLING-STATION-ID}".format(**result))
print(result.dump())

Prints:

St
['PACKET-TYPE', 'SUB-ID', 'REASON', 'CALLING-STATION-ID', 'ACCESS-TYPE', 'CIRCUIT-ID', 'REMOTE-ID', 'TYPE', 'TIMESTAMP']
XYZ
Jun  7 2010 15:03:49 NZST/ABC/LKP
[['TIMESTAMP', 'Jun  7 2010 15:03:49 NZST'], ['ACCESS-TYPE', 'ABC'], ['TYPE', 'XYZ'], ['PACKET-TYPE', 'St'], ['REASON', 'bkz'], ['CIRCUIT-ID', 'UIX eth 1/1/11/20'], ['REMOTE-ID', 'NBC'], ['CALLING-STATION-ID', 'LKP'], ['SUB-ID', 'JIK']]
- ACCESS-TYPE: 'ABC'
- CALLING-STATION-ID: 'LKP'
- CIRCUIT-ID: 'UIX eth 1/1/11/20'
- PACKET-TYPE: 'St'
- REASON: 'bkz'
- REMOTE-ID: 'NBC'
- SUB-ID: 'JIK'
- TIMESTAMP: 'Jun  7 2010 15:03:49 NZST'
- TYPE: 'XYZ'
