I am trying to evaluate some data using Python. A sample of the data is as follows
****************************************************************
* SAMPLE DATA *
****************************************************************
* Date Times Severity Source Machine State
18-May-2019 16:28:18 I StatesLog Off-Line States: IALT-1
18-May-2019 16:28:19 I StatesLog Off-Line States: TdALclr-0
18-May-2019 16:28:19 I StatesLog Off-Line States: S722a1-0, S722a2-0, S722ascon-0
18-May-2019 16:28:19 I StatesLog Off-Line States: !S722a1-1, S722(OFF)-0, !S722a2-1
What I'm after (eventually) is
Time Data
18-May-2019 16:28:18 IALT-1
18-May-2019 16:28:19 TdALclr-0
18-May-2019 16:28:19 S722a1-0,
18-May-2019 16:28:19 S722a2-0,
18-May-2019 16:28:19 S722ascon-0
18-May-2019 16:28:19 !S722a1-1,
18-May-2019 16:28:19 S722(OFF)-0,
18-May-2019 16:28:19 !S722a2-1
With data this short I can manually adjust the amount of columns I need, but as some of the data is over 100Mb in size I have no idea of how many columns I'll need to put it into a DataFrame.
I have tried the code below to remove the large header
import pandas as pd
with open('test.txt') as oldfile, open('newtest.txt', 'w') as newfile:
newfile.write('Date Times Severity Source Machine State Data Data1 Data2')
for line in oldfile:
if '*' not in line:
newfile.write(line)
df = pd.read_table('newtest.txt', sep ='\s+', engine = 'python')
df[['Date', 'Times', 'Data', 'Data1', 'Data2']].to_csv('trial.csv')
Which has worked up to a point but after a while with the normal data I get the a parse error that there were too many fields in the line from the "read_table" command "Expected X fields in line Z, saw Y". I presume this is because the number of columns is taken from the top line?
I need a way to read file to know the maximum amount of columns to pass somehow to pandas to eleviate the error. The column names don't matter for now as I can always adjust them later in the code.
Then hopefully the bottom part of my code will give me the results I'm after
df['Time'] = df['Date'].astype(str) + ' ' +df['Times']
a = df.set_index('Time').stack()
df = a[a !=0].reset_index(drop=True, level=1).reset_index(name='Data').to_csv('output.csv')