I would suggest you use a script to first convert your data into a suitable CSV format that could then be loaded without problem using Pandas.
Most of the data appears to be in a fixed width format, as such the column locations could be hard coded, and strip()
applied to remove any whitespace. The time and date can be extracted from the timestamp line and added to the start of each entry within it.
groupby()
is used to read the file in blocks, this also has the effect of skipping the unwanted line between segments. It works by reading the file a line at a time and passing each line to a key
function (this is provided inline using a lambda
function to save space). If the result of the function changes, the groupby function returns a list of lines which have the same return. In this case it is testing if the line does not start with the A=<
characters. So you get one return where key
is True
holding a list of lines starting with the timestamp. Then a list of lines (in this case just one) with the lines starting A=<
. This makes it easier to then process a whole segment without having to worry about where it finishes. The first entry is the timestamp followed by all the entries.
A list comprehension is used to extract each individual value from the line and create a row
of values. The pairwise()
recipe is used to read the fixed column locations from cols
to get a sliding start
and end
column positions. These are used as a string slice to extract the characters for each column. The string then has .strip()
applied to remove any surrounding spaces.
from itertools import groupby, tee
import csv
def pairwise(iterable):
"s -> (s0,s1), (s1,s2), (s2, s3), ..."
a, b = tee(iterable)
next(b, None)
return zip(a, b)
cols = [0, 5, 12, 15, 19, 24, 28, 32, 37, 41, 45, 50, 54, 58, 63, 68, 71, 76] # + rest
header = None
with open('data.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
csv_output = csv.writer(f_output)
for key, group in groupby(f_input, lambda x: not x.startswith('A=<')):
if key:
# Extract the timestamp from the first returned line in the segment
# Split the line on spaces and take the first 3 elements
timestamp = next(group).split()[:3]
block = []
for line in group:
row = [line[start:end].strip() for start, end in pairwise(cols)]
row.append(line[cols[-1]:].strip()) # Add GHI column
block.append(timestamp + row)
if not header:
header = block[0][3:]
header[2] = 'Unknown'
csv_output.writerow(['Day', 'Date', 'Time'] + header)
csv_output.writerows(block[1:])
For the data you have given, this would give you
Day,Date,Time,IFJ,SC/LY,Unknown,AB,CD?,EF,GH,IJK,LM,NO,PQR,ST,UV,WXY,ZA,BC,DEF,GHI
Tuesday,19-April-2010,00:01,1234,SC 122,,A,20?,31,1,4?,10,3,2?,19,9,5?,-,,-?,30
Tuesday,19-April-2010,00:01,1234,SC 123,',B,60?,11,2,3?,10,5,6?,19,9,4?,17,4,1?,30
Tuesday,19-April-2010,00:01,1234,SC 124,,C,20?,21,2,2?,10,2,1?,9,9,0?,-,,-?,34
Tuesday,19-April-2010,00:01,1234,SC 125,^,1,100?,-,,-?,0,3,3?,-,,-?,-,,-?,10
Tuesday,19-April-2010,00:01,1234,SC 226,*,5,60?,-,,-?,14,4,3?,9,9,2?,7,3,3?,18
Tuesday,19-April-2010,00:03,1234,SC 122,,A,21?,31,5,4?,17,3,2?,19,9,1?,-,,-?,31
Tuesday,19-April-2010,00:03,1234,SC 123,',B,61?,11,2,3?,19,5,6?,19,9,4?,17,4,3?,32
Tuesday,19-April-2010,00:03,1234,SC 124,,C,21?,21,5,2?,10,2,1?,9,9,0?,-,,-?,33
Tuesday,19-April-2010,00:03,1234,SC 125,^,1,101?,-,,-?,7,3,3?,-,,-?,-,,-?,14
Tuesday,19-April-2010,00:03,0000,SC 226,*,5,61?,-,,-?,14,4,3?,9,9,2?,7,3,3?,18
To get a better understanding of how the script actually works, I would recommend you add create a small test file (with two segments) and then also add some print statements.