-3

I have a set of data that is neither tab- nor comma-delimited. When opened in Notepad it looks well structured, as shown below, but when opened in Excel it is not structured properly.

Tuesday 19-April-2010 00:01 CC  11   PQ 1.1  PS1.1 CS# 111 +1 RS113' SC 183 ZA 71
 IFJ   SC/LY     AB  CD?  EF  GH  IJK  LM  NO  PQR  ST  UV  WXY  ZA  BC  DEF GHI
 1234 SC 122      A  20?  31   1   4?  10   3   2?  19   9   5?   -       -?   30
 1234 SC 123  '   B  60?  11   2   3?  10   5   6?  19   9   4?  17   4   1?   30
 1234 SC 124      C  20?  21   2   2?  10   2   1?   9   9   0?   -       -?   34
 1234 SC 125 ^    1 100?   -       -?   0   3   3?   -       -?   -       -?   10
 1234 SC 226 *    5  60?   -       -?  14   4   3?   9   9   2?   7   3   3?   18
A=<43>  B=33  C=24
Tuesday 19-April-2010 00:03 CC  11   PQ 1.1  PS1.1 CS# 111 +1 RS113' SC 183 ZA 71
 IFJ   SC/LY     AB  CD?  EF  GH  IJK  LM  NO  PQR  ST  UV  WXY  ZA  BC  DEF GHI
 1234 SC 122      A  21?  31   5   4?  17   3   2?  19   9   1?   -       -?   31
 1234 SC 123  '   B  61?  11   2   3?  19   5   6?  19   9   4?  17   4   3?   32
 1234 SC 124      C  21?  21   5   2?  10   2   1?   9   9   0?   -       -?   33
 1234 SC 125 ^    1 101?   -       -?   7   3   3?   -       -?   -       -?   14
 0000 SC 226 *    5  61?   -       -?  14   4   3?   9   9   2?   7   3   3?   18

Is there any way I can use pandas to process such data for some analysis?

  • This data doesn't look structured. Looking at the column name "Tuesday 19-April-2010 00:01", it looks like it has been modified. – MEdwin Nov 23 '18 at 09:40
  • Could you use something like [pastebin](https://pastebin.com/) to upload a sample file (post a link to it here)? That way we can be sure of the exact formatting of the file. – Martin Evans Nov 23 '18 at 10:10
  • This looks like perfect regex territory. The symbol characters don't *look* like they're important (only you can really determine that without us having more data). Multi-whitespace + symbol character delimiters. – dijksterhuis Nov 23 '18 at 11:40
  • See this answer to delimit by regex using pandas https://stackoverflow.com/a/19632099 – dijksterhuis Nov 23 '18 at 12:01
  • Having had another look, I think you might have to go for fixed-width delimiting and then clean up the whitespace, tbh. Split at the right-hand side of the row 2 (header) positions, except the last column, which is +1 of the final header position. You can do that in Excel with data import – dijksterhuis Nov 23 '18 at 15:13
  • @MEdwin Hi, I've updated the data in the question. The "tables" are separated by timestamp lines as shown above – ThanksForHelping Nov 26 '18 at 03:50
  • @MartinEvans Hi, what is shown above in the question is the exact format of the data extracted from the system I'm processing; the format looks alright, but in actual fact the spaces are inconsistent. – ThanksForHelping Nov 26 '18 at 03:56
  • Is that one file with multiple sections? Or two example files? – Martin Evans Nov 26 '18 at 09:30
  • @MartinEvans it is one file with hundreds of segments separated by timestamp lines – ThanksForHelping Nov 26 '18 at 09:35

3 Answers


I would suggest you use a script to first convert your data into a suitable CSV format that can then be loaded without problems using pandas.

Most of the data appears to be in a fixed-width format, so the column locations can be hard-coded and strip() applied to remove any surrounding whitespace. The day, date and time can be extracted from the timestamp line and added to the start of each entry within that segment.
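Both ideas can be sketched on sample lines from the question (the slice offsets match the start of the cols list in the script below):

```python
# Fixed-width slicing: hard-coded (start, end) offsets plus strip()
line = ' 1234 SC 122      A  20?  31   1   4?  10   3   2?  19   9   5?   -       -?   30'
print(line[0:5].strip())   # -> '1234'
print(line[5:12].strip())  # -> 'SC 122'

# Timestamp extraction: split on spaces and keep the first three fields
stamp = "Tuesday 19-April-2010 00:01 CC  11   PQ 1.1  PS1.1 CS# 111 +1 RS113' SC 183 ZA 71"
print(stamp.split()[:3])   # -> ['Tuesday', '19-April-2010', '00:01']
```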

groupby() is used to read the file in blocks; this also has the effect of skipping the unwanted summary line between segments. It works by reading the file a line at a time and passing each line to a key function (provided inline as a lambda to save space). Whenever the result of the key function changes, groupby() yields a new group: the key value together with the consecutive lines that produced it. Here the key tests whether the line does not start with the A=< characters, so you get one group with key True holding the lines starting with the timestamp, followed by a group (in this case just one line) of lines starting with A=<. This makes it easier to process a whole segment without having to worry about where it finishes: the first entry is the timestamp line, followed by all the data entries.
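A minimal sketch of that grouping behaviour, using a toy list of lines in place of the real file:

```python
from itertools import groupby

lines = [
    'Tuesday 19-April-2010 00:01 CC  11 ...',  # timestamp line
    ' 1234 SC 122      A  20? ...',            # data rows
    ' 1234 SC 123      B  60? ...',
    'A=<43>  B=33  C=24',                      # summary line ending the segment
]

# One group of lines with key True (the segment), then one with key False
for key, group in groupby(lines, lambda x: not x.startswith('A=<')):
    print(key, list(group))
```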

A list comprehension extracts each individual value from a line to create a row of values. The pairwise() recipe reads the fixed column locations from cols as sliding (start, end) pairs; each pair is used as a string slice to extract the characters for that column, and .strip() then removes any surrounding spaces.

from itertools import groupby, tee
import csv

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

cols = [0, 5, 12, 15, 19, 24, 28, 32, 37, 41, 45, 50, 54, 58, 63, 68, 71, 76] # + rest
header = None

with open('data.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    for key, group in groupby(f_input, lambda x: not x.startswith('A=<')):
        if key:
            # Extract the timestamp from the first returned line in the segment
            # Split the line on spaces and take the first 3 elements
            timestamp = next(group).split()[:3]
            block = []

            for line in group:
                row = [line[start:end].strip() for start, end in pairwise(cols)]
                row.append(line[cols[-1]:].strip())     # Add GHI column
                block.append(timestamp + row)

            if not header:
                header = block[0][3:]
                header[2] = 'Unknown'
                csv_output.writerow(['Day', 'Date', 'Time'] + header)

            csv_output.writerows(block[1:])

For the data you have given, this would give you

Day,Date,Time,IFJ,SC/LY,Unknown,AB,CD?,EF,GH,IJK,LM,NO,PQR,ST,UV,WXY,ZA,BC,DEF,GHI
Tuesday,19-April-2010,00:01,1234,SC 122,,A,20?,31,1,4?,10,3,2?,19,9,5?,-,,-?,30
Tuesday,19-April-2010,00:01,1234,SC 123,',B,60?,11,2,3?,10,5,6?,19,9,4?,17,4,1?,30
Tuesday,19-April-2010,00:01,1234,SC 124,,C,20?,21,2,2?,10,2,1?,9,9,0?,-,,-?,34
Tuesday,19-April-2010,00:01,1234,SC 125,^,1,100?,-,,-?,0,3,3?,-,,-?,-,,-?,10
Tuesday,19-April-2010,00:01,1234,SC 226,*,5,60?,-,,-?,14,4,3?,9,9,2?,7,3,3?,18
Tuesday,19-April-2010,00:03,1234,SC 122,,A,21?,31,5,4?,17,3,2?,19,9,1?,-,,-?,31
Tuesday,19-April-2010,00:03,1234,SC 123,',B,61?,11,2,3?,19,5,6?,19,9,4?,17,4,3?,32
Tuesday,19-April-2010,00:03,1234,SC 124,,C,21?,21,5,2?,10,2,1?,9,9,0?,-,,-?,33
Tuesday,19-April-2010,00:03,1234,SC 125,^,1,101?,-,,-?,7,3,3?,-,,-?,-,,-?,14
Tuesday,19-April-2010,00:03,0000,SC 226,*,5,61?,-,,-?,14,4,3?,9,9,2?,7,3,3?,18

To get a better understanding of how the script actually works, I would recommend you create a small test file (with two segments) and then also add some print statements.
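Once output.csv exists, it loads straight into pandas. As a sketch, here is that load using an inline sample of the generated CSV (io.StringIO stands in for the real file, and the to_datetime step is an extra suggestion, not part of the script above):

```python
import io
import pandas as pd

sample = (
    "Day,Date,Time,IFJ,SC/LY,Unknown,AB,CD?,EF\n"
    "Tuesday,19-April-2010,00:01,1234,SC 122,,A,20?,31\n"
    "Tuesday,19-April-2010,00:03,1234,SC 123,',B,61?,11\n"
)
df = pd.read_csv(io.StringIO(sample))  # in practice: pd.read_csv('output.csv')

# Combine the Date and Time columns into a single timestamp for analysis
df['Timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'],
                                 format='%d-%B-%Y %H:%M')
print(df[['Timestamp', 'IFJ', 'EF']])
```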

Martin Evans
  • Hi! Really appreciate your help setting a direction for me to head towards. I was wondering how I can accurately assign the time? I've tried the code above, but the time column only shows the first timestamp across my 10k rows of data – ThanksForHelping Nov 27 '18 at 01:56
  • Hi, I'm so sorry for missing a line of data that might have caused the issue commented above; I've updated the data in the question – ThanksForHelping Nov 27 '18 at 02:25
  • I have updated the code to spot that line. It was definitely causing the problem. – Martin Evans Nov 27 '18 at 07:08
  • Thank you so much for your help. Would appreciate it if you could include some comments so that I can better understand your thought process. Cheers! – ThanksForHelping Nov 27 '18 at 08:05
  • I have improved the explanation a bit, but there are quite a few concepts in use which are worth understanding. Try adding some print statements to help see what each bit is doing. Hopefully this helps. – Martin Evans Nov 27 '18 at 14:16
  • Could you explain why that set of numbers was used for cols? – ThanksForHelping Nov 28 '18 at 01:52
  • Each number is the horizontal offset of where the columns start. So the first column is taken from `0` to `5`. The next column is taken from `5` to `12`. – Martin Evans Nov 28 '18 at 07:31

Your second row seems to be the actual header. But then still, it doesn't look right, as MEdwin pointed out.

Normally, you could try:

import pandas as pd

df = pd.read_csv("<your csv>", skiprows=1, delim_whitespace=True)
print(df.head())

But this will not work correctly, because the CSV isn't structured consistently.
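A quick way to see the problem, using two sample rows from the question: whitespace splitting silently drops the blank cells, so the rows come out with different numbers of fields and the columns no longer line up.

```python
row_a = ' 1234 SC 122      A  20?  31   1   4?  10   3   2?  19   9   5?   -       -?   30'
row_b = ' 1234 SC 125 ^    1 100?   -       -?   0   3   3?   -       -?   -       -?   10'

# Blank cells vanish under whitespace splitting, so the field counts differ
print(len(row_a.split()), len(row_b.split()))  # 17 16
```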

Niels Henkens

EDIT: This is the closest I can get with regex. Unfortunately any blank fields (e.g. column GH rows 4&5), will not be processed correctly and the data will be skewed.


The regex expression [^0-9A-Z\?\-]+ will match any characters that are NOT:

  • digits 0-9
  • capitalised letters
  • question marks
  • hyphens

...matched one or more times in a row.

This removes the weird special characters and deals with the multi white space delimiters.

You will end up with an extra blank column at the start. You can deal with that with some more regex, but I’ll leave that up to you for some homework (hungover).

The column SC/LY will be split into two. With two stages of delimiting (not using pandas to start) you can probably handle that.

>>> import re
>>> regex_string = r"[^0-9A-Z\?\-]+"
>>> p=re.compile(regex_string)
>>> header
' IFJ   SC/LY     AB  CD?  EF  GH  IJK  LM  NO  PQR  ST  UV  WXY  ZA  BC  DEF GHI'
>>> single_row
' 1234 SC 125 ^    1 101?   -       -?   7   3   3?   -       -?   -       -?   10'
>>> p.split(header)
['', 'IFJ', 'SC', 'LY', 'AB', 'CD?', 'EF', 'GH', 'IJK', 'LM', 'NO', 'PQR', 'ST', 'UV', 'WXY', 'ZA', 'BC', 'DEF', 'GHI']
>>> p.split(single_row)
['', '1234', 'SC', '125', '1', '101?', '-', '-?', '7', '3', '3?', '-', '-?', '-', '-?', '10']

With pandas, that would look like:

>>> import pandas
>>> df = pandas.read_csv(filepath, skiprows=1, header=0, delimiter=r"[^0-9A-Z\?\-]+", engine="python")
dijksterhuis