How can python handle a messy data format in pandas?

Question

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about this topic.

I will post the data sample from the link here just in case the question will be deleted.

      itm Date                  Amount 
67    420 2012-09-30 00:00:00   65211
68    421 2012-09-09 00:00:00   29424
69    421 2012-09-16 00:00:00   29877
70    421 2012-09-23 00:00:00   30990
71    421 2012-09-30 00:00:00   61303
72    485 2012-09-09 00:00:00   71781
73    485 2012-09-16 00:00:00     NaN
74    485 2012-09-23 00:00:00   11072
75    485 2012-09-30 00:00:00  113702
76    489 2012-09-09 00:00:00   64731
77    489 2012-09-16 00:00:00     NaN

As you can see this is not the right way to post reproducible data.

What can Python code do to make that table usable?

Reading it directly with pandas might be problematic, but since the data is structured it should be no problem after some simple pre processing. For your example you could use something like this as a starting point `indices = [0, 6, 10, 29]; s = "67 420 2012-09-30 00:00:00 65211"; [s[i:j] for i,j in zip(indices, indices[1:]+[None])]` Result is an array with each value as a separate entry — Markus, Mar 11 '22 at 10:51

score 2 · Answer 1 · answered Mar 12 '22 at 15:35

The way I proceed when I have to deal with this kind of data, which is indeed frequent in SO posts, is to:

first, copy (ctrl+c or right click) the data except the header row ( itm Date Amount in your example);
then, run the following code:

import pandas as pd

df = pd.read_clipboard()

# save data which was misinterpreted as column header
first_row = df.columns.to_list()

# rename column headers as 0, 1, 2, ...
df.columns = range(df.shape[1])

# Insert first row, sort dataframe, then get rid of first column while saving to csv
df = (
    pd.concat([pd.DataFrame(first_row).T, df], ignore_index=True)
    .set_index(0)
    .to_csv("temp.csv", index=False)
)

So that:

import pandas as pd

df = pd.read_csv("temp.csv")
print(df)
# Output
      1           2         3         4
0   420  2012-09-30  00:00:00   65211.0
1   421  2012-09-09  00:00:00   29424.0
2   421  2012-09-16  00:00:00   29877.0
3   421  2012-09-23  00:00:00   30990.0
4   421  2012-09-30  00:00:00   61303.0
5   485  2012-09-09  00:00:00   71781.0
6   485  2012-09-16  00:00:00       NaN
7   485  2012-09-23  00:00:00   11072.0
8   485  2012-09-30  00:00:00  113702.0
9   489  2012-09-09  00:00:00   64731.0
10  489  2012-09-16  00:00:00       NaN

From there, you can do some additional processing, depending on the data you are dealing with.

In your example:

df = df.drop(columns="3").rename(columns={"1": "itm", "2": "Date", "4": "Amount"})

print(df)
# Output
    itm        Date    Amount
0   420  2012-09-30   65211.0
1   421  2012-09-09   29424.0
2   421  2012-09-16   29877.0
3   421  2012-09-23   30990.0
4   421  2012-09-30   61303.0
5   485  2012-09-09   71781.0
6   485  2012-09-16       NaN
7   485  2012-09-23   11072.0
8   485  2012-09-30  113702.0
9   489  2012-09-09   64731.0
10  489  2012-09-16       NaN

JAdel · Answer 2 · 2022-03-16T20:32:44.920

I'm not a fan of answering my own questions, but I've found a good approach using a combination of StringIO and Regex and would like to share it. First, I copy the table without the header. Then I remove the whitespaces in regex and replace them with commas. Then I read the string into a StringIO object. Finally I read the StringIO object with pandas and rename the columns.

import pandas as pd
import os,sys,io, re

temp = """
67    420 2012-09-30 00:00:00   65211
68    421 2012-09-09 00:00:00   29424
69    421 2012-09-16 00:00:00   29877
70    421 2012-09-23 00:00:00   30990
71    421 2012-09-30 00:00:00   61303
72    485 2012-09-09 00:00:00   71781
73    485 2012-09-16 00:00:00     NaN
74    485 2012-09-23 00:00:00   11072
75    485 2012-09-30 00:00:00  113702
76    489 2012-09-09 00:00:00   64731
77    489 2012-09-16 00:00:00     NaN"""

temp2 = re.sub(r" +", r",", temp)
temp2 = io.StringIO(temp2) 
df = pd.read_csv(temp2, sep=",",header=None)
df = df.rename(columns={0:"id",1:"itm",2:"Date",3:"Time",4:"Amount"})

print(df)

    id  itm        Date      Time    Amount
0   67  420  2012-09-30  00:00:00   65211.0
1   68  421  2012-09-09  00:00:00   29424.0
2   69  421  2012-09-16  00:00:00   29877.0
3   70  421  2012-09-23  00:00:00   30990.0
4   71  421  2012-09-30  00:00:00   61303.0
5   72  485  2012-09-09  00:00:00   71781.0
6   73  485  2012-09-16  00:00:00       NaN
7   74  485  2012-09-23  00:00:00   11072.0
8   75  485  2012-09-30  00:00:00  113702.0
9   76  489  2012-09-09  00:00:00   64731.0
10  77  489  2012-09-16  00:00:00       NaN

How can python handle a messy data format in pandas?

2 Answers2