Convert awkwardly formatted data into a dataframe with pandas python

Question

I have some data in the following csv format:


Variable 1

Time	Value
Time1	12
Time2	32
Time3	4
Time4	5
Time5	34
Time6	5
Time7	46
Time8	7
Time9	8
Time10	543
Variable 2

Time	Value
Time1	1	2	3
Time2	2	45	5
Time3	4	2	54
Time4	3	1	2
Time5	3	2	4
Time6	4	5	8
Time7	4	7	4
Time8	8	65	12
Time9	12	8	14
Time10	65	65	13
Variable 3

Time	Value
Time1	3
Time2	4
Time3	5
Time4	2
Time5	1
Time6	7
Time7	5
Time8	3
Time9	5
Time10	7

And want to put it in the following data frame format with pandas:

         Variable1    Variable2    Variable3    
Time1           12     [1,2,3]             3
Time2           32    [2,45,5]             4
Time3            4    [4,2,54]             5

How would I even go about this? I know the format is awful, don't ask me why it's like that but I'm kinda stuck with it. I really don't even know where to begin with this one. TIA

Load it into temporary DF, Dropna in Col1, later, add new column and if Col1 has string `Variable*`, copy cell value to the nwe column, - later fffill down the column, the rest is just pivot or unstack and etc. — NoobVB, May 05 '22 at 10:18
check it in .txt format and try importing data frame from .txt, - looks way easier — NoobVB, May 05 '22 at 12:41
same error: Error tokenizing data. C error: Expected 2 fields in line 36, saw 91 — twoface, May 05 '22 at 13:44

NoobVB · Accepted Answer · 2022-05-05T19:16:20.590

1

updated code as per comments

initial file read is based on code from here:

import pandas as pd
import numpy as np

file = r'C:\Test\TIMBER-1100-10M.csv'
# Loop the data lines
with open(file, 'r') as temp_f:
    # get No of columns in each line
    col_count = [ len(l.split(",")) for l in temp_f.readlines() ]

# Generate column names  (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = [i for i in range(0, max(col_count))]
df = pd.read_csv(file, header=None, delimiter=",", names=column_names)

# preparing dataframe for pivot
df['Variable'] = np.where(df[0].str.contains('VARIABLE:'), df[0], np.nan)
df['Variable'].ffill(inplace=True)
df[1].dropna(inplace=True)
drop_values = ['Timestamp','VARIABLE:']
df2 = df[~df[0].str.contains('|'.join(drop_values))].astype({col: str for col in df.columns[2:]})

conc_col = df2.columns.to_list()
conc_col.remove(0)
conc_col.remove('Variable')

df2['Value'] = df2[conc_col].apply(lambda x: ','.join(x.dropna()), axis=1).str.strip(',nan')
df2.rename(columns={ df.columns[0]: "Time" }, inplace = True)

# creating the pivot as final dataframe
pivot = df2.pivot_table(index=['Time'],
                        columns=['Variable'],
                        values='Value',
                        aggfunc='sum')\
                        .rename_axis(None, axis=1)\
                        .reset_index()
pivot.to_excel(r'C:\Test\temp1.xlsx')

edited May 05 '22 at 19:16

answered May 05 '22 at 13:30

NoobVB

989
6
10

Yes, unfortunately the data underneath is the following blocks variable name, ie the title of the data frame column. Running that gives me: TypeError: read_csv() got an unexpected keyword argument 'on_bad_lines' – twoface May 05 '22 at 13:43
not sure why you get the error, maybe update the above question with your code, as pandas docs https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html do have parameter `on_bad_lines='skip'` – NoobVB May 05 '22 at 13:58
Yup, apparently we are stuck using pandas 1.2.2 which does not have this feature. – twoface May 05 '22 at 14:15
as per docs, try the old pandas param: `, error_bad_lines=False, warn_bad_lines=False` instead of one `on_bad_lines='skip'` – NoobVB May 05 '22 at 14:20
updated my answer, - just in case – NoobVB May 05 '22 at 14:29
Okay yea that works with my version. The result is close but its missing the titles and is just a 2 column data frame, it needs a column for each variable. I'm starting to think this is impossible and I need to come up with something different. I am having no luck either. – twoface May 05 '22 at 14:35
this option made it work for the start: https://stackoverflow.com/questions/27020216/import-csv-with-different-number-of-columns-per-row-using-pandas/57824142#57824142 – NoobVB May 05 '22 at 15:07
and the rest, as per my very first comment. – NoobVB May 05 '22 at 15:09
Yeah this is the closest I got too, I used a different method: `df = pd.read_csv('TIMBER-1100-10M_easy.csv', header=None, sep='\n') df = df[0].str.split(',', expand=True)` but it's still not quite there – twoface May 05 '22 at 15:11
Although the one you linked is better because that gives me NaNs instead of None throughout my df, but it still doesn't have the correct column names, but I think I might be able to go from the linked df to the correct one and post the solution, thanks! – twoface May 05 '22 at 15:12
Actually I was wrong, that's no where near the solution I need. – twoface May 05 '22 at 15:20
the linked code gives a quite nice starting dataframe, the rest is just small adjustments and pivot, I can try to write the full solution, if you will explain how you want to treat the bottom line as it has 89 duplicated values in the `Array Values` column and only 2 unique – NoobVB May 05 '22 at 15:27
Updated the code above as per comments – NoobVB May 05 '22 at 19:13

Convert awkwardly formatted data into a dataframe with pandas python

1 Answers1