I have a headerless, unstructured CSV file where the first two records conform to a fixed layout, but from the third row onwards the number of columns varies.

100,NEM12,202006011242,TCAUSTM,PROGREEN 
200,VCCCCB0100,E1K1Q1,001,E1,N1,006668,KWH,15,
300,20200406,122.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,120.000,120.000,122.000,120.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,F16,28,,20200601113727,
300,20200407,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,4.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,4.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,F16,28,,20200601113727,

I have tried transposing the rows into columns as suggested here, which produces the output below:

(screenshot of the transposed output)

Ideally, though, I would process the first two records as two columnar dataframes, and transpose each 300 record as above, except that I would like VCCCCB0100, E1, 006668 & KWH appended from the dataframe of the 200 record, so that the transposed dataframe appears as below for every 300 record. The 300 dataframe borrows several columns from the 200 record.

(screenshot of the desired transposed output)
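For reference, one way to load a ragged file like this into pandas at all is to pad the column names out to the widest row; this is a sketch (the `names=range(...)` padding and the per-record-type split are assumptions, not part of the question):

```python
import io
import pandas as pd

# Hypothetical shortened sample with the same 100/200/300 record types.
raw = """100,NEM12,202006011242,TCAUSTM,PROGREEN
200,VCCCCB0100,E1K1Q1,001,E1,N1,006668,KWH,15,
300,20200406,122.000,120.000,2.000
300,20200407,2.000,0.000,4.000"""

# Pad every row to the width of the widest one so read_csv accepts the ragged file;
# shorter rows are filled with NaN.
max_cols = max(line.count(",") + 1 for line in raw.splitlines())
df = pd.read_csv(io.StringIO(raw), header=None, names=range(max_cols))

# Split into one frame per record type using the first column.
header_100 = df[df[0] == 100]
meta_200 = df[df[0] == 200]
readings_300 = df[df[0] == 300]
print(readings_300.shape)  # (2, 10) - two 300 rows, padded to 10 columns
```

This only gets the data into memory; the per-row transposition still has to be done separately.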

Balajee Addanki

1 Answer

I think you have to be systematic about building up the structure you want; I can't see any way to get read_csv() to work on this directly. The approach:

  1. load the CSV data from the file and use a list comprehension to build a 2D list
  2. concat() the columns from the first and second rows with the date from each remaining row, plus an array of readings and an array of interval numbers
  3. use pd.Series.explode on the arrays built in step 2 to unnest the two list columns together
  4. finally clean up: name the columns and drop the unwanted ones
import pandas as pd

data = '''100,NEM12,202006011242,TCAUSTM,PROGREEN
200,VCCCCB0100,E1K1Q1,001,E1,N1,006668,KWH,15,
300,20200406,122.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,122.000,120.000,120.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,122.000,120.000,120.000,120.000,120.000,122.000,120.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,F16,28,,20200601113727,
300,20200407,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,4.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,4.000,0.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,2.000,0.000,F16,28,,20200601113727,'''

with open("sample.csv", "w") as f:
    f.write(data)

with open("sample.csv") as f:
    data = f.read()

arr = [l.split(",") for l in data.split("\n")]    
df = pd.concat([
    pd.DataFrame(arr[0][1:]).T, # pick columns from first row - exclude 100
    pd.DataFrame(arr[1][1:]).T, # pick columns from second row - exclude 200
    # pick the interval_date (exclude 300) and then turn all other columns into an array
    pd.DataFrame(arr[2:]).iloc[:,1:2]\
        .assign(Reading=pd.DataFrame(arr[2:]).iloc[:,2:].values.tolist(),
               Int_No=[[i+1 for i,e in enumerate(l)] for l in pd.DataFrame(arr[2:]).iloc[:,2:].values.tolist()])
    ], axis=1).fillna(method="ffill") # forward-fill the single-row 100/200 values onto every 300 row
df.columns = [i for i,c in enumerate(df.columns)] # clean up the column numbers
# explode the reading, then name columns and drop redundant ones
# https://stackoverflow.com/questions/45846765/efficient-way-to-unnest-explode-multiple-list-columns-in-a-pandas-dataframe
df = df.set_index(list(df.columns[:-2])).apply(pd.Series.explode).reset_index()\
    .rename({14:"Reading",13:"Interval_date",11:"Interval_length",4:"NEMI",7:"NMI_Suffix",
            10:"UOM",15:"Interval_No"}, axis=1)\
    .drop([0,1,2,3,5,6,8,9,12], axis=1)

print(df[:10].to_string(index=False))

output

       NEMI NMI_Suffix  UOM Interval_length Interval_date  Reading Interval_No
 VCCCCB0100         E1  KWH              15      20200406  122.000           1
 VCCCCB0100         E1  KWH              15      20200406  120.000           2
 VCCCCB0100         E1  KWH              15      20200406  120.000           3
 VCCCCB0100         E1  KWH              15      20200406  122.000           4
 VCCCCB0100         E1  KWH              15      20200406  120.000           5
 VCCCCB0100         E1  KWH              15      20200406  120.000           6
 VCCCCB0100         E1  KWH              15      20200406  120.000           7
 VCCCCB0100         E1  KWH              15      20200406  122.000           8
 VCCCCB0100         E1  KWH              15      20200406  120.000           9
 VCCCCB0100         E1  KWH              15      20200406  120.000          10
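As a side note, on pandas 1.3+ the set_index / apply(pd.Series.explode) trick can be replaced by DataFrame.explode with a list of columns. A minimal sketch, using a hypothetical frame shaped like the pre-explode intermediate above (list-valued Reading and Interval_No columns beside scalar metadata):

```python
import pandas as pd

# Toy frame mimicking the pre-explode shape: one row per 300 record,
# with the readings and interval numbers held as equal-length lists.
df = pd.DataFrame({
    "NEMI": ["VCCCCB0100"],
    "Interval_date": ["20200406"],
    "Reading": [[122.0, 120.0, 120.0]],
    "Interval_No": [[1, 2, 3]],
})

# pandas >= 1.3: explode several equal-length list columns in one call;
# the scalar columns are repeated for every exploded row.
out = df.explode(["Reading", "Interval_No"], ignore_index=True)
print(out)
```

This keeps the column order intact and avoids moving every scalar column into the index first.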
Rob Raymond