1

I have a CSV file which contain unmatched columns, (i.e) first row will have 8 columns and second row will have 10 columns like below,

input.csv,

B|prem|29|get|get1|get2|get3|
T|prem1|30|get|get1|get2|get3|get4|get5|get6
B|prem2|31|get|get1|get2|get3|
T|prem1|30|get|get1|get2|get3|get4|get5|get6
T|prem1|30|get|get1|get2|get3|get4|get5|get6
B|prem2|31|get|get1|get2|get3|

if you look out above CSV has two different columns size within single csv file, While i am trying to parse this data getting parser-tokenize error.

My expected output is,

i should get two dataframe out of this input.csv file like below,

dataframe 1:

 B|prem|29|get|get1|get2|get3|
 B|prem2|31|get|get1|get2|get3|
 B|prem2|31|get|get1|get2|get3|

dataframe 2:

T|prem1|30|get|get1|get2|get3|get4|get5|get6
T|prem1|30|get|get1|get2|get3|get4|get5|get6
T|prem1|30|get|get1|get2|get3|get4|get5|get6

Code which tried for,

import pandas as pd
df = pd.read_csv('input.csv',sep='|',header=None)

Faced Tokenzise error during the execution.

Can someone help me out on this ? Thanks in Advance.

prem Rexx
  • 93
  • 10
  • See the tips on cleaning data before passing to pandas on the accepted answer here: https://stackoverflow.com/questions/34358196/read-csv-with-missing-incomplete-header-or-irregular-number-of-columns – user3468054 Apr 20 '21 at 09:59
  • Thanks for quick response. I already gone through it. But that is not exact one i'm looking for. @user3468054 – prem Rexx Apr 20 '21 at 10:11

1 Answers1

3

You can try itertools.groupby()

import itertools
import pandas as pd
from io import StringIO

with open('test.csv') as f:
    lines = f.read().splitlines()

for key, group in itertools.groupby(sorted(lines, key=lambda x: x[0]), lambda x: x[0]):
    df = pd.read_csv(StringIO('\n'.join(group)), sep='|', header=None).dropna(axis=1)
    print(df)
   0      1   2    3     4     5     6
0  B   prem  29  get  get1  get2  get3
1  B  prem2  31  get  get1  get2  get3
2  B  prem2  31  get  get1  get2  get3

   0      1   2    3     4     5     6     7     8     9
0  T  prem1  30  get  get1  get2  get3  get4  get5  get6
1  T  prem1  30  get  get1  get2  get3  get4  get5  get6
2  T  prem1  30  get  get1  get2  get3  get4  get5  get6

You can also create dataframe without StringIO

df = pd.DataFrame([ele.split('|') for ele in group]).replace('', np.nan).dropna(axis=1)
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52