I have a CSV file with many rows, and the rows have different numbers of columns.

How can I group the rows by their column count and show each group in a separate DataFrame?

The CSV file contains the following data:

1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18

Because each row has a different number of columns, I have to group the rows by column count and show 3 frames, so that I can then set a header for each:

   FR1:
        ID  NAME  STATE  COUNTRY  HOBBY
        1   OLEG  US     FRANCE   BIG

   FR2:
        ID  NAME  COUNTRY  AGE
        1   OLEG  FR       18

   FR3:
        ID  NAME  AGE
        1   NATA  18

In other words, I need to group rows by column count and show them in different DataFrames.

George
  • Welcome to Stack Overflow! Unfortunately your question is not very clear. Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your question to include a minimal reproducible example showing sample input, expected output, and code for what you've tried so far based on your own research. This will help us better understand how to help you – G. Anderson May 23 '22 at 22:19
  • I have added some details – George May 23 '22 at 22:23
  • pandas doesn't tolerate variable-width rows, so it's not obvious to us what your dataframe looks like. Can you show us the code you have so far, and the results of `print(df)` at the point that you'd like to do this? That would help us understand where you're starting from. – Michael Delgado May 23 '22 at 22:29
  • I showed the data in the question: it is three lines with different numbers of columns. For each column I have to determine the column name – George May 23 '22 at 22:34
  • The same question is here: https://discuss.dizzycoding.com/import-csv-with-different-number-of-columns-per-row-using-pandas/ – George May 23 '22 at 23:01

1 Answer

Since a pandas DataFrame can't hold rows of different lengths, the simplest approach is not to use pandas for the import. Your goal is to create three separate DataFrames, so first import the data as lists and then deal with their different lengths.

One way to solve this is to read the data with csv.reader and build each DataFrame with a list comprehension that filters the rows by length.

import csv
import pandas as pd

# Read every row as a list of strings; rows may differ in length
with open('input.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

# Build one DataFrame per row length, each with its own header
df1 = pd.DataFrame([item for item in data if len(item) == 3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item) == 4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item) == 5], columns='ID NAME STATE COUNTRY HOBBY'.split())

print(df1, df2, df3, sep='\n\n')

  ID  NAME AGE
0  1  NATA  18

  ID  NAME COUNTRY AGE
0  1  OLEG      FR  18

  ID  NAME STATE COUNTRY HOBBY
0  1  OLEG    US  FRANCE   BIG

If you would have to hardcode many lines for the same step (e.g. many DataFrames), consider creating them in a loop and storing each DataFrame as a value in a dictionary.

EDIT: Here is a slightly optimized way of creating those DataFrames. I don't think you can get around creating a list of the column names you want to use for the separate DataFrames, so you need to know which column counts occur in your data (unless you want to create the DataFrames without naming the columns).

import csv
import pandas as pd

# One header per expected row length
col_list = [
    ['ID', 'NAME', 'AGE'],
    ['ID', 'NAME', 'COUNTRY', 'AGE'],
    ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY'],
]

with open('input.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

# Key each DataFrame by its column count, e.g. 'df_3'
dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame([item for item in data if len(item) == len(cols)], columns=cols)

for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')

key='df_3': 
   ID  NAME AGE
0  1  NATA  18 

key='df_4': 
   ID  NAME COUNTRY AGE
0  1  OLEG      FR  18 

key='df_5': 
   ID  NAME STATE COUNTRY HOBBY
0  1  OLEG    US  FRANCE   BIG 

Now you don't have separate variables for your DataFrames; instead they are stored in a dictionary as values. (I named each key after the number of columns its DataFrame has: df_3 is the DataFrame with three columns.)
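If the column counts are not known up front, the rows can be grouped by length first and the headers assigned afterwards. The following is only a minimal sketch using the standard library: the helper name group_rows_by_width is made up for illustration, and the sample data is read from an in-memory string instead of input.csv so that the snippet runs on its own.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_width(lines, delimiter=" "):
    """Group parsed rows by their number of fields (hypothetical helper)."""
    groups = defaultdict(list)
    for row in csv.reader(lines, delimiter=delimiter):
        if row:  # csv.reader yields [] for blank lines; skip those
            groups[len(row)].append(row)
    return dict(groups)


# Stand-in for open('input.csv', newline=''), using the rows from the question
sample = io.StringIO("1 OLEG US FRANCE BIG\n1 OLEG FR 18\n1 NATA 18\n")
groups = group_rows_by_width(sample)

for width in sorted(groups):
    print(width, groups[width])
```

Each list in groups can then be passed to pd.DataFrame, with or without a matching columns list, once you have seen which widths actually occur.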

If you need to import the data with pandas, you could have a look at this post.
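One commonly used way to do the import with pandas itself (not necessarily what the linked post describes) is to pass read_csv a names list as wide as the longest row, so that shorter rows are padded with NaN, and then split the frame by the number of non-missing values per row. This is a rough sketch: MAX_COLS is an assumed upper bound you would have to know or compute, the sample string stands in for input.csv, and a row with a genuinely empty field would be counted as narrower than it is.

```python
import io

import pandas as pd

# Stand-in for input.csv, using the rows from the question
SAMPLE = "1 OLEG US FRANCE BIG\n1 OLEG FR 18\n1 NATA 18\n"
MAX_COLS = 5  # assumption: the widest row has 5 fields

# Short rows are padded with NaN up to MAX_COLS columns
raw = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=" ",
    header=None,
    names=range(MAX_COLS),
    dtype=str,
)

# Number of filled cells per row, used as the grouping key
widths = raw.notna().sum(axis=1)

frames = {
    int(width): group.dropna(axis=1, how="all").reset_index(drop=True)
    for width, group in raw.groupby(widths)
}

for width, frame in frames.items():
    print(width, frame, sep="\n", end="\n\n")
```

The columns are still numbered here; a header can be assigned afterwards, for example frames[3].columns = ['ID', 'NAME', 'AGE'].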

Rabinzel