How can I build a single row that holds the data type, maximum value length, and count for each column of a data frame, as shown in the desired output section at the bottom?

import pandas as pd

table = 'sample_data'
idx=0


#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,'NULL',40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65]),
   'new_column':pd.Series([], dtype='float64')  # explicit dtype avoids the empty-Series dtype warning
}

#Create a DataFrame using above data
sdf = pd.DataFrame(d)

#Create a summary description
desired_data = sdf.describe(include='all').T
desired_data = desired_data.rename(columns={'index':'Variable'})
#print(desired_data)

#Get Data Type
dtype = sdf.dtypes
#print(dtype)

#Get total count of records (need to work on)
counts = sdf.shape[0] # gives number of row count

#Get maximum length of values
maxcollen = []
for col in range(len(sdf.columns)):
    maxcollen.append(max(sdf.iloc[:,col].astype(str).apply(len)))
#print('Max Column Lengths ', maxcollen)


#Constructing final data frame
desired_data = desired_data.assign(data_type = dtype.values)
desired_data = desired_data.assign(total_count = counts)
desired_data = desired_data.assign(max_col_length = maxcollen)

final_df = desired_data

final_df = final_df.reindex(columns=['data_type','max_col_length','total_count'])
final_df.insert(loc=idx, column='table_name', value=table)


final_df.to_csv('desired_data.csv')
#print(final_df)

Output of the above code: (image not available)

The desired output I am looking for is:

In : sdf
Out:
table_name     Name_data_type   Name_total_count  Name_max_col_length Age_data_type   Age_total_count  Age_max_col_length     Rating_data_type   Rating_total_count  Rating_max_col_length
sample_data    object           12                6                   object          12                4                     float64            12                  4

As you may have noticed, I want a single row with columns named column_name_data_type, column_name_total_count, and column_name_max_col_length for each original column, each holding the corresponding value.

2 Answers

Here's a solution:

df = final_df
df = df.drop("new_column").drop("table_name", axis=1)
df = df.reset_index()
df.melt(id_vars=["index"]).set_index(["index", "variable"]).sort_index().transpose()

The result is:

index          Age                                 Name                 \
variable data_type max_col_length total_count data_type max_col_length  ... 
value       object              4          12    object              6  ... 
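The result above has two-level (MultiIndex) column labels, whereas the question asks for flat names such as Name_data_type. One possible way to get there is to join the two levels after the transpose. This is only a sketch: the small `final_df` below is a hypothetical stand-in built from the values in the desired output, and the name `wide` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for final_df from the question, with the
# "new_column" row already left out (values taken from the desired output).
final_df = pd.DataFrame(
    {
        "table_name": "sample_data",
        "data_type": ["object", "object", "float64"],
        "max_col_length": [6, 4, 4],
        "total_count": [12, 12, 12],
    },
    index=["Name", "Age", "Rating"],
)

# Same melt/transpose chain as in the answer above.
wide = (
    final_df.drop("table_name", axis=1)
    .reset_index()
    .melt(id_vars=["index"])
    .set_index(["index", "variable"])
    .sort_index()
    .transpose()
)

# Join the two column levels into flat names such as "Name_data_type".
wide.columns = [f"{col}_{stat}" for col, stat in wide.columns]

# Put the table name back as the first column.
wide.insert(0, "table_name", "sample_data")

print(wide.columns.tolist())
```

Because of `sort_index()`, the columns come out in alphabetical order (Age, Name, Rating); reindexing `wide` with an explicit column list would restore the order from the question if that matters.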
Roy2012

Can you try this:

The code below iterates over every row of the dataframe, so it may be slow on large inputs. It is not the optimal solution, but it works for the problem above.

from collections import OrderedDict

## storing key-value pairs
result_dic = OrderedDict()
unique_table_names = final_df["table_name"].unique()
# remove the unwanted "new_column" row (note: this modifies final_df in place)
final_df.drop("new_column", inplace=True)
cols_name = final_df.columns
## for every unique table name, generate a row
for table_name in unique_table_names:
    result_dic["table_name"] = table_name
    filtered_df = final_df[final_df["table_name"] == table_name]
    # iterrows() yields (row label, row Series) pairs
    for row_label, row in filtered_df.iterrows():
        for col in cols_name:
            if col != "table_name":
                result_dic[row_label + "_" + col] = row[col]

Convert dict to dataframe

## convert the dict to a dataframe
result_df = pd.DataFrame([result_dic])
result_df

expected output is:

   table_name   Name_data_type  Name_max_col_length  Name_total_count  Age_data_type  Age_max_col_length  Age_total_count  Rating_data_type  Rating_max_col_length  Rating_total_count
0  sample_data  object          6                    12                object         4                   12               float64           4                      12
Narendra Prasath
  • If you don't need every column, you can set `cols_name` to just the columns you need instead of `final_df.columns`. This operation might take some time, since it iterates over the entire dataset once. Let me know if you face any challenge with my solution – Narendra Prasath Jun 21 '20 at 08:57
  • This solution works absolutely fine. Thank you. However, how do I remove the leading index column? Its column name is blank and its value is 0. – AWSGeekCoder Jun 21 '20 at 14:48
  • Thanks. `0` is the index of the row. If you need to change the index label, use `result_df.index = ["name"]`; the list length must match the number of rows in the dataframe. If this solved your question, please upvote and mark it as the correct answer; it would help others as well – Narendra Prasath Jun 21 '20 at 14:54
  • What exactly do you need? Can you elaborate a bit on `How do I eliminate showing the index`? What I understand is that you don't want the index value shown when printing the dataframe, right? Then you could try `result_df = result_df.style.hide_index()` followed by `print(result_df)`. I am not sure what the implications of this are. – Narendra Prasath Jun 21 '20 at 15:03
  • Yes, correct! I want to store it in a CSV without the index values. I tried this, but got this error: 'Styler' object has no attribute 'to_csv' – AWSGeekCoder Jun 21 '20 at 15:12
  • Oh. Then you should not use `result_df.style.hide_index()`. When saving to CSV, you can pass `index=False`: `result_df.to_csv('output.csv', index=False)` – Narendra Prasath Jun 21 '20 at 15:13
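
To illustrate the last comment, here is a minimal sketch of writing a one-row frame to CSV without the index column. The small `result_df` is a hypothetical stand-in for the real one, and an in-memory `io.StringIO` buffer is used in place of a path such as 'output.csv' so that the text can be inspected directly.

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for result_df from the answer.
result_df = pd.DataFrame(
    [{"table_name": "sample_data", "Name_data_type": "object", "Name_max_col_length": 6}]
)

# In-memory buffer used here instead of a file path such as 'output.csv'.
buffer = io.StringIO()

# index=False leaves out the unnamed leading index column.
result_df.to_csv(buffer, index=False)

csv_text = buffer.getvalue()
print(csv_text)
```

With `index=False`, the header line starts directly with table_name rather than with an empty field for the index.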