I have a very large dataframe and I want to count the unique values in each column. The sample below shows only three columns; there are over 20 in total.
CRASH_DT CRASH_MO_NO CRASH_DAY_NO
1/1/2013 01 01
1/1/2013 01 01
1/5/2013 03 05
My desired output is like so:
<variable = "CRASH_DT">
<code>1/1/2013</code>
<count>2</count>
<code>1/5/2013</code>
<count>1</count>
</variable>
<variable = "CRASH_MO_NO">
<code>01</code>
<count>2</count>
<code>03</code>
<count>1</count>
</variable>
<variable = "CRASH_DAY_NO">
<code>01</code>
<count>2</count>
<code>05</code>
<count>1</count>
</variable>
I have been trying to use the .sum() or .unique() functions, as suggested by many other questions about this topic that I have already looked at.
None of them seems to apply to this problem. They all say that to generate unique values from every column, you should either use a groupby function or select individual columns. I have over 20 columns, so it doesn't really make sense to list them all by writing out df.unique['col1','col2'...'col20']
I have tried .unique(), .value_counts(), and .count(), but I can't figure out how to apply any of them across multiple columns without resorting to a groupby or the per-column selection suggested in those questions.
My question is: how can I generate a count of unique values from each of the columns in a truly massive dataframe, preferably by looping through the columns themselves? (I apologize if this is a duplicate, I have looked through a whole lot of questions on this topic and while they seem like they should work for my problem as well, I can't figure out exactly how to tweak them to get them to work for me.)
This is my code so far:
import pyodbc
import pandas.io.sql
conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Users\\<filename>.accdb')
sql_crash = "SELECT * FROM CRASH"
df_crash = pandas.io.sql.read_sql(sql_crash, conn)
df_c_head = df_crash.head()
df_c_desc = df_c_head.describe()
for k in df_c_desc:
    df_c_unique = df_c_desc[k].unique()
    print(df_c_unique.value_counts())  # Raises AttributeError: 'numpy.ndarray' object has no attribute 'value_counts'
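For reference, here is a minimal sketch of the kind of loop I am after. It relies on the fact that Series.unique() returns a NumPy array (hence the error above), whereas Series.value_counts() can be called directly on each column. The small DataFrame is a hypothetical stand-in built from the sample rows, and the helper names column_value_counts and format_counts are made up for illustration. I have not tried this against the real Access table.

```python
import pandas as pd

# Hypothetical stand-in for df_crash, built from the three sample rows.
df_crash = pd.DataFrame({
    "CRASH_DT": ["1/1/2013", "1/1/2013", "1/5/2013"],
    "CRASH_MO_NO": ["01", "01", "03"],
    "CRASH_DAY_NO": ["01", "01", "05"],
})


def column_value_counts(df):
    """Map each column name to a Series of counts per distinct value."""
    # value_counts() is called on the column (a Series), not on the
    # ndarray returned by unique().
    return {col: df[col].value_counts() for col in df.columns}


def format_counts(counts):
    """Render the counts in the tag layout shown in the desired output."""
    lines = []
    for col, vc in counts.items():
        lines.append(f'<variable = "{col}">')
        for value, n in vc.items():
            lines.append(f"<code>{value}</code>")
            lines.append(f"<count>{n}</count>")
        lines.append("</variable>")
    return "\n".join(lines)


counts = column_value_counts(df_crash)
report = format_counts(counts)
print(report)
```

Note that this loops over the original dataframe's columns rather than over the output of head().describe(), since describe() returns summary statistics instead of the raw values.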