
I have a very large dataframe and I want to generate the unique values from each column, along with a count of each. This is just a sample; there are over 20 columns in total.

          CRASH_DT        CRASH_MO_NO     CRASH_DAY_NO
          1/1/2013        01              01    
          1/1/2013        01              01
          1/5/2013        03              05

My desired output is like so:

<variable = "CRASH_DT">
   <code>1/1/2013</code>
   <count>2</count>
   <code>1/5/2013</code>
   <count>1</count>
</variable>
<variable = "CRASH_MO_NO">
   <code>01</code>
   <count>2</count>
   <code>03</code>
   <count>1</count>
</variable>
<variable = "CRASH_DAY_NO">
   <code>01</code>
   <count>2</count>
   <code>05</code>
   <count>1</count>
</variable>

I have been trying the .sum() and .unique() functions, as suggested by many other questions on this topic that I have already looked at.

None of them seems to apply to this problem: they all say that to generate unique values from every column, you should either use a groupby or select the columns individually. I have a very large number of columns (over 20), so it doesn't really make sense to write them all out by hand, e.g. df['col1'], df['col2'], ... df['col20'].

I have tried .unique(), .value_counts(), and .count(), but I can't figure out how to apply any of them across multiple columns at once, rather than through a groupby or the other approaches suggested in the questions above.

My question is: how can I generate a count of unique values from each of the columns in a truly massive dataframe, preferably by looping through the columns themselves? (I apologize if this is a duplicate; I have looked through a lot of questions on this topic, and while they seem like they should cover my problem as well, I can't figure out exactly how to tweak them to work for me.)

This is my code so far:

import pyodbc
import pandas.io.sql

conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Users\\<filename>.accdb')

sql_crash = "SELECT * FROM CRASH"
df_crash = pandas.io.sql.read_sql(sql_crash, conn)
df_c_head = df_crash.head()
df_c_desc = df_c_head.describe()

for k in df_c_desc:
    df_c_unique = df_c_desc[k].unique()
    print(df_c_unique.value_counts())  # AttributeError: 'numpy.ndarray' object has no attribute 'value_counts'
ale19
    Doesn't `df_crash.apply(pd.Series.value_counts)` work? – EdChum Sep 15 '15 at 14:52
  • I imagine it would, but I don't know where to start with it. If I understand it correctly, pd.Series refers to each column, so I think I need to loop through each column somehow? Is that correct? – ale19 Sep 15 '15 at 16:29
  • I have posted an answer to illustrate this use in your case. – Romain Sep 15 '15 at 16:47
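
For reference, the one-liner from the first comment applied to the sample data above would look like this (a minimal sketch; the result is a single frame whose index is the union of every column's values, with NaN where a value never occurs in that column):

import pandas as pd

# The sample frame from the question.
df = pd.DataFrame({
    "CRASH_DT": ["1/1/2013", "1/1/2013", "1/5/2013"],
    "CRASH_MO_NO": ["01", "01", "03"],
    "CRASH_DAY_NO": ["01", "01", "05"],
})

# One value_counts() Series per column, aligned on the shared index.
print(df.apply(pd.Series.value_counts))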

2 Answers


I would loop over value_counts().items() per column:

>>> df["CRASH_DAY_NO"].value_counts()
01    2
05    1
dtype: int64
>>> df["CRASH_DAY_NO"].value_counts().items()
<zip object at 0x7fabf49f05c8>
>>> for value, count in df["CRASH_DAY_NO"].value_counts().items():
...     print(value, count)
...     
01 2
05 1

So something like

def vc_xml(df):
    # One <variable> block per column, with a <code>/<count> pair per unique value.
    for col in df:
        yield '<variable = "{}">'.format(col)
        for k,v in df[col].value_counts().items():
            yield "   <code>{}</code>".format(k)
            yield "   <count>{}</count>".format(v)
        yield '</variable>'

with open("out.xml", "w") as fp:
    for line in vc_xml(df):
        fp.write(line + "\n")

gives me

<variable = "CRASH_DAY_NO">
   <code>01</code>
   <count>2</count>
   <code>05</code>
   <count>1</count>
</variable>
<variable = "CRASH_DT">
   <code>1/1/2013</code>
   <count>2</count>
   <code>1/5/2013</code>
   <count>1</count>
</variable>
<variable = "CRASH_MO_NO">
   <code>01</code>
   <count>2</count>
   <code>03</code>
   <count>1</count>
</variable>
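
One caveat worth noting: the values are interpolated into the XML verbatim. If any of your 20-odd columns can contain characters like & or <, a variant that escapes them keeps the file well-formed (a sketch along the same lines; the vc_xml_escaped name is illustrative, and it uses the standard library's xml.sax.saxutils.escape):

from xml.sax.saxutils import escape

def vc_xml_escaped(df):
    # Same structure as vc_xml above, but with XML-escaped column names and values.
    for col in df:
        yield '<variable = "{}">'.format(escape(str(col)))
        for k, v in df[col].value_counts().items():
            yield "   <code>{}</code>".format(escape(str(k)))
            yield "   <count>{}</count>".format(v)
        yield '</variable>'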
DSM
  • Wow, this is perfect! Works great when I just use df.head(). Thank you very much. However, when I run the program (with some adjustments) over the whole dataframe (not just the columns listed above), it still gives me those RuntimeErrors, and it seems to hang without generating an XML file. I listed my code, the errors, and the data that I think might be causing the problem here: http://pastebin.com/ZwcXEkxQ – ale19 Sep 16 '15 at 14:57
  • Belay my last comment! Running it from the command line doesn't seem to produce any problems, and generates exactly what I need. Maybe it's an issue with my Anaconda install or something. I really appreciate this, I was completely stuck and this was driving me nuts. – ale19 Sep 16 '15 at 16:47

Here is an answer inspired by the answer to this question, though I don't know whether it will scale well enough in your case.

import pandas as pd

df = pd.DataFrame({
    'CRASH_DAY_NO': [1, 1, 5, 2, 2],
    'CRASH_DT': ['10/2/2014 5:00:08 PM',
                 '5/28/2014 1:29:28 PM',
                 '5/28/2014 1:29:28 PM',
                 '7/14/2014 5:42:03 PM',
                 '6/3/2014 10:33:22 AM'],
    'CRASH_ID': [1486150, 1486152, 1486224, 1486225, 1486226],
    'SEG_PT_LRS_MEAS': [79.940226960000004,
                        297.80989999000002,
                        140.56460290999999,
                        759.43600000000004,
                        102.566036],
    'SER_NO': [1, 3, 4, 5, 6]})

# One value_counts() per column; the index becomes the union of all values.
df = df.apply(lambda x: x.value_counts(sort=False))
df.index = df.index.astype(str)  # make mixed-type index values comparable
# Transforming to XML by hand ...
def func(row):
    xml = ['<variable = "{0}">'.format(row.name)]
    for field in row.index:
        if not pd.isnull(row[field]):
            xml.append('  <code>{0}</code>'.format(field))
            xml.append('  <count>{0}</count>'.format(row[field]))
    xml.append('</variable>')
    return '\n'.join(xml)

print('\n'.join(df.apply(func, axis=0)))

<variable = "CRASH_DAY_NO">
  <code>1</code>
  <count>2.0</count>
  <code>2</code>
  <count>2.0</count>
  <code>5</code>
  <count>1.0</count>
</variable>
<variable = "CRASH_DT">
  <code>5/28/2014 1:29:28 PM</code>
  <count>2.0</count>
  <code>7/14/2014 5:42:03 PM</code>
  <count>1.0</count>
  <code>10/2/2014 5:00:08 PM</code>
  <count>1.0</count>
  <code>6/3/2014 10:33:22 AM</code>
  <count>1.0</count>
</variable>
....
Romain
  • When I run this (replacing df with my giant dataset imported from Access), I get repeated, various runtime warnings: "RuntimeWarning: Cannot compare type 'Timestamp' with type 'int', sort order is undefined for incomparable objects"; "RuntimeWarning: unorderable types: float() > str(), sort order is undefined for incomparable objects". I think this is because I have a lot of different types of data (datetimes, floats, strings, etc.). Does value_counts still work with non-int data types? – ale19 Sep 15 '15 at 17:01
  • For what it's worth, I also get a TypeError ("cannot do label indexing on <class 'pandas.core.index.Index'> with these indexers [54.263484954833984] of <class 'float'>", 'occurred at index CRASH_ID') – ale19 Sep 15 '15 at 17:11
  • I have simplified my answer to suppress an unnecessary call to `transpose`. Maybe it caused the problem. Try the new version and tell me. – Romain Sep 15 '15 at 20:26
  • Sorry, it still doesn't seem to be working when I run it with the whole dataset. I get the same errors. I think maybe it is running into issues with the data-- it doesn't seem to be able to read the index of floats or datetime data? I have posted the code, the errors, and a sample of the potentially problematic data here: http://pastebin.com/Qbe1d708 – ale19 Sep 16 '15 at 15:02
  • @Alice Thanks, with the original data I have reproduced the problem and made a correction (the answer has been updated accordingly): converting the index to string (`df.index = df.index.astype(str)`). The remaining warning indicates that values in the index cannot be sorted because it contains incomparable objects. – Romain Sep 16 '15 at 20:09
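
Following up on that last comment, one way to sidestep the incomparable-type warnings altogether is to stringify the whole frame before counting, so Timestamps, floats, and strings all become one comparable type (a sketch, assuming counts of the string representations are acceptable for the report; df_crash is the frame loaded from Access in the question):

# Cast every column to its string representation first; value_counts then
# never has to order Timestamps against ints or floats against strings.
counts = df_crash.astype(str).apply(lambda x: x.value_counts(sort=False))
print(counts)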