
my question is very similar to here: Find unique values in a Pandas dataframe, irrespective of row or column location

I am very new to coding, so I apologize for the cringing in advance.

I have a .csv file which I open as a pandas dataframe, and would like to be able to return unique values across the entire dataframe, as well as all unique strings.

I have tried:

for row in df:
    pd.unique(df.values.ravel())

This fails to iterate through rows.

The following code prints what I want:

for index, row in df.iterrows():
    if isinstance(row, object):
        print('%s\n%s' % (index, row))

However, trying to place these values into a previously defined set (myset = set()) fails when I hit a blank column (NoneType error):

for index, row in df.iterrows():
    if isinstance(row, object):
        myset.update(print('%s\n%s' % (index, row)))
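The error can be reproduced without pandas at all: `print()` always returns `None`, and `set.update()` requires an iterable, so the failure has nothing to do with blank columns:

```python
result = print('some row text')  # prints the text, but the call returns None
assert result is None

myset = set()
try:
    myset.update(result)  # update() needs an iterable, not None
except TypeError as exc:
    print(exc)  # 'NoneType' object is not iterable
```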

I get closest to what I want when I try the following:

for index, row in df.iterrows():
    if isinstance(row, object):
        myset.update('%s\n%s' % (index, row))

However, my set prints out a list of characters rather than the strings/floats/values that appear on my screen when I print above.
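That happens because `set.update()` treats a string as an iterable of characters, while `set.add()` inserts the whole string as a single element. A minimal illustration:

```python
chars = set()
chars.update('3.14')  # update() iterates the string character by character
print(sorted(chars))  # ['.', '1', '3', '4']

whole = set()
whole.add('3.14')     # add() inserts the string as one element
print(whole)          # {'3.14'}
```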

Someone please help point out where I fail miserably at this task. Thanks!

  • The reason the first line: `for row in df:` fails is because this returns the columns rather than the rows – EdChum Jan 16 '15 at 18:54
  • Yes, thanks, but when I try `for index, row in df.iterrows(): pd.unique(df.values.ravel())` I get a first row that repeats (apparently infinitely) without ever seeming to progress to printing values of additional row indices. – Nerfail Jan 16 '15 at 20:29
  • What you are doing in that case is trying to get a flattened unique set of values for the whole df. What are you ultimately trying to achieve, for instance you can get an array of all the unique values by doing `pd.unique(df.values)` – EdChum Jan 16 '15 at 20:33
  • Unfortunately, that gives me the following error: `ValueError: could not broadcast input array from shape (36634,18) into shape (36634)` (raised from `pandas/core/algorithms.py` in `unique`, via `com._asarray_tuplesafe`). However, your suggestion is very similar to one here in my next comment below (space issues) – Nerfail Jan 16 '15 at 22:08
  • http://stackoverflow.com/questions/26492270/is-there-a-memory-efficient-way-to-replace-a-list-of-values-in-a-pandas-datafram `unique_string_list = pd.unique(df.values.ravel()).tolist()` Upon entering `unique_string_list` after this line, I appear to get a list of unique values, but I think I am running out of memory shortly after this. What I want is a list of all unique values in the dataframe, and generally all unique strings to pare this to a manageable size. My ultimate use for these values is to look at correlations among subgroupings of data and flag groupings of interest. Ran out of space. – Nerfail Jan 16 '15 at 22:13
  • I'm not sure you will be able to get the data in that way, you could create a dict of the unique values for each col like so :`vals={} for col in df: vals[col] = df[col].unique()` would give you a dict of all the unique values for each column – EdChum Jan 16 '15 at 22:22

1 Answer


I think the following should work for almost any dataframe. It will extract every value that is unique across the entire dataframe.

Post a comment if you encounter a problem, and I'll try to solve it.

# Replace all Nones / NaNs with an empty string - so they won't bother us later
df = df.fillna('')

# Preparing a list
list_sets = []

# Iterate over all columns (much faster than iterating rows)
for col in df.columns:
    # List containing all the unique values of this column
    this_set = list(set(df[col].values))
    # Creating a combined list
    list_sets = list_sets + this_set

# Deduplicating the combined list
final_set = set(list_sets)

# For completion's sake, remove the empty string introduced by the fillna step
# (discard, unlike remove, does not raise if there were no missing values)
final_set.discard('')
final_set = list(final_set)

Edit:

I think I know what happens. You must have some float columns, and `fillna` is failing on those, as the code I gave you was replacing missing values with an empty string. Try one of these:

  1. df = df.fillna(np.nan) or
  2. df = df.fillna(0)

For the first option, you'll need to import numpy first (`import numpy as np`). It must already be installed, since you have pandas.
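To make the fix concrete, here is a sketch of option 2 on a small made-up frame with one string column and one float column (the column names and values are just placeholders, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's CSV
df = pd.DataFrame({'name': ['x', None, 'y'],
                   'value': [1.5, np.nan, 2.5]})

df = df.fillna(0)  # option 2: works even on float columns

# Collect the unique values column by column, as in the answer above
final_set = set()
for col in df.columns:
    final_set.update(df[col].values)

print(final_set)  # unique values across the whole frame, with 0 standing in for NaN
```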

knightofni
  • This looks really good, but I am having a slight problem with the first line (and probably the last). I get "ValueError: could not convert string to float." – Nerfail Jan 18 '15 at 18:27
  • Try the update after the edit. The problem is due to your dataframe having some floats. – knightofni Jan 20 '15 at 02:05
  • Thank you! I should have been more specific... yes, I have floats in my dataframe. I was also able to filter out my returned values as strings, if desired, with the optional code following. `unique_value_list = pd.unique(df.values.ravel()).tolist() remove = [] for i in unique_value_list: if isinstance(i, (int, float)): remove.append(i) unique_string_list = list(set(unique_value_list) - set(remove))` – Nerfail Jan 27 '15 at 20:48