3

I want to find the set of all unique characters contained within a pandas DataFrame. One solution that works is given below:

from operator import add
set(reduce(add, map(unicode, df.values.flatten())))

However, the solution above takes a long time with large DataFrames. What are more efficient ways of doing this?

I am trying to find all unique characters in a pandas DataFrame so that I can choose an appropriate delimiter when writing the DataFrame to disk as a CSV file.

Eriks Dobelis
applecider
  • 1
    Why not just let Pandas handle writing the DataFrame to a CSV file (`to_csv()`)? No need to choose the delimiter yourself - Pandas handles everything properly. – Alex Riley Jul 08 '15 at 14:52
  • Yeah, to expand on previous comment, exactly what problem are you trying to solve? Even if there is a comma inside a string, it shouldn't cause a problem since it will be output inside of quotes. – JohnE Jul 08 '15 at 18:44

2 Answers

0

Learned this from Jeff here

This should be doable using Pandas built-ins:

import numpy as np
import pandas as pd

a = pd.DataFrame(data=np.random.randint(0, 100000, (1000000, 20)))

# now pull out unique values (less than a second for 2E7 data points)
b = pd.unique(a.values.ravel())
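
If individual characters rather than whole values are wanted, one possible follow-up is to convert the unique values to strings and collect their characters. This is only a minimal sketch on a small made-up frame (the data below is hypothetical, not the benchmark frame above), and it assumes that `str()` of each value matches how the value would be written out:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data)
a = pd.DataFrame(np.array([[10, 25], [25, 307]]))

# Deduplicate whole values first, then collect the characters
# that appear in their string representations
unique_values = pd.unique(a.to_numpy().ravel())
unique_chars = set(''.join(map(str, unique_values)))
print(sorted(unique_chars))
```

Deduplicating first keeps the string that has to be scanned short when the frame contains many repeated values.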
Community
tnknepp
  • 2
    I don't think that is what was asked for, as the poster seems to be looking for individual characters, not overall values. E.g. 1.0 has a 1, a 0, and a period. – JohnE Jul 08 '15 at 18:46
0

I realise this is an old question, but I was looking for the same thing and thought I'd share for anyone else looking.

This can be done very quickly with Counter.

Use unstack() to flatten the DataFrame into a single Series of all its values, then join them into one string. The resulting Counter even has a count of each character.

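from collections import Counter
import pandas as pd

# rands_array builds random strings; pd.util.testing is deprecated since pandas 1.0
df = pd.DataFrame({'A': pd.util.testing.rands_array(100, 100000),
                 'B': pd.util.testing.rands_array(100, 100000)})
Counter(''.join(df.unstack().values))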

Timings:

%timeit Counter(''.join(df.unstack().values))
1.1 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
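
The join above assumes every cell is already a string. For a frame with mixed dtypes, a rough sketch of the same idea, extended to the delimiter use case from the question, could look like the following. The example data and the `candidates` list are assumptions made for illustration, and `astype(str)` may format numbers slightly differently from `to_csv()`:

```python
import pandas as pd

# Hypothetical frame mixing string and numeric columns
df = pd.DataFrame({'A': ['a,b', 'c;d'], 'B': [1.5, 2.0]})

# Cast every cell to str so numeric columns contribute their digits and '.'
cells = df.astype(str).to_numpy().ravel()
used_chars = set(''.join(cells))

# Candidate delimiters (an assumed list); keep those that never occur in the data
candidates = [',', ';', '\t', '|']
free = [c for c in candidates if c not in used_chars]
print(free)
```

Any entry in `free` could then be passed as `sep` to `to_csv()`.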
kayoz