Python - Efficiently find the set of all characters in a pandas DataFrame?

Question

I want to find the set of all unique characters contained within a pandas DataFrame. One solution that works is given below:

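from functools import reduce  # builtin in Python 2, must be imported in Python 3
from operator import add
# str replaces Python 2's unicode; reduce(add, ...) concatenates every cell into one string
set(reduce(add, map(str, df.values.flatten())))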

However, the solution above takes a long time with large DataFrames. What are more efficient ways of doing this?

I am trying to find all unique characters in a pandas DataFrame so I can choose an appropriate delimiter when writing the DataFrame to disk as a csv.

Why not just let Pandas handle writing the DataFrame to a CSV file (`to_csv()`)? No need to choose the delimiter yourself - Pandas handles everything properly. — Alex Riley, Jul 08 '15 at 14:52
Yeah, to expand on previous comment, exactly what problem are you trying to solve? Even if there is a comma inside a string, it shouldn't cause a problem since it will be output inside of quotes. — JohnE, Jul 08 '15 at 18:44
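The point made in the two comments above can be checked with a small round trip. This is only a sketch: the DataFrame and its column names are made up for illustration, and it relies on the default quoting behaviour of `to_csv`, which quotes a field when it contains the delimiter.

```python
import io

import pandas as pd

# Illustrative data: one cell contains the default delimiter (a comma).
df = pd.DataFrame({"text": ["a,b", "plain"], "n": [1, 2]})

# With no path given, to_csv returns the CSV as a string.
csv_text = df.to_csv(index=False)

# Reading the text back should recover the original cell contents.
round_tripped = pd.read_csv(io.StringIO(csv_text))
```

Here the cell `a,b` is written with surrounding quotes, so the comma inside it is not mistaken for a field separator when the file is read back.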

score 0 · Answer 1 · edited May 23 '17 at 12:18

Learned this from Jeff here

This should be doable using Pandas built-ins:

import numpy as np
import pandas as pd

a = pd.DataFrame(data=np.random.randint(0, 100000, (1000000, 20)))

# now pull out unique values (less than a second for 2E7 data points)
b = pd.unique(a.values.ravel())

answered Jul 08 '15 at 17:41

tnknepp

I don't think that is what is asked for, as the poster seems to be looking for individual characters, not overall values. E.g. 1.0 has a 1, a 0, and a period. – JohnE Jul 08 '15 at 18:46
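One way to reconcile the answer with the comment above is to deduplicate whole values first and only then split them into characters. The sketch below is an assumption about how that could look, not the answerer's code: `unique_chars` is a hypothetical helper name, and it uses `str()` on each value, which may format numbers differently from `to_csv`.

```python
import pandas as pd


def unique_chars(df: pd.DataFrame) -> set:
    """Return the set of characters in the string form of any cell (hypothetical helper)."""
    # Deduplicate whole cell values first, so each distinct value is
    # converted to a string only once.
    distinct_values = pd.unique(df.to_numpy().ravel())
    # A single join followed by set() avoids repeated string concatenation.
    return set("".join(map(str, distinct_values)))


# Small illustrative frame mixing floats and strings.
df = pd.DataFrame({"a": [1.0, 2.5, 1.0], "b": ["x,y", "z", "x,y"]})
chars = unique_chars(df)
```

The deduplication step helps most when the data contains many repeated values; if nearly every cell is distinct, it adds little over joining the cells directly.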

kayoz · Answer 2 · 2017-05-26T08:53:09.773

I realise this is an old question, but I was looking for the same thing and thought I'd share for anyone else looking.

This can be done very quickly with Counter.

Use unstack() to flatten the DataFrame into a single Series holding all of its values, join those values into one string, and pass it to Counter. The result even has a count of each character.

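from collections import Counter
import pandas as pd

# rands_array(nchars, size) builds random test strings; pd.util.testing is deprecated since pandas 1.0
df = pd.DataFrame({'A': pd.util.testing.rands_array(100, 100000),
                 'B': pd.util.testing.rands_array(100, 100000)})
# flatten to one Series, join every cell into a single string, then count each character
Counter(''.join(df.unstack().values))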

Timings:

%timeit Counter(''.join(df.unstack().values))
1.1 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
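For the delimiter use case from the question, the character counts can be checked against a shortlist of candidates. This is a rough sketch under some assumptions: the `CANDIDATES` list and the sample data are invented for illustration, and `astype(str)` is added so that non-string columns can be joined too, which the snippet above does not handle.

```python
from collections import Counter

import pandas as pd

# Hypothetical shortlist of delimiters, in order of preference.
CANDIDATES = [",", ";", "\t", "|"]

# Illustrative frame with one string column and one numeric column.
df = pd.DataFrame({"A": ["a,b", "c;d"], "B": [1.5, 2.0]})

# astype(str) lets non-string cells take part in the join.
counts = Counter("".join(df.astype(str).to_numpy().ravel()))

# Keep only the candidates that never occur in the data.
unused = [c for c in CANDIDATES if c not in counts]
delimiter = unused[0] if unused else None
```

If every candidate occurs in the data, `delimiter` ends up as `None`, in which case relying on the quoting done by `to_csv` is probably the simpler route.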
