4

Based on my limited knowledge of pandas, `pandas.Series.str.contains` can search for a specific string in a `pd.Series`. But what if the DataFrame is large and I just want to glance at all the kinds of string elements in it before I do anything?

Example like this:

pd.DataFrame({'x1':[1,2,3,'+'],'x2':[2,'a','c','this is']})
    x1  x2
0   1   2
1   2   a
2   3   c
3   +   this is

I need a function to return ['+','a','c','this is']

Garvey
  • 1,197
  • 3
  • 13
  • 26

4 Answers

3

There are two possible approaches, depending on whether numeric values saved as strings should be treated as strings or not.

To see the difference:

df = pd.DataFrame({'x1':[1,'2.78','3','+'],'x2':[2.8,'a','c','this is'], 'x3':[1,4,5,4]}) 
print (df)
     x1       x2  x3
0     1      2.8   1
1  2.78        a   4 <- 2.78 is a float saved as a string
2     3        c   5 <- 3 is an int saved as a string
3     +  this is   4

#flatten all values
ar = df.values.ravel()
#errors='coerce' makes pd.to_numeric return NaN for non-numeric values
L = np.unique(ar[np.isnan(pd.to_numeric(ar, errors='coerce'))]).tolist()
print (L)
['+', 'a', 'c', 'this is']
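A variant of the same idea that avoids `np.unique` (as discussed in the comments, it is not strictly necessary; it also sorts the result and fails on unhashable elements such as lists). A sketch using `pd.unique`, which keeps first-seen order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x1': [1, '2.78', '3', '+'],
                   'x2': [2.8, 'a', 'c', 'this is'],
                   'x3': [1, 4, 5, 4]})

# flatten all values into one 1-d array
ar = df.values.ravel()
# errors='coerce' turns every non-numeric entry into NaN
mask = np.isnan(pd.to_numeric(ar, errors='coerce'))
# pd.unique preserves order of appearance and does not sort
L = pd.unique(ar[mask]).tolist()
print(L)  # ['a', 'c', '+', 'this is']
```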

Another solution is to use a custom function that checks whether a value can be converted to float:

def is_not_float_try(value):
    try:
        float(value)
        return False
    except ValueError:
        return True

s = df.stack()
L = s[s.apply(is_not_float_try)].unique().tolist()
print (L)
['a', 'c', '+', 'this is']

If you need all values saved as strings, use isinstance:

s = df.stack()
L = s[s.apply(lambda x: isinstance(x, str))].unique().tolist()
print (L)
['2.78', 'a', '3', 'c', '+', 'this is']
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This is the best way, IMHO – Ami Tavory Apr 10 '18 at 06:04
  • @AmiTavory - Thank you. – jezrael Apr 10 '18 at 06:05
  • It's elegant. I have used `df.apply(lambda x:pd.to_numeric(x,errors='ignore'))` to transform numeric-like strings such as `'1.23'` to `1.23`, so I can tell your function would work on this example. But `np.unique()` may fail if there are list elements in the DataFrame. I will vote for you. – Garvey Apr 10 '18 at 06:24
  • @Garvey - Thank you. `np.unique` can be omitted, it is not necessary. – jezrael Apr 10 '18 at 06:26
  • @Garvey - I am wondering, if you use `errors='ignore'`, how is it possible to check for numeric values? With `errors='coerce'` it creates `NaN`s, which can be checked. – jezrael Apr 10 '18 at 06:28
  • 1
    @jezrael Well, I set `errors='ignore'` on purpose to convert something like `'1.23'` to `1.23` first, which is irrelevant to this question. After that, I realized it was necessary to glance at what kinds of strings were still contained in the DataFrame. – Garvey Apr 10 '18 at 06:35
3

If you are looking strictly for string values and performance is not a concern, then this is a very simple answer.

df.where(df.applymap(type).eq(str)).stack().tolist()

['a', 'c', '+', 'this is']
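Note that `applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`; a sketch of the same approach that works on either version:

```python
import pandas as pd

df = pd.DataFrame({'x1': [1, 2, 3, '+'], 'x2': [2, 'a', 'c', 'this is']})

# DataFrame.map (pandas >= 2.1) replaces the deprecated applymap;
# fall back to applymap on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
L = df.where(elementwise(type).eq(str)).stack().tolist()
print(L)  # ['a', 'c', '+', 'this is']
```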
piRSquared
  • 285,575
  • 57
  • 475
  • 624
2

You can use `str.isdigit` with `unstack`:

df[df.apply(lambda x : x.str.isdigit()).eq(0)].unstack().dropna().tolist()
Out[242]: ['+', 'a', 'c', 'this is']
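One caveat, shown on a hypothetical frame that contains a float saved as a string: `'2.78'.isdigit()` is False (the dot is not a digit), so float-like strings are reported alongside the true text values:

```python
import pandas as pd

df = pd.DataFrame({'x1': [1, '2.78', '3', '+'],
                   'x2': [2.8, 'a', 'c', 'this is']})

# '2.78'.isdigit() is False, so it survives the "non-digit" filter
out = df[df.apply(lambda x: x.str.isdigit()).eq(0)].unstack().dropna().tolist()
print(out)  # ['2.78', '+', 'a', 'c', 'this is']
```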
BENY
  • 317,841
  • 20
  • 164
  • 234
2

Using regular expressions and set union, you could try something like:

>>> set.union(*[set(df[c][~df[c].str.findall('[^\d]+').isnull()].unique()) for c in df.columns])
{'+', 'a', 'c', 'this is'}

If you use a regular expression for numbers in general, you could filter out floating-point numbers as well.
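For example, with a sign- and decimal-aware pattern (an assumed pattern that does not cover scientific notation), numeric-looking strings such as `'2.78'` are filtered out too:

```python
import re
import pandas as pd

df = pd.DataFrame({'x1': [1, '2.78', '3', '+'],
                   'x2': [2.8, 'a', 'c', 'this is']})

# optional sign, digits, optional fractional part; no exponent support
num_re = re.compile(r'^[+-]?\d+(\.\d+)?$')

s = df.stack()
# keep strings that do NOT look like a number under this pattern
L = s[s.apply(lambda v: isinstance(v, str) and not num_re.match(v))].unique().tolist()
print(L)  # ['a', 'c', '+', 'this is']
```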

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185