
All, I have successfully written a list comprehension that strips non-ASCII characters from one column in a dataframe.

I am now trying to write a nested list comprehension to check all of the columns in the dataframe.

I have researched this by searching for "nested list comprehensions dataframes" and several other variations, and while the results are close, I can't get them to fit my problem.

Here is my code:

import pandas as pd
import numpy as np

data = {'X1': ['A', 'B', 'C', 'D', 'E'], 
        'X2': ['meow', 'bark', 'moo', 'squeak', '120°']}

data2 = {'X1': ['A', 'B', 'F', 'D', 'E'], 
         'X3': ['cat', 'dog', 'frog', 'mouse®', 'chick']}

df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

dfAsc = pd.merge(df, df2, how ='inner', on = 'X1')
dfAsc['X2']=[row.encode('ascii', 'ignore').decode('ascii') for row in 
    dfAsc['X2'] if type(row) is str]
dfAsc

which correctly returns:

  X1      X2      X3
0  A    meow     cat
1  B    bark     dog
2  D  squeak  mouse®
3  E     120   chick

I have tried to create a nested comprehension to check all of the columns instead of just X2. The attempt below tries to create a new df that contains the answer. If it continues to cause confusion, I'll delete it, as it is only one of my attempts to obtain the answer; please don't get hung up on it.

df3 = pd.DataFrame([dfAsc.loc[idx]
                for idx in dfAsc.index
                [row.encode('ascii', 'ignore').decode('ascii') for row in 
                 dfAsc[idx] if type(row) is str]    
df3    

which doesn't work. I know I'm close, but I'm still having trouble getting my head around comprehensions.

Mick Hawkes
  • You need to use `.apply()` with a function inside, otherwise you are trying to make a list. Please see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html – Evgeny Jun 19 '18 at 04:55
  • What are you expecting from `df3 = pd.DataFrame(...)`? Please add it to the question. – aydow Jun 19 '18 at 04:56
  • As above, the expected `df3` is important to know too. – Evgeny Jun 19 '18 at 04:57
  • @Evgeny Pogrebnyak - If I can change one column, why can't I change all of them with a nested list comprehension? – Mick Hawkes Jun 19 '18 at 05:13
  • You can, as an exercise, but you lose the benefit of a dataframe by trying to go by index manually. There is a nice example by @Amey Dahale below. – Evgeny Jun 19 '18 at 05:54
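
As an exercise, the nested comprehension the question asks about could look roughly like the sketch below. It is not taken from any of the answers: the frame is rebuilt by hand to mirror the merged dfAsc, and the outer/inner layout (a dict comprehension over columns wrapping a list comprehension over values) is just one possible arrangement.

```python
import pandas as pd

# Hand-built copy of the merged frame from the question (an assumption,
# so that the sketch runs on its own).
dfAsc = pd.DataFrame({
    'X1': ['A', 'B', 'D', 'E'],
    'X2': ['meow', 'bark', 'squeak', '120°'],
    'X3': ['cat', 'dog', 'mouse®', 'chick'],
})

# Outer comprehension walks the columns, inner comprehension walks the values.
# Non-string values are passed through unchanged instead of being filtered out,
# so every column keeps its original length.
df3 = pd.DataFrame(
    {
        col: [
            val.encode('ascii', 'ignore').decode('ascii') if isinstance(val, str) else val
            for val in dfAsc[col]
        ]
        for col in dfAsc.columns
    },
    index=dfAsc.index,
)

print(df3)
```

Note the conditional expression in place of the trailing `if type(row) is str` filter: a filter drops non-string values, which would leave a column shorter than the index as soon as one non-string value appears.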

2 Answers


You don't need a list comprehension; you can use df.applymap directly. This will be a lot faster than using comprehensions.

import pandas as pd

data = {'X1': ['A', 'B', 'C', 'D', 'E'],
        'X2': ['meow', 'bark', 'moo', 'squeak', '120°']}

data2 = {'X1': ['A', 'B', 'F', 'D', 'E'], 
         'X3': ['cat', 'dog', 'frog', 'mouse®', 'chick']}

df1 = pd.DataFrame(data, index=data['X1'], columns=['X2'])
df2 = pd.DataFrame(data2, index=data2['X1'], columns=['X3'])

dfAsc = pd.merge(df1, df2, how ='inner', left_index=True, right_index=True)

dfAsc = dfAsc.applymap(lambda x: x.encode('ascii', 'ignore').decode('ascii') if isinstance(x, str) else x)

>>> dfAsc

       X2     X3
A    meow    cat
B    bark    dog
D  squeak  mouse
E     120  chick
Amey Dahale
  • Wow, thanks for that, but not all of my columns are strings. Will this still work without testing for type? – Mick Hawkes Jun 19 '18 at 05:52
  • You can always add a condition to the return value that checks whether it's a string. Edited the code to include the condition. – Amey Dahale Jun 19 '18 at 05:53
  • Usually `isinstance(s, str)` is better than an `==` type check. – Evgeny Jun 19 '18 at 05:55
  • Also, for learning purposes: you can define the function with `def` before the expression; it does not have to be a `lambda`. – Evgeny Jun 19 '18 at 05:57
  • @MickHawkes You can accept the answer if it best serves your purpose. – Amey Dahale Jun 19 '18 at 05:59
  • @Mick Hawkes: in a more realistic case each column would be of a single type and you would know which columns to apply the function to, so checking the type might not be necessary; the type usually does not change row by row. – Evgeny Jun 19 '18 at 05:59
  • @Evgeny - in my realistic case, the columns are all of different types; some are dates and one is integer. I like this solution as I can apply it to the whole dataframe. How would you apply a function to this? – Mick Hawkes Jun 19 '18 at 06:08
  • You are already applying a function; `lambda` is a way to define a function. – Evgeny Jun 19 '18 at 06:21

As a follow-up for comments:

def clean(x):
    try:
        return x.encode('ascii', 'ignore').decode('ascii')
    except AttributeError:
        return x

dfAsc = dfAsc.applymap(clean)

lambda is a common way of defining your transformation in .apply(), but def is often preferred for readability.
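
To see how a `def`-based cleaner copes with the mixed column types mentioned in the comments, here is a small sketch. The `mixed` frame and its column names are made up for illustration. The function is applied through column-wise `Series.map` rather than `applymap`, since `applymap` was deprecated in pandas 2.1 in favour of `DataFrame.map`, while `apply` plus `Series.map` should behave the same on older and newer versions.

```python
import pandas as pd

def clean(x):
    # Strings have .encode(); ints, dates, etc. raise AttributeError
    # and are returned untouched.
    try:
        return x.encode('ascii', 'ignore').decode('ascii')
    except AttributeError:
        return x

# Hypothetical frame with a text, an integer and a date column.
mixed = pd.DataFrame({
    'label': ['120°', 'mouse®'],
    'count': [1, 2],
    'when': pd.to_datetime(['2018-06-19', '2018-06-20']),
})

# Apply clean() to every cell, one column at a time.
cleaned = mixed.apply(lambda col: col.map(clean))

print(cleaned)
```

Only the text column should change here; the integer and date values pass through the `except` branch as they are.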

As for the type check: all the elements in the dfAsc dataframe are strings, including '120°' and, after cleaning, '120':

dfAsc.applymap(lambda x: isinstance(x, str))
#Out[37]: 
#     X1    X2    X3
#0  True  True  True
#1  True  True  True
#2  True  True  True
#3  True  True  True

On import with pd.read_csv(), the type can be set per column. Some diagnostics can be done with dfAsc.dtypes, and a column's type can be changed with the .astype() method.
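
A rough sketch of those diagnostics, assuming a tiny in-memory CSV (the `csv_text` contents and the column name `n` are invented for the example). It reads everything as text, inspects `dtypes`, converts one column with `.astype()`, and then cleans only the columns that still hold strings.

```python
import io

import pandas as pd
from pandas.api.types import is_string_dtype

# Made-up CSV content standing in for a real file.
csv_text = "X2,n\nmeow,1\n120°,2\n"

# dtype=str forces every column to be read as text.
frame = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(frame.dtypes)  # both columns hold strings at this point

# Convert the numeric-looking column to integers.
frame['n'] = frame['n'].astype(int)

# Pick out the columns that still contain strings and clean only those.
text_cols = [col for col in frame.columns if is_string_dtype(frame[col])]
for col in text_cols:
    frame[col] = frame[col].map(lambda s: s.encode('ascii', 'ignore').decode('ascii'))

print(frame)
```

Selecting columns by type up front avoids a per-cell type check, which is the point made in the comments about columns usually having a single type.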

Evgeny