Iterate over two dataframes' columns and str.encode in utf8

Question

I'm currently running Python 2.7 and have two dataframes, x and y. I would like to use some sort of list comprehension to iterate over the columns of both and call str.encode('UTF8') on each object column to get rid of unicode.

The following works perfectly fine and is easy to read, but I wanted to try something faster and more efficient.

for col in y:
  if y[col].dtype=='O':
    y[col] = y[col].str.encode("utf-8")

for col in x:
  if x[col].dtype=='O':
    x[col] = x[col].str.encode("utf-8")

Other methods I have tried:

1.)[y[col].str.encode("utf-8") for col in y if y[col].dtype=='O' ]

2.)y.columns= [( y[col].str.encode("utf-8") if y[col].dtype=='O' else y[col]) for col in y ]

3.)y.apply(lambda x : (y[col].str.encode("utf-8") for col in y if y[col].dtype=='O'))

I am getting ValueError and length mismatch errors for 2.) and 3.)

Write the list comprehension as a normal `for` loop so the error may show which part of the code causes the problem, and add `print()` inside the loop to see the values of the variables. You could also use `len()` to check where the length mismatch comes from. – furas, Apr 07 '19 at 19:43
I don't understand two things: (1) why do you assign the result to `y.columns` instead of `y[col]`? (2) `apply()` gives you `x` but you don't convert it; in place of the single value `x` you try to put `y[col]`, or rather a generator `(...)`. – furas, Apr 07 '19 at 19:52
In `apply()` you use `(y[col] ... )`, and this code creates a generator. – furas, Apr 07 '19 at 19:55
`y.apply(lambda y : (y.str.encode("utf-8") for col in y if y.dtype=='O'))` Does this seem reasonable? – TH14, Apr 07 '19 at 20:01
`DataFrame.apply()` passes one column at a time to the lambda (as `x`), not the whole dataframe, so there is no need to loop over `y` inside it. See the code in @coldspeed's answer: `u.apply(lambda x: x.str.encode('utf-8'))` – furas, Apr 07 '19 at 20:05
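
To make the point from the comments concrete, here is a minimal sketch of attempt 3.) rewritten so that the function works on the column it receives. The sample frame and the helper name `encode_if_object` are made up for illustration, and the columns are given an explicit object dtype so the dtype check behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical sample frame: one object column of unicode text, one integer column.
y = pd.DataFrame({
    'Name': pd.Series([u'a', u'b'], dtype=object),
    'n': [1, 2],
})

def encode_if_object(col):
    # DataFrame.apply (default axis=0) calls this once per column;
    # `col` is a Series holding that whole column.
    if col.dtype == 'O':
        return col.str.encode('utf-8')
    # Non-object columns are returned unchanged.
    return col

result = y.apply(encode_if_object)
```

Returning a Series per column (rather than a generator) is what lets `apply` assemble a dataframe of the same shape, which avoids the length mismatch.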

score 4 · Accepted Answer · answered Apr 07 '19 at 19:52

You can use select_dtypes to get object columns, then call apply over each column to encode it:

u = df.select_dtypes(include=[object])
df[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))

Write a small function to do this and call it for each dataframe.

def encode_df(df):
    u = df.select_dtypes(include=[object])
    df[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
    return df

x, y = encode_df(x), encode_df(y)
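
For reference, a self-contained run of the function above might look like the following. The two sample frames are invented for illustration (the question does not show its data), and the text columns are created with an explicit object dtype so that `select_dtypes(include=[object])` picks them up regardless of pandas version.

```python
import pandas as pd

def encode_df(df):
    # Select only the object-dtype columns; numeric columns are left alone.
    u = df.select_dtypes(include=[object])
    # Encode every value in each selected column to UTF-8 and write it back.
    df[u.columns] = u.apply(lambda col: col.str.encode('utf-8'))
    return df

# Hypothetical sample data standing in for the question's x and y.
x = pd.DataFrame({
    'Name': pd.Series([u'caf\xe9', u'tea'], dtype=object),
    'qty': [1, 2],
})
y = pd.DataFrame({
    'Type': pd.Series([u'hot', u'cold'], dtype=object),
    'price': [2.5, 3.0],
})

x, y = encode_df(x), encode_df(y)
```

Note that `encode_df` modifies the frame it is given and also returns it, so the reassignment is a convenience rather than a requirement.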

cs95

Even after applying both of those methods, I still see this for all columns: `Index([u'Name', u'Type'], dtype='object')` – TH14 Apr 07 '19 at 21:21
@TH14 You have not shown your data, so there is nothing further I can do to help you. I suggest opening a new question and explaining why the current answer does not work. – cs95 Apr 07 '19 at 21:22
what exactly are the implications of the unicode "u" in column names? – TH14 Apr 07 '19 at 21:23
@TH14 It's a Python 2 thing, where strings could be either `str` or `unicode`. Nowadays (Python 3), strings have a single type: `str`. – cs95 Apr 07 '19 at 21:29
oh okay got it, thank you for your help! I'm not used to dealing with unicode issues since I've only ever worked with Python 3. – TH14 Apr 07 '19 at 21:32
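
One possible reading of the `Index([u'Name', u'Type'], ...)` output is that the answer's code encodes the cell values only, so the column labels stay as they were. The sketch below illustrates that with made-up data; the variable names `labels_before` and `labels_after` are just for the demonstration, and whether byte-string labels are actually wanted is an assumption.

```python
import pandas as pd

# Hypothetical frame with unicode labels and unicode values.
df = pd.DataFrame({
    u'Name': pd.Series([u'a', u'b'], dtype=object),
    u'Type': pd.Series([u'x', u'y'], dtype=object),
})

# Encoding the values of a column does not touch the column labels.
df[u'Name'] = df[u'Name'].str.encode('utf-8')
labels_before = list(df.columns)

# The labels have to be encoded separately if that is really needed.
df.columns = [label.encode('utf-8') for label in df.columns]
labels_after = list(df.columns)
```

On Python 2 the encoded labels print without the `u` prefix; on Python 3 they become `bytes` objects, which is rarely what you want for column names.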

score 0 · Answer 2 · answered Apr 07 '19 at 20:24

Use this:

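import pandas as pd

# Small example frame with two integer columns
df = pd.DataFrame({'a':[1,2,3,4], 'b':[11,12,13,14]})

# Function to apply to every cell
def f(x):
    return x**2

# Rebuild the frame row by row, applying f to each value
pd.DataFrame([[f(i) for i in tuple(v)] for k,v in df.iterrows()], columns=df.columns)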

Akshay Sehgal

[Don't use `iterrows`](https://stackoverflow.com/a/55557758/4909087). You can directly iterate over `df.values` or `zip` just the two columns you need together. – cs95 Apr 07 '19 at 20:38
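
As a rough sketch of what the comment suggests, the same result can be built by iterating over `df.values`, or by zipping the two columns together. The frame and `f` are taken from the answer above; `squared` and `pairs` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [11, 12, 13, 14]})

def f(x):
    return x ** 2

# Iterate over the underlying NumPy array: each `row` is one row of values.
squared = pd.DataFrame([[f(i) for i in row] for row in df.values],
                       columns=df.columns)

# Alternatively, zip just the columns that are needed.
pairs = [(f(a), f(b)) for a, b in zip(df['a'], df['b'])]
```

Both forms avoid constructing a Series for every row, which is the main cost of `iterrows`.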
