I read this post, "encoding a dataframe", but I do not want to encrypt the dataframe, just convert it to Base64. I import a carriage-return-delimited list of words into a dataframe with:

words = pd.read_table("sampleText.txt",names=['word'], header=None)
words.head()

that gives

    word
0   difference
1   where
2   mc
3   is
4   the

then

words['words_encoded'] = map(lambda x: x.encode('base64','strict'), words['word'])
print (words)

gave

                word                   words_encoded
0         difference  <map object at 0x7fad3e89e410>
1              where  <map object at 0x7fad3e89e410>
2                 mc  <map object at 0x7fad3e89e410>
3                 is  <map object at 0x7fad3e89e410>
4                the  <map object at 0x7fad3e89e410>
...              ...                             ...
999995  distribution  <map object at 0x7fad3e89e410>
999996            in  <map object at 0x7fad3e89e410>
999997      scenario  <map object at 0x7fad3e89e410>
999998          less  <map object at 0x7fad3e89e410>
999999          land  <map object at 0x7fad3e89e410>

[1000000 rows x 2 columns]

I don't understand why my encoded column refers to a map object and not the actual data, so I tried:

b64words = words.word.str.encode('base64')
print(b64words)

gives

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
          ..
999995   NaN
999996   NaN
999997   NaN
999998   NaN
999999   NaN
Name: word, Length: 1000000, dtype: float64

Well, that threw me, so I read the linked answer above and tried

import base64
def encode(text):
    return base64.b64encode(text)
words['Encoded_Column'] = [encode(x) for x in words]

but got

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-89-8cf5a6f1f3a9> in <module>
      2 def encode(text):
      3     return base64.b64encode(text)
----> 4 words['Encoded_Column'] = [encode(x) for x in words]

<ipython-input-89-8cf5a6f1f3a9> in <listcomp>(.0)
      2 def encode(text):
      3     return base64.b64encode(text)
----> 4 words['Encoded_Column'] = [encode(x) for x in words]

<ipython-input-89-8cf5a6f1f3a9> in encode(text)
      1 import base64
      2 def encode(text):
----> 3     return base64.b64encode(text)
      4 words['Encoded_Column'] = [encode(x) for x in words]

~/miniconda3/envs/p37cu10.2PyTo/lib/python3.7/base64.py in b64encode(s, altchars)
     56     application to e.g. generate url or filesystem safe Base64 strings.
     57     """
---> 58     encoded = binascii.b2a_base64(s, newline=False)
     59     if altchars is not None:
     60         assert len(altchars) == 2, repr(altchars)

TypeError: a bytes-like object is required, not 'str'

so I tried converting to a bytes-like object like so:

import base64
def encode(text):
    btext = text.str.encode('utf-8')
    return base64.b64encode(btext)
words['Encoded_Column'] = [encode(x) for x in words]

but got

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-90-46db6d3688ba> in <module>
      3     btext = text.str.encode('utf-8')
      4     return base64.b64encode(btext)
----> 5 words['Encoded_Column'] = [encode(x) for x in words]

<ipython-input-90-46db6d3688ba> in <listcomp>(.0)
      3     btext = text.str.encode('utf-8')
      4     return base64.b64encode(btext)
----> 5 words['Encoded_Column'] = [encode(x) for x in words]

<ipython-input-90-46db6d3688ba> in encode(text)
      1 import base64
      2 def encode(text):
----> 3     btext = text.str.encode('utf-8')
      4     return base64.b64encode(btext)
      5 words['Encoded_Column'] = [encode(x) for x in words]

AttributeError: 'str' object has no attribute 'str'

In this C example they also convert first to byte strings and then to Base64, but I cannot do this simple task in Python. I am falling down this rabbit hole and every attempt just gets me deeper. I really appreciate any help that a clear-minded person can give.

aquagremlin

2 Answers

map returns an iterator, not a list, so pandas simply assigned it to all of the slots in the newly formed "words_encoded" column. Similarly, if you did words['all_ones'] = 1, pandas would assign a 1 down that column.
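The lazy-iterator behavior is easy to check on a small frame. This is only a sketch: the five-word DataFrame stands in for the real file, and the helper name to_b64 is made up for illustration.

```python
import base64
import pandas as pd

# Small stand-in for the question's data (assumed sample, not the real file).
words = pd.DataFrame({"word": ["difference", "where", "mc", "is", "the"]})


def to_b64(text):
    """Encode text as UTF-8 bytes, then Base64, and return an ASCII str."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


lazy = map(to_b64, words["word"])  # a lazy iterator; nothing is computed yet
print(type(lazy).__name__)         # map

# Materialize the iterator so pandas receives one value per row.
words["words_encoded"] = list(lazy)
print(words)
```

Wrapping the map call in list() (or using a list comprehension) hands pandas a sequence with one element per row instead of a single opaque object.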

Secondly, "base64" isn't a codec for strings, it works on bytes. You have to choose a text encoding and then encode that. So,

words['word_encoded'] = words.word.str.encode(
    'utf-8', 'strict').str.encode('base64')

works except that this encoder puts a "\n" on the end of the base64 string, which I find odd. Instead, you can do one of the following

words['word_encoded'] = words.word.str.encode(
    'utf-8', 'strict').apply(
         base64.b64encode)

# or 

words['word_encoded'] = [base64.b64encode(x.encode('utf-8', 'strict'))
    for x in words.word]

Personally I think the first one is a bit more "pandas" as it generates the Series directly without an intermediate list.
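One thing to watch for with a large word list: pandas' readers treat tokens such as "null", "NA" and "nan" as missing values by default and store them as float NaN, which has no .encode method. Whether that applies to any particular file is an assumption; the three-word sample below is hypothetical and only illustrates the mechanism, using read_csv with keep_default_na=False to keep such tokens as plain strings.

```python
import base64
import io
import pandas as pd

# Hypothetical three-word sample; "null" is one of pandas' default NA markers.
sample = io.StringIO("difference\nnull\nthe\n")

parsed = pd.read_csv(sample, names=["word"], header=None)
print(parsed["word"].isna().tolist())  # [False, True, False]

# keep_default_na=False leaves such tokens as ordinary strings.
sample.seek(0)
words = pd.read_csv(sample, names=["word"], header=None, keep_default_na=False)

# Every value is now a str, so the encode step applies to each row.
words["word_encoded"] = [
    base64.b64encode(w.encode("utf-8")) for w in words["word"]
]
print(words)
```

Checking words.word.isna().sum() on the real data would show whether any rows were read as missing.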

The solution in action

>>> import base64
>>> import pandas as pd
>>> words = pd.read_table("sampleText.txt",names=['word'], header=None)
__main__:1: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
>>> words['word_encoded'] = words.word.str.encode(
...     'utf-8', 'strict').str.encode('base64')
>>> 
>>> words
         word           word_encoded
0  difference  b'ZGlmZmVyZW5jZQ==\n'
1       where          b'd2hlcmU=\n'
2          mc              b'bWM=\n'
3          is              b'aXM=\n'
4         the              b'dGhl\n'
>>> 
>>> words['word_encoded'] = words.word.str.encode(
...     'utf-8', 'strict').apply(
...          base64.b64encode)
>>> 
>>> words
         word         word_encoded
0  difference  b'ZGlmZmVyZW5jZQ=='
1       where          b'd2hlcmU='
2          mc              b'bWM='
3          is              b'aXM='
4         the              b'dGhl'
>>> 
>>> words['word_encoded'] = [base64.b64encode(x.encode('utf-8', 'strict'))
...     for x in words.word]
>>> 
>>> words
         word         word_encoded
0  difference  b'ZGlmZmVyZW5jZQ=='
1       where          b'd2hlcmU='
2          mc              b'bWM='
3          is              b'aXM='
4         the              b'dGhl'
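The trailing "\n" in the first result comes from the 'base64' bytes-to-bytes codec itself, not from pandas. A standard-library-only sketch of the difference (the variable names are just for illustration):

```python
import base64
import codecs

raw = "the".encode("utf-8")               # str -> bytes

via_codec = codecs.encode(raw, "base64")  # bytes-to-bytes codec, appends "\n"
via_module = base64.b64encode(raw)        # no trailing newline

print(via_codec)   # b'dGhl\n'
print(via_module)  # b'dGhl'

# Decode to ASCII if a plain str column is wanted instead of bytes.
as_text = via_module.decode("ascii")
print(as_text)     # dGhl
```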
tdelaney
  • thank you but no dice. If you use the dataframe as I defined it above, then your first solution (with '.apply') gives 'TypeError: a bytes-like object is required, not 'float'' and the next solution gives 'AttributeError: 'float' object has no attribute 'encode'' – aquagremlin Feb 17 '20 at 02:10
  • @aquagremlin - works for me (see the updated answer). I don't know where the floats come in. I'm using python 3.7.3 and pandas 0.24.2. – tdelaney Feb 17 '20 at 02:32
  • I still get this error in my notebook (python 3.76, pandas 0.23.4) – aquagremlin Feb 17 '20 at 02:51
  • `--------------------------------------------------------------------------- TypeError Traceback (most recent call last) in ----> 1 words['word_encoded'] = words.word.str.encode('utf-8', 'strict').str.encode('base64') ` – aquagremlin Feb 17 '20 at 02:53
  • ` ~/miniconda3/envs/p37cu10.2PyTo/lib/python3.7/site-packages/pandas/core/strings.py in wrapper(self, *args, **kwargs) 1949 f"inferred dtype '{self._inferred_dtype}'." 1950 ) -> 1951 raise TypeError(msg) 1952 return func(self, *args, **kwargs) 1953 TypeError: Cannot use .str.encode with values of inferred dtype 'bytes'.` – aquagremlin Feb 17 '20 at 02:53
  • but in MS azure online it's fine. WOW. – aquagremlin Feb 17 '20 at 02:54
  • https://2018dataaccess-topjetboy.notebooks.azure.com/j/notebooks/Untitled.ipynb?kernel_name=python36 – aquagremlin Feb 17 '20 at 02:54
  • That is odd! Pandas inferred bytes? I have no explanation for that. – tdelaney Feb 17 '20 at 02:57
  • I got it to work with pandas v0.24.2. Very bizarre. Apparently pandas 0.25 had new behavior https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html - The .str-accessor performs stricter type checks – aquagremlin Feb 17 '20 at 03:33
  • I wonder if `pd.read_csv(.... encoding="utf-8")` at the beginning would help. – tdelaney Feb 17 '20 at 04:31

Simply delete .str from the function body. Corrected code:

import base64


def encode(text):
    btext = text.encode('utf-8')
    return base64.b64encode(btext)


words = {'1': 1, '2': 2, '3': 3, 'asdasd': 4}
words['Encoded_Column'] = [encode(x) for x in words]
print(words)

Its output is:

{'1': 1, '2': 2, '3': 3, 'asdasd': 4, 'Encoded_Column': [b'MQ==', b'Mg==', b'Mw==', b'YXNkYXNk']}
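Note that the example above uses a plain dict. With a DataFrame, `for x in words` iterates over the column labels rather than the rows, so the resulting list has one entry per column and cannot be assigned as a new column of a longer frame. A minimal sketch with a hypothetical three-row frame, iterating the column instead:

```python
import base64
import pandas as pd

# Hypothetical three-row stand-in for the question's DataFrame.
words = pd.DataFrame({"word": ["difference", "where", "mc"]})

# Iterating a DataFrame yields its column labels, not its rows.
print(list(words))  # ['word']

# Iterate the column itself to get one encoded value per row.
words["Encoded_Column"] = [
    base64.b64encode(x.encode("utf-8")) for x in words["word"]
]
print(words)
```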
Ae_Mc
  • nice try but it does not work if you use the dataframe as I defined it above. I get the error ' ValueError: Length of values does not match length of index' – aquagremlin Feb 17 '20 at 02:06
  • This happens because encoding deletes the header from the dataframe. – aquagremlin Feb 17 '20 at 02:07
  • Actually I don't know why it's happening. Issuing words.head() gives the same output before as well as after your code. – aquagremlin Feb 17 '20 at 02:19