I have a pandas dataframe with a column whose elements are byte strings, e.g. b'hey'.

When I write this dataframe to a CSV file and read it back, pandas returns a string of the form "b'hey'". This is a problem because tf.data.Dataset.from_tensor_slices converts the string to a byte string again, giving b"b'hey'". Specifying the dtype when reading the CSV with dtype = {"COLUMN_NAME": bytes} didn't do anything.
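
A minimal sketch that reproduces the round trip described above, assuming an in-memory buffer in place of a real file and an illustrative column name `text` (both are assumptions, not part of the original setup). TensorFlow is not imported; the final `.encode` call only mimics the str-to-bytes conversion the question describes.

```python
import io

import pandas as pd

# Illustrative dataframe with bytes objects in one column.
df = pd.DataFrame({"text": [b"hey", b"you"]})

# Write to an in-memory CSV buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back: the column now holds str objects, not bytes.
restored = pd.read_csv(buffer)
first = restored["text"].iloc[0]

print(type(first), repr(first))

# Encoding that str to bytes again yields the doubly wrapped value.
print(repr(first.encode("utf-8")))
```

The CSV writer stores the textual representation of each bytes object, so the `b'...'` wrapper ends up as part of the data, and `dtype` has nothing to parse it back with.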

Does anyone have a solution that avoids manually editing the string to remove the b?

Quasi
    Does this answer your question? [How to translate "bytes" objects into literal strings in pandas Dataframe, Python3.x?](https://stackoverflow.com/questions/40389764/how-to-translate-bytes-objects-into-literal-strings-in-pandas-dataframe-pytho) – RJ Adriaansen Nov 12 '21 at 20:21

1 Answer

The solution is to apply ast.literal_eval first, which parses the "b'hey'" string back into a bytes object, and then decode the result with 'utf-8'.

To read the file and convert a whole column of byte strings:

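import ast
import pandas as pd

# <YOUR_DATA_FILE> is a placeholder for the path to your file; sep='\t' assumes tab-separated data
df = pd.read_csv(<YOUR_DATA_FILE>, sep='\t')
# Parse each "b'...'" literal back into bytes, then decode to str (assumes the column is named 'text')
df['text'] = df['text'].apply(lambda x: ast.literal_eval(x).decode("utf-8"))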
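
As a self-contained check of this approach, here is a small sketch that uses an in-memory CSV string in place of a real file (the sample data and the column name `text` are assumptions for illustration). It also shows a variant that passes ast.literal_eval through the `converters` parameter of pd.read_csv, which may be preferable if bytes objects rather than decoded strings are wanted.

```python
import ast
import io

import pandas as pd

# Hypothetical CSV content, as produced by writing a bytes column with to_csv.
csv_text = "text\nb'hey'\nb'you'\n"

# Variant 1: parse while reading, so the column holds bytes objects again.
as_bytes = pd.read_csv(
    io.StringIO(csv_text),
    converters={"text": ast.literal_eval},
)

# Variant 2: read as str first, then parse and decode to plain strings.
as_str = pd.read_csv(io.StringIO(csv_text))
as_str["text"] = as_str["text"].apply(
    lambda x: ast.literal_eval(x).decode("utf-8")
)

print(as_bytes["text"].tolist())
print(as_str["text"].tolist())
```

ast.literal_eval only evaluates Python literals, so it is a safer choice here than eval, though it will raise an error on cells that are not valid literals (for example, missing values).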
user3786340