1

I read an ARFF file from https://archive.ics.uci.edu/ml/machine-learning-databases/00426/ into a DataFrame like this:

from scipy.io import arff
import pandas as pd

data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])

df.head()

But every value in every column of my DataFrame has a b' prefix.

How can I remove it?

When I try this, it doesn't work either:

from scipy.io import arff
import pandas as pd

data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0].str.decode('utf-8'))

df.head()

It raises AttributeError: 'numpy.ndarray' object has no attribute 'str'. As you can see, .str.decode('utf-8') from "Removing b'' from string column in a pandas dataframe" didn't solve the problem.

This doesn't work either:

df.index = df.index.str.encode('utf-8')

As you can see, both the strings and the numbers are bytes objects.

french_fries
  • Does this help? https://stackoverflow.com/questions/606191/convert-bytes-to-a-string – Shimon Cohen Dec 16 '20 at 13:36
  • @ShimonCohen i don't think so – french_fries Dec 16 '20 at 14:02
  • Hey bro, I'm having the same problem, I also realized we are working on same dataset :) Have you figured out the problem? I understand that you may have forgotten about this considering it's been some time, however I would ask if you could contact me and perhaps send me source code of anything you worked on this so far? My email: vanjavk@hotmail.com Thank you! @french_fries – vanjavk Feb 10 '21 at 12:05

2 Answers

1

I was looking at the same dataset and had a similar issue. I found a workaround: rather than using from scipy.io import arff, I used another library called liac-arff. Install it with

pip install liac-arff

(or whichever pip command works for your operating system or IDE), and then load the file:

import arff
import pandas as pd

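# arff.loads() expects the ARFF text itself, so use arff.load() with an open file
with open('Autism-Adult-Data.arff') as f:
    data = arff.load(f)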

data is a dictionary. To see which keys it has, run

data.keys()

and you will find that every ARFF file loaded this way has the following keys:

['description', 'relation', 'attributes', 'data']

Here data holds the actual rows, and attributes holds the column names together with each column's type or its allowed values. So to get a DataFrame, do the following:

# Each entry in data['attributes'] is a (name, type) pair
colnames = [attr[0] for attr in data['attributes']]

df = pd.DataFrame(data['data'], columns=colnames)
df.head()

I went a bit overboard with building the DataFrame, but this returns a DataFrame without any b' prefixes. The key is using import arff (liac-arff) instead of the scipy.io reader.

The GitHub repository for the library I used can be found here.

Yannis P.
Stainaz
0

Although Shimon shared a link in the comments, you could also give this a try:

df.apply(lambda x: x.str.decode('utf8'))
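
Note that .str.decode is only available on columns that actually hold bytes, so applying it to every column raises an error as soon as a numeric column is reached. A minimal sketch that decodes only the object-dtype columns and assigns the result back is shown below. The small structured array is a made-up stand-in for what scipy.io.arff.loadarff returns in data[0]; its field names and values are assumptions for illustration, not taken from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data[0] from scipy.io.arff.loadarff:
# a NumPy structured array whose nominal fields hold bytes.
records = np.array(
    [(b"1", 26.0, b"f"), (b"0", 24.0, b"m")],
    dtype=[("A1_Score", "S1"), ("age", "f8"), ("gender", "S1")],
)
df = pd.DataFrame(records)

# Pick only the object-dtype columns (the ones holding bytes);
# numeric columns are left untouched.
bytes_cols = df.select_dtypes(include=[object]).columns

# Decode each bytes column to str and write the result back into df.
df[bytes_cols] = df[bytes_cols].apply(lambda col: col.str.decode("utf-8"))

print(df.head())
```

With the real file, the same two decoding lines should apply after df = pd.DataFrame(data[0]), assuming the bytes columns come through as object dtype as they do here.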
mccandar