
I need to filter some data, and I'm using a pandas DataFrame to do it.

Here is the code before my problem:

import pandas as pd
from scipy.io import arff  # needed for arff.loadarff below
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import LinearSVC
from sklearn.metrics import ConfusionMatrixDisplay

# Load the ARFF file and keep only the columns of interest
arquivo_arff = arff.loadarff(r"/content/Rice_MSC_Dataset.arff")
dados = pd.DataFrame(arquivo_arff[0])
dados = dados[['MINOR_AXIS', 'MAJOR_AXIS', 'CLASS']]

I've already made a scatter plot of all 5 classes with this code (no filtering):

sns.scatterplot(
    data=dados, 
    x="MINOR_AXIS", 
    y="MAJOR_AXIS", 
    hue="CLASS")
plt.show()

My problem: I need to filter only the classes b'Basmati' and b'Ipsala', but I'm unable to do that, and I don't know why.

The "CLASS" values are: b'Basmati', b'Arborio', b'Jasmine', b'Ipsala', b'Karacadag'

But in the ".arff" file that I used, the names are just "Basmati, Arborio, Jasmine, Ipsala, Karacadag"

What I've tried: filtering only these two classes with this code:

dados = dados[dados['CLASS'].isin(["" "b'Arborio'" "", "" "b'Ipsala'" ""])]

It didn't work. How can I fix this?

Lucio

1 Answer


Somewhere, the system has shown you that the CLASS values are b'Basmati', b'Arborio', b'Jasmine', b'Ipsala' and b'Karacadag'. However, this does not mean that the values are literally those characters inside a string. What you are seeing is the repr of each value. Apparently, these values are byte strings (bytes objects), whose literals are written by placing a b in front of the quotes, hence the strange-looking repr.
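A quick way to see the difference is the minimal sketch below. It uses a made-up value rather than the actual dataset, and assumes the CLASS entries really are bytes objects:

```python
# A hypothetical value standing in for one CLASS entry.
valor = b"Basmati"

# The repr shows the b prefix and quotes, but they are not part of the data.
texto_repr = repr(valor)
print(texto_repr)

# A str never compares equal to a bytes object, with or without the b'...' text.
igual_a_repr = valor == "b'Basmati'"
igual_a_str = valor == "Basmati"

# Only another bytes object (or the decoded text against a str) matches.
igual_a_bytes = valor == b"Basmati"
decodificado_igual = valor.decode("utf-8") == "Basmati"

print(igual_a_repr, igual_a_str, igual_a_bytes, decodificado_igual)
```

This is why filtering with "b'Arborio'" as an ordinary string finds nothing: that string contains the characters b, ', and so on, while the stored value contains only the name itself as bytes.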

The solution to your problem is to pass the strings "Arborio" and "Ipsala" to pandas as byte strings, by placing a b in front of them:

dados = dados[dados['CLASS'].isin([b"Arborio", b"Ipsala"])]
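As a rough, self-contained illustration, here is the same filter on a tiny stand-in DataFrame. The numbers are invented, and the CLASS column is assumed to hold bytes as in the question. It also shows an alternative that is not part of the fix above: decoding the column once with `Series.str.decode` so that ordinary strings work afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the rice data; values are made up for illustration.
dados = pd.DataFrame({
    "MINOR_AXIS": [1.0, 2.0, 3.0, 4.0],
    "MAJOR_AXIS": [5.0, 6.0, 7.0, 8.0],
    "CLASS": [b"Basmati", b"Arborio", b"Jasmine", b"Ipsala"],
})

# Option 1: compare against byte strings directly.
filtrado = dados[dados["CLASS"].isin([b"Arborio", b"Ipsala"])]

# Option 2: decode the column once, then filter with ordinary strings.
decodificado = dados.assign(CLASS=dados["CLASS"].str.decode("utf-8"))
filtrado_str = decodificado[decodificado["CLASS"].isin(["Arborio", "Ipsala"])]

print(filtrado["CLASS"].tolist())
print(filtrado_str["CLASS"].tolist())
```

Decoding up front can be convenient if the class names are also used for plot legends, since the b'...' prefix would otherwise show up in the labels.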
The_spider