I'm working on an ML problem in which I have to predict the probability that a blight ticket would be issued to a person. I tried to set ticket_id as the DataFrame's index, but the output looks odd and I don't know why.

import pandas as pd
import numpy as np

def blight_model():
    train = pd.read_csv('train.csv', encoding = "ISO-8859-1")
    test = pd.read_csv('readonly/test.csv', encoding = "ISO-8859-1")
    address = pd.read_csv('readonly/addresses.csv', encoding = "ISO-8859-1")
    # X = data.iloc[:, 0:33]  # independent columns
    # y = data.iloc[:, -1]    # target column, i.e. price range
    common_cols_to_drop = ['agency_name', 'inspector_name', 'mailing_address_str_number',
                           'violator_name', 'violation_street_number', 'violation_street_name',
                           'mailing_address_str_name', 'admin_fee', 'violation_zip_code',
                           'state_fee', 'late_fee', 'ticket_issued_date', 'hearing_date', 'violation_description',
                           'fine_amount', 'clean_up_cost', 'disposition', 'grafitti_status',
                           'violation_code', 'city']
    train_cols_to_drop = ['payment_status', 'payment_date', 'balance_due', 'payment_amount','compliance_detail', 'collection_status'] + common_cols_to_drop
    train = train.drop(train_cols_to_drop, axis=1).set_index('ticket_id')
    train = train[np.isfinite(train['compliance'])]

    return train.head()

This is the result I'm getting. What's with the ticket_id?

[Screenshot of the output, hosted on Imgur]

jukebox
  • See the examples at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html. I believe you are setting the index on the result of the drop, not on the original table. You should call set_index right after reading the CSV into the train variable. – Filip Bartoš May 04 '19 at 07:40
  • My solution here has more information about the placement of index and column names: https://stackoverflow.com/questions/55027108/pandas-rename-index/55028542#55028542. It's necessary so you always know which is which. – ALollz May 04 '19 at 16:01

1 Answer

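pandas displays it this way so you can tell that ticket_id is the index rather than a regular column: the index name is printed on its own row, below the column headers.

You are getting that because you set it as the index with this call: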

.set_index('ticket_id')

in this line:

train = train.drop(train_cols_to_drop, axis=1).set_index('ticket_id')
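
A minimal sketch of the same behavior on a toy frame may make this easier to see. The data values below are made up for illustration; only the column names ticket_id, agency_name and compliance come from the question. It also shows two optional ways to change the display if the extra row is unwanted.

```python
import pandas as pd

# Toy data: values are invented, only the column names mirror the question.
df = pd.DataFrame({
    "ticket_id": [101, 102, 103],
    "agency_name": ["A", "B", "C"],
    "compliance": [0.0, 1.0, 0.0],
})

# Same pattern as in the question: drop columns, then set the index.
indexed = df.drop(columns=["agency_name"]).set_index("ticket_id")

# ticket_id is now the index, so it no longer appears among the columns.
print(indexed.index.name)     # ticket_id
print(list(indexed.columns))  # ['compliance']

# When printed, the index name sits on its own row under the column header.
print(indexed)

# Optional: clear the index name so that extra row is not shown.
unnamed = indexed.rename_axis(None)
print(unnamed)

# Optional: turn ticket_id back into an ordinary column.
restored = indexed.reset_index()
print(restored)
```

Either option only changes how the frame is labeled; the ticket_id values themselves are kept in both cases.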
Klemen Koleša