I'm currently building a model for anomaly detection and am still preparing the data before applying the model. The entity I want to classify as anomalous or not has information associated with it in a one-to-many relationship. I'm curious whether there are any best practices for feeding information from such a relationship into a table where each row corresponds to one instance, and later into the model. Is my approach valid, or are there better methods? Is there any way to optimize it?

Originally, the data is organized as follows:

Each instance has N associated rows in another table. Each of these rows corresponds to an object that has been provided in connection with the entity in question. Each object is associated with a quantity and a corresponding value. Note that there is a vast pool of different objects available, and only a small set of objects is used in connection with any one entity.

I implemented an approach somewhat similar to one-hot encoding. I created a feature for every possible objecttype. Then I looped through the table containing the objects, and for each row I added the value of the object to the respective column (of the respective objecttype) in the target table.

import pandas as pd

# X: DataFrame containing the entities I want to classify
# data: DataFrame containing several rows for each entity in X; each row associates
#       an entity of X with an object, its type, quantity and value
target = pd.DataFrame()
# create a feature for each possible objecttype
for _object in data.objecttype.unique():
    target[_object] = pd.Series(dtype='float64')
# add the ids of all entities
target['id'] = X.id.unique()
target.fillna(0.0, inplace=True)
target.set_index('id', inplace=True)

# loop through the data table and write the value of each object to the column of
# its objecttype in the target table, for the entity specified by id
for index, row in data.iterrows():
    target.loc[row['id'], row['objecttype']] = row['value']

The code seems to work, but its runtime is terrible. I'm also not sure whether there are better methods for incorporating this kind of information.
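For comparison, the same reshaping can probably be expressed without a Python-level loop using `pivot_table`. This is only a minimal sketch on made-up toy data: the column names `id`, `objecttype`, `quantity` and `value` follow the description above, the sample values are hypothetical, and `aggfunc="sum"` is an assumption about how repeated objects of the same type should be combined (the loop above would keep only the last value instead).

```python
import pandas as pd

# Hypothetical toy data mirroring the structure described in the question.
X = pd.DataFrame({"id": [1, 2, 3]})
data = pd.DataFrame({
    "id": [1, 1, 2],
    "objecttype": ["a", "b", "a"],
    "quantity": [2, 1, 5],
    "value": [10.0, 4.0, 7.5],
})

# One row per entity, one column per objecttype, filled with the object's value.
# aggfunc="sum" is an assumption for entities with several objects of one type.
wide = data.pivot_table(
    index="id",
    columns="objecttype",
    values="value",
    aggfunc="sum",
    fill_value=0.0,
)

# Entities without any associated object (id 3 here) get an all-zero row.
all_ids = pd.Index(X["id"].unique(), name="id")
target = wide.reindex(all_ids, fill_value=0.0)

print(target)
```

Because most entities use only a few of the available object types, the resulting table is mostly zeros, so a sparse representation may be worth considering if memory becomes a problem.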

UPDATE: I added three files to https://github.com/rubenweinstock/stackoverflow_questions that represent the structure of the data I'm dealing with (entity, objects) and the result.
