
I am doing an assignment for a machine learning class in Python. I started learning Python just yesterday, so I am not familiar with common Python practices.

Part of my task is to load data from a CSV file (a 2D array, let's call it arr_2d) and normalize it.

I've found sklearn and numpy solutions online, but they expect a 2D array as input.

My approach, after loading arr_2d, is to parse it into an array of objects (data: [HealthRecord]).

My solution was code similar to this (note: roughly pseudocode):

result = []  # 2D list of property values, one row per attribute
for key in ['age', 'height', 'weight']:  # ... plus the remaining fields
    # collect this attribute from every record in data
    tmp = [getattr(item, key) for item in data]
    result.append(tmp)

result now contains 3 * len(data) items, and I would use sklearn to normalize each row of result, then rotate (transpose) it back and parse the normalized values into HealthRecord objects.

I see this as overcomplicated, and I would like to find an easier way to do it, for example passing [HealthRecord] directly to sklearn.normalize.
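
For reference, the full round trip I have in mind would look roughly like this (a sketch, assuming min-max scaling is the kind of normalization I want, and using the arr_2_obj helper shown further down):

import numpy as np
from sklearn.preprocessing import minmax_scale

features = ['age', 'height', 'weight']

# one row per attribute, as built by the loop above
result = np.array([[getattr(item, key) for item in data] for key in features])

# normalize each row (one attribute) independently
normalized = np.array([minmax_scale(row) for row in result])

# rotate back so each row is a record again, then rebuild the objects
normalized_records = [arr_2_obj(row) for row in normalized.T]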


Code below shows my (simplified) loading and parsing:

class Person: 
    age: int
    height: int
    weight: int
    

def arr_2_obj(row: list) -> Person:
    person = Person()
    person.age = row[0]
    person.height = row[1]
    person.weight = row[2]

    return person


# age (days), height (cm), weight (kg)
rows = [
    [60*365, 125, 65],
    [30*365, 195, 125],
    [13*365, 116, 53],
    [16*365, 164, 84],
    [12*365, 125, 96],
    [10*365, 90, 46],    
]

parsed = []

for row in rows:
    parsed.append(arr_2_obj(row))

Note: the Person class here stands for HealthRecord.
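
The actual loading from the CSV file would look roughly like this (a sketch using the standard csv module; the file name and the header handling are just placeholders):

import csv

with open('health_records.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row, if there is one
    parsed = [arr_2_obj([float(value) for value in row]) for row in reader]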

Thank you for any input or insights.

Edit: typo sci-learn -> sklearn

Chiffie
  • I'd like to add that the real size of the parsed CSV is 70000*13. Also, I am parsing the data into a class for easier manipulation. After loading the dataset, I clean rows with incorrect or out-of-range values and encode text values as numbers. – Chiffie Oct 05 '20 at 12:59
  • Does this answer your question? [How to normalize an array in NumPy?](https://stackoverflow.com/questions/21030391/how-to-normalize-an-array-in-numpy) – Joe Oct 05 '20 at 13:34
  • @Joe No, it does not. I am aware of the thread you mentioned, and as I stated in my question, I was looking for another approach. That thread takes a 2D array as input, while I'd like to pass an array of objects to normalize. – Chiffie Oct 05 '20 at 15:23

1 Answer


You can't. In practice, you're dealing with tabular data. The standard (as in most popular, not standard-library) package in Python for processing tabular data is pandas, so you can do something like:

import pandas as pd
df = pd.DataFrame([d.__dict__ for d in data])
normalized_df = (df-df.mean())/df.std() # example normalization 
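
If you want to stick with sklearn for the normalization itself, its scalers also accept a DataFrame directly (a sketch, assuming column-wise min-max scaling is what you're after):

from sklearn.preprocessing import MinMaxScaler

# scale every column to [0, 1]; fit_transform returns a plain numpy array
scaled = MinMaxScaler().fit_transform(df)
normalized_df = pd.DataFrame(scaled, columns=df.columns)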

If you insist on dealing with arrays of objects instead of tables, you can write a class which does the required conversions to shorten the notation, e.g. something like:

class ObjectList:
    def __init__(self, object_type, records):
        # build one object of the given type per record (each record is a dict of field values)
        self.objects = [object_type(**record) for record in records]

    def to_data_frame(self):
        # convert the objects back into tabular form for pandas/sklearn
        return pd.DataFrame([d.__dict__ for d in self.objects])

class PersonList(ObjectList): 
    def __init__(self, records): 
        super().__init__(Person, records)

The above assumes that the class Person has an __init__ method accepting the arguments age, height, and weight.
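
For example, continuing from the definitions above and redefining Person as a dataclass so that it gets such an __init__ for free (the sample rows reuse values from the question):

from dataclasses import dataclass

@dataclass
class Person:
    age: int
    height: int
    weight: int

records = [
    {'age': 60 * 365, 'height': 125, 'weight': 65},
    {'age': 30 * 365, 'height': 195, 'weight': 125},
]

people = PersonList(records)
df = people.to_data_frame()
normalized_df = (df - df.mean()) / df.std()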

You can also try to shorten the notation further by overloading operators, but unless you're writing library code I don't see why you would want to.

Yuri Feldman
  • Thank you for your help. I'm marking 'you can't' as the right answer. I was already loading the CSV with pandas as a `DataFrame`, but using it on my 4 MB CSV (70000 rows, 13 cols) takes much, much longer. What approach and structure do you propose? Edit: by longer I meant that working with DataFrame.values instead of the whole DataFrame was much faster. – Chiffie Oct 05 '20 at 15:26
  • pandas.read_csv is the fastest way I know to read a CSV in Python (as opposed to the numpy functions for that); after that, it depends on what computations you're doing with it. Generally, for machine learning you need your data in numeric form, i.e. as numpy ndarrays or tensors; you may need to translate categorical fields into one-hot vectors, drop fields you're not going to use directly such as strings, etc. In any case, you eventually need a bunch of matrices for the computations. – Yuri Feldman Oct 05 '20 at 15:47
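
A rough sketch of the pipeline described in that last comment (the file name and column names here are hypothetical):

import pandas as pd

df = pd.read_csv('health_records.csv')

# drop fields that won't be used directly, e.g. free-text columns
df = df.drop(columns=['name'])

# translate categorical fields into one-hot vectors
df = pd.get_dummies(df, columns=['smoker'])

# hand the resulting numeric matrix to the ML code
X = df.to_numpy()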