Data separation for ML

Question

I have imported a data set for a Machine Learning project. I need each "Neuron" in my first input layer to contain one numerical piece of data. However, I have been unable to do this. Here is my code:

import math
import numpy as np
import pandas as pd; v = pd.read_csv('atestred.csv', 
error_bad_lines=False).values
rw = 1
print(v)
for x in range(0,10):
    rw += 1
    s = (v[rw])
list(s)
#s is one row of the dataset 
print(s)#Just a debug.
myvar = s
class l1neuron(object):
    def gi():
        for n in range(0, len(s)):
            x = (s[n])
            print(x)#Just another debug 
n11 = l1neuron
n11.gi()

What I would ideally like is a variant of this where the code creates a new variable for every row it extracts from the data (what I try to do in the first loop) and a new variable for every piece of data extracted from each row (what I try to do in the class and the second loop).

If I have been completely missing the point with my code then feel free to point me in the right direction for a complete re-write.

Here are the first few rows of my dataset:

fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5

Thanks in advance.

Please post some data i.e. a minimal example of your dataset. — Cleb, Dec 30 '17 at 20:29
One thing jumps out: you shouldn't have to manually loop through `v`. It should already be a numpy array of values from `atestred.csv`. — Peter Leimbigler, Dec 31 '17 at 04:39
If so, how should I separate the data values and assign each one to a variable, preferably inside the neuron class.@PeterLeimbigler — 3141, Dec 31 '17 at 12:34
I'm afraid it is pretty unclear what you want to do. As already stated above, all your data are already stored in `v` and you can easily access each value by indexing. For instance, `v['citric acid'][2]` gives you the value for citric acid in the third row of `v`. If you want to create a different variable for each row-column pair, how would you want to name them and how would your later code know these names? — Thomas Kühn, Feb 12 '18 at 13:25
Maybe this is just another [xy problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) and instead of trying to solve the problem you formulate, you should tell us what you *actually* want to do and we can tell you how to solve that *actual problem*. — Thomas Kühn, Feb 12 '18 at 13:28
might be OT, but I would strongly suggest you to use some framework for neural networks (keras), or research more on how neural nets work. There are numerous examples on the web implementing vanilla neural nets from scratch. From the example you posted I feel you don't really know what you are doing (no offense) — redacted, Feb 12 '18 at 13:29
Your class does not have a constructor and you are not calling the constructor using (). I'm not sure why you would need a class, but I guess it's a stub? — noumenal, Feb 12 '18 at 14:37
I was trying to use the class to get around the problem before I posted @noumenal — 3141, Feb 12 '18 at 15:34
Could you give an example of the expected output, given a minimal input test case? (Classes are good for gathering objects, often with a group of behaviors. but you probably just need to populate an array - depending on the number of dimensions.) What is the current output of your code? What ML algorithm do you intend to implement? — noumenal, Feb 12 '18 at 15:57
I am attempting to create a neural network trained through backpropagation without any ML libraries. At the moment, all I need is for each "piece" of data to be assigned to its own variable, or at least something that has the same effect. — 3141, Feb 12 '18 at 16:08
There is a misunderstanding here. You should not create a variable for each value. The first layer of your neural network has to have 12 neurons (the count of columns in your data). Then row by row you have to supply those values in rows. — Emmet B, Feb 17 '18 at 14:54

score 2 · Answer 1 · answered Feb 14 '18 at 19:08

If I understand your problem correctly, you would like to convert each row of your CSV table into a separate variable that in turn holds all the values of that row. Here is an example of how you might approach this. There are many ways to do it, and others may be more efficient, faster, more Pythonic, hipper or whatever. But the code below was written to help you understand how to store tabular data in named variables.

Two remarks:

If reading the data is the only thing you need pandas for, you might look for a less complex solution (for example, the standard-library csv module).
The L1Neuron class is not very transparent, because its members cannot be read from the code; they are created at runtime from the names passed in attr. You may want to have a look at collections.namedtuple for better readability instead.

import pandas as pd 
from io import StringIO
import numbers


# example data:
atestred = StringIO("""fixed acidity;volatile acidity;citric acid;\
residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;\
density;pH;sulphates;alcohol;quality
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
""")



# read example data into dataframe 'data'; extract values and column names:
# note: on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3)
data     = pd.read_csv(atestred, on_bad_lines='skip', sep=';')
colNames = list(data)



class L1Neuron(object):
    "neuron class that holds the variables of one data line"

    def __init__(self, **attr):
        """
        attr is a dict (like {'alcohol': 12, 'pH':7.4});
        every pair in attr will result in a member variable 
        of this object with that name and value"""
        for name, value in attr.items():
            setattr(self, name.replace(" ", "_"), value)

    def gi(self):
        "print all numeric member variables whose names don't start with an underscore:"
        for v in sorted(dir(self)):
            if not v.startswith('_'):
                value = getattr(self, v) 
                if isinstance(value, numbers.Number): 
                    print("%-20s = %5.2f" % (v, value))
        print('-'*50)


# read csv into variables (one for each line):        
neuronVariables = []        
for s in data.values:
    variables   = dict(zip(colNames, s))
    neuron      = L1Neuron(**variables)
    neuronVariables.append(neuron)

# now the variables in neuronVariables are ready to be used:     
for n11 in neuronVariables:
    print("free sulphur dioxide in  this variable:", n11.free_sulfur_dioxide, end = " of ")
    print(n11.total_sulfur_dioxide,  "total sulphur dioxide" )
    n11.gi()
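
As a rough illustration of the namedtuple alternative mentioned in the remarks, here is a minimal sketch that reads the same sample rows with only the standard library. The name WineRow is made up for this example, and it assumes every field in the file is numeric.

```python
import csv
from collections import namedtuple
from io import StringIO

# Same three sample rows as in the question (semicolon-delimited, header first).
sample = StringIO(
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
    "7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5\n"
)

reader = csv.reader(sample, delimiter=";")
header = next(reader)

# WineRow is a hypothetical name. The fields are the column names with
# spaces replaced by underscores so that they are valid identifiers.
WineRow = namedtuple("WineRow", [name.replace(" ", "_") for name in header])

# One named tuple per data row; assumes every field parses as a float.
rows = [WineRow(*(float(x) for x in line)) for line in reader]

first = rows[0]
print(first.citric_acid, first.pH, first.quality)
```

Unlike the L1Neuron class, the available fields are visible via WineRow._fields, and each row is immutable.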

Thanks, I actually wanted every number in the table to be its own variable, but I'm sure I can do that by extending your code. — 3141, Feb 16 '18 at 21:38

score 1 · Answer 2 · answered Feb 17 '18 at 08:42

If this is for a machine learning project, I would recommend loading your CSV into a numpy array for ease of manipulation. You could store every value in the table as its own variable, but that would cost you performance by preventing you from using vectorized operations, and it would make your data more difficult to work with. I'd suggest this:

from numpy import genfromtxt
# the file in the question is semicolon-delimited and starts with a header row
my_data = genfromtxt('my_file.csv', delimiter=';', skip_header=1)
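
To show why a single array has the same effect as one variable per value, here is a small sketch. It assumes the semicolon-delimited layout from the question and reads the three sample rows from memory instead of from a file. The min-max scaling at the end is only an example of a vectorized operation, not something the question asks for.

```python
from io import StringIO

import numpy as np

# Sample rows from the question, read from memory instead of a file.
sample = StringIO(
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
    "7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5\n"
)

# skip_header=1 drops the row of column names.
my_data = np.genfromtxt(sample, delimiter=";", skip_header=1)

print(my_data.shape)        # (number of rows, number of columns)
first_row = my_data[0]      # all values of the first data row
value = my_data[2, 2]       # third row, third column (citric acid)

# Example of a vectorized operation: min-max scale every column at once.
col_min = my_data.min(axis=0)
col_max = my_data.max(axis=0)
# Avoid dividing by zero for columns whose values are all identical.
span = np.where(col_max > col_min, col_max - col_min, 1.0)
scaled = (my_data - col_min) / span
```

Every value stays addressable by its row and column index, so no separate variable names are needed.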

If your machine learning problem is supervised, you'll also want to split your labels into a separate data structure. If you're doing unsupervised learning, though, a single data structure will suffice. If you provide additional context on the problem you're trying to solve, we can offer more specific guidance.
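
For the supervised case, a split could look roughly like the sketch below. It assumes the last column (quality) is the label, which the question does not actually state. The weight matrix W and the four hidden units are placeholders, included only to show one row being fed to the input layer at a time.

```python
from io import StringIO

import numpy as np

sample = StringIO(
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
    "7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5\n"
)
data = np.genfromtxt(sample, delimiter=";", skip_header=1)

# Assumption: the last column ("quality") is the label to predict.
X = data[:, :-1]    # features: one column per input neuron
y = data[:, -1]     # labels: one value per row

# Hypothetical weights for a hidden layer with 4 units (placeholder values).
rng = np.random.default_rng(0)
W = rng.normal(size=(X.shape[1], 4))

activations = []
for inputs, target in zip(X, y):
    # inputs[i] is the value presented to input neuron i for this row.
    hidden = np.tanh(inputs @ W)
    activations.append(hidden)

print(X.shape, y.shape, len(activations))
```

With this layout the input layer has one neuron per feature column, and each row of X is one training example.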
