Neural Nets/Machine Learning, how to turn data into numbers?

Question

Not sure how to ask this but here it goes. So I have been trying to understand machine learning and the use of neural networks.

I have a simple example of a learning neural network in C#. I understand what the code is doing at this point; it's pretty simple. I have a "Patterns.csv" file. It contains an x input, a y input, and a 0 or 1 for yes or no.

0.11, 0.82, 0
0.13, 0.17, 0
0.20, 0.81, 0
0.21, 0.57, 1
0.25, 0.52, 1
0.26, 0.48, 1

This Patterns.csv is used to train the network, so if I manually input similar x and y values, it will give me a 1 or a 0 based on the patterns I have.

Now my problem is: how can I turn actual data into the x and y inputs? For example, an image, or strings for a simple spam filter? I just really don't understand how I can turn actual data into two float numbers.

I'm assuming this would be the correct way to use this simple neural network example. If anyone has any ideas, explanations, or a cool method to do this, please feel free to post anything relevant. Thanks!

score 1 · Answer 1 · edited May 23 '17 at 11:52

This article contains a basic algorithm for so-called "data normalization".

What you have to do is to convert data like

Lives in | IsMarried
Chicago  | 1
New York | 1
New York | 0
...

Into:

Chicago | New York | IsMarried
1       | 0        | 1
0       | 1        | 1
0       | 1        | 0
...

I bet there are other techniques out there, but this is the one we use in our supervised machine learning lecture this semester.

As soon as you have this normalized matrix, you can use any clustering / machine learning algorithm.
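
The conversion shown above is commonly called one-hot (1-of-N) encoding. Here is a minimal sketch of it in plain Python, assuming the data is held as a list of dictionaries. The function name one_hot_encode and the row layout are made up for illustration and are not from the lecture or the linked article.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with a 0/1 column per distinct value."""
    # Collect the distinct categories in first-seen order so the
    # generated columns come out in a stable, predictable order.
    categories = []
    for row in rows:
        if row[column] not in categories:
            categories.append(row[column])

    encoded = []
    for row in rows:
        # One new column per category: 1 if this row has it, else 0.
        new_row = {cat: 1 if row[column] == cat else 0 for cat in categories}
        # Copy the remaining columns over unchanged.
        new_row.update({k: v for k, v in row.items() if k != column})
        encoded.append(new_row)
    return encoded


rows = [
    {"Lives in": "Chicago", "IsMarried": 1},
    {"Lives in": "New York", "IsMarried": 1},
    {"Lives in": "New York", "IsMarried": 0},
]

for encoded_row in one_hot_encode(rows, "Lives in"):
    print(encoded_row)
```

Each printed row has one 0/1 entry per city plus the untouched IsMarried value, matching the second table.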

Also have a look here. This post explains why this encoding / normalization is needed.

Then why not just replace Chicago with 0, New York with 1, etc.?

That's not a good idea, because some machine learning algorithms treat the difference between values as a "distance". Therefore, Chicago (0) and New York (1), with a distance of 1, wouldn't get the same "dissimilarity rating" as New York (1) and the 100th city (99), with a distance of 98, even though no pair of cities is inherently more alike than any other.
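
To make the distance argument concrete, here is a small sketch that compares Euclidean distances under the two encodings. It assumes 100 cities labelled 0 to 99 (Chicago = 0, New York = 1); the helper names are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


num_cities = 100                      # assumed number of distinct cities
chicago, new_york, last_city = 0, 1, 99

# Integer labels: the distance depends on the arbitrary label order.
label_near = euclidean([chicago], [new_york])
label_far = euclidean([new_york], [last_city])

# One-hot vectors: every pair of different cities is equally far apart.
onehot_near = euclidean(one_hot(chicago, num_cities), one_hot(new_york, num_cities))
onehot_far = euclidean(one_hot(new_york, num_cities), one_hot(last_city, num_cities))

print("integer labels:", label_near, label_far)
print("one-hot:       ", onehot_near, onehot_far)
```

With integer labels the two distances differ widely, while with one-hot vectors both come out as the square root of 2, so no city looks "closer" to another just because of how it was numbered.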

score 1 · Answer 2 · answered Jan 21 '16 at 18:55

The keyword for your search is encode. There is a good article:

https://visualstudiomagazine.com/articles/2013/07/01/neural-network-data-normalization-and-encoding.aspx

which does a good job of explaining the concept. Here is an excerpt demonstrating a trick to help with training:

An example of independent categorical data is a predictor variable community, which can take values "suburban," "rural" or "city." For such data I recommend using what's often called 1-of-(C-1) effects encoding. Effects encoding is not obvious and is best explained by example:

   suburban = [ 0.0,  0.0,  1.0] 
   rural    = [ 0.0,  1.0,  0.0] 
   city     = [-1.0, -1.0, -1.0]
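
Here is a rough sketch of effects encoding in Python, assuming the usual definition: with C categories, each of the first C-1 categories gets a vector of length C-1 containing a single 1.0, and the last category gets all -1.0. Under that assumption three categories produce two-component vectors, so the output is shorter than the vectors printed in the excerpt. The function name effects_encode is made up and is not taken from the article.

```python
def effects_encode(categories):
    """Map each category to a 1-of-(C-1) effects-encoded vector."""
    count = len(categories)
    if count < 2:
        raise ValueError("effects encoding needs at least two categories")

    width = count - 1
    encoding = {}
    # Every category except the last gets a single 1.0. The slot is filled
    # from the right, mirroring the ordering used in the excerpt.
    for i, category in enumerate(categories[:-1]):
        vector = [0.0] * width
        vector[width - 1 - i] = 1.0
        encoding[category] = vector
    # The last category is encoded as all -1.0 instead of all zeros.
    encoding[categories[-1]] = [-1.0] * width
    return encoding


codes = effects_encode(["suburban", "rural", "city"])
for name, vector in codes.items():
    print(name, vector)
```

The resulting vectors can be fed to the network in place of the original string, one input node per vector component.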
