
I need to develop an autoencoder using TensorFlow. When I check the documentation and tutorials, I see many examples with image data such as MNIST, which is pre-processed numerical data.

Whereas in my case the data is in text format, like:

 uid       orig_h       orig_p   trans_depth      method       host
======================================================================
5fg288   192.168.1.4      80       1               POST       ex1.com
2fg888   192.168.1.3      80       2               GET        ex2.com

So how can I convert this data to a numerical format that TensorFlow accepts? I couldn't find any example in the TensorFlow tutorials.

I am a beginner in TensorFlow; please help.

Update

Based on the instructions below, I have created a word-to-vector mapping by referring to the tutorial here.

The input is in a pandas DataFrame:

   host       method   orig_h        orig_p      trans_depth     uid
0  ex1.com    POST    192.168.1.4      80            1          5fg288
1  ex2.com   GET      192.168.1.3     443            2          2fg888

And

 Bag of words ---> ['5fg288', '2fg888', '80', 'GET', '443', '1', 'ex2.com', '192.168.1.4', '192.168.1.3', '2', 'ex1.com', 'POST']

Now for each cell I have an array of values, like:

192.168.1.4 ---> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
ex1.com     ---> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
80          ---> [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
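The mapping above can be reproduced with a small sketch (the vocabulary is the bag of words from the sample data; `one_hot` is a hypothetical helper name):

```python
import numpy as np

# Bag of words built from every distinct token in the table
vocab = ['5fg288', '2fg888', '80', 'GET', '443', '1',
         'ex2.com', '192.168.1.4', '192.168.1.3', '2', 'ex1.com', 'POST']

def one_hot(token, vocab):
    """Return a vector of zeros with a 1.0 at the token's vocabulary index."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab.index(token)] = 1.0
    return vec

print(one_hot('192.168.1.4', vocab))  # 1.0 at index 7, zeros elsewhere
```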

So how can I reshape this data to feed to TensorFlow?

Should it be like:

data = array([
[[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...]],
[[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...],[0.0,...]]
])

That is, each feature is an array of floats, and there are 6 features in a single sample. Is that possible?
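As a minimal NumPy sketch of that shape, using the sample rows and vocabulary from above (a feed-forward autoencoder would typically want each sample flattened to one vector, shown at the end):

```python
import numpy as np

vocab = ['5fg288', '2fg888', '80', 'GET', '443', '1',
         'ex2.com', '192.168.1.4', '192.168.1.3', '2', 'ex1.com', 'POST']

def one_hot(token):
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab.index(token)] = 1.0
    return vec

# The two rows of the DataFrame, one token per feature
rows = [
    ['ex1.com', 'POST', '192.168.1.4', '80', '1', '5fg288'],
    ['ex2.com', 'GET', '192.168.1.3', '443', '2', '2fg888'],
]

# Shape (n_samples, n_features, vocab_size): one one-hot vector per cell
data = np.array([[one_hot(tok) for tok in row] for row in rows])
print(data.shape)  # (2, 6, 12)

# A plain feed-forward autoencoder expects one flat vector per sample,
# so concatenate the 6 per-feature vectors:
flat = data.reshape(len(rows), -1)
print(flat.shape)  # (2, 72)
```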

CodeDezk
    You will want to write your own `input_fn`, which reads your raw data and converts it to tensors. Take a look at https://www.tensorflow.org/get_started/datasets_quickstart which gives some examples for how to do this. – Zvika Apr 09 '18 at 10:26
  • Actually I have already read the data to pandas dataframe, and now I have to convert to tensorflow input format. – CodeDezk Apr 09 '18 at 10:39
  • In that case you just need [`tf.estimator.inputs.pandas_input_fn`](https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/pandas_input_fn). – Zvika Apr 09 '18 at 11:41
  • Is your set of input strings finite in number? If so, how many are they? If not, what sort of relation do you expect to connect your input to your output? Do you expect new, possibly unseen strings to occur during deployment? – KonstantinosKokos Apr 11 '18 at 14:29

1 Answer


TensorFlow accepts data in NumPy format. A pandas DataFrame can be converted to a NumPy array with `df.values` (the `df.as_matrix()` method that used to do this is deprecated). But the crux of your question is how to convert these various data types into continuous numeric representations for a neural network (or any machine learning method).
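For the conversion step itself, a quick sketch (the two numeric columns are taken from the sample data in the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'orig_p': [80, 443],
    'trans_depth': [1, 2],
})

# `.values` returns the underlying NumPy array (replaces the
# deprecated df.as_matrix()); this is what TensorFlow can consume.
matrix = df.values
print(matrix.shape)  # (2, 2)
```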

The answer linked below provides some helpful references to the scikit-learn documentation, which discusses the details, too numerous to rewrite here:

Machine learning with multiple feature types in python

Some of your data will translate easily after reading that guide, such as `trans_depth`, `orig_p`, and `method`, which appear to be categorical data. In cases like this, you convert each of them into multiple features of {1,0} values that represent whether that class is present or not. For example, `orig_p` might be represented as two features, x1 and x2: x1=1 if orig_p=80 and 0 otherwise, and x2=1 if orig_p=443 and 0 otherwise.
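That {1,0} expansion is exactly what `pandas.get_dummies` does in one call (scikit-learn's `OneHotEncoder` is an alternative); a sketch with the sample columns:

```python
import pandas as pd

df = pd.DataFrame({'orig_p': [80, 443],
                   'method': ['POST', 'GET'],
                   'trans_depth': [1, 2]})

# One {1,0} column per category value: orig_p_80, orig_p_443,
# method_GET, method_POST, trans_depth_1, trans_depth_2
encoded = pd.get_dummies(df, columns=['orig_p', 'method', 'trans_depth'])
print(list(encoded.columns))
```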

You might do the same with `host`, but you may have to think about how, and whether, you really want to use the host. For example, if you consider it important, you could define a categorical feature that identifies only the .com, .edu, .org, etc. domains, because individual hostnames might be too numerous to represent.
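For instance, keeping only the top-level domain as the categorical value (a hypothetical helper; the host names besides `ex1.com` are made up):

```python
hosts = ['ex1.com', 'campus.ex2.edu', 'ex3.org']

def tld(host):
    """Keep only the part after the last dot, e.g. 'ex1.com' -> 'com'."""
    return host.rsplit('.', 1)[-1]

print([tld(h) for h in hosts])  # ['com', 'edu', 'org']
```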

You might also consider clustering hostnames into categories of hosts based on some database (if such a thing exists), and use the cluster which the hostname belongs to as a categorical feature.

For `orig_h` you might consider grouping IPs by region and defining a categorical feature per region.
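If no region database is at hand, a simpler stand-in for "region" is grouping addresses by subnet; a sketch using the standard-library `ipaddress` module (the /24 granularity and the third IP are assumptions for illustration):

```python
import ipaddress

ips = ['192.168.1.4', '192.168.1.3', '10.0.5.9']

def subnet(ip):
    """Map an address to its /24 network, used here as the group label."""
    return str(ipaddress.ip_network(ip + '/24', strict=False))

print([subnet(ip) for ip in ips])
# ['192.168.1.0/24', '192.168.1.0/24', '10.0.5.0/24']
```

Both sample hosts fall into the same group, so they would share one categorical value.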

`uid` looks to be unique per user, so you might not use that column of data.

You will need to think this through per feature. Start with the documentation I linked to; in general, this is a question of standard data mining, and any good book on data mining will be invaluable in understanding these concepts further. Here's one that's easy to find online via a Google search:

https://books.google.com/books/about/Data_Mining_Concepts_and_Techniques.html?id=pQws07tdpjoC&printsec=frontcover&source=kp_read_button#v=onepage&q&f=false

I will also include the following reference because they provide the best tutorials I've seen, hands down, and their introduction-to-ML section has a set of articles that will be very useful to read. It's slightly tangential to the question, but I expect it will be useful:

https://github.com/aymericdamien/TensorFlow-Examples

David Parks
  • Thanks for the feedback. Right now I am following the tutorial here https://pythonprogramming.net/preprocessing-tensorflow-deep-learning-tutorial/?completed=/using-our-own-data-tensorflow-deep-learning-tutorial/ and I have implemented _1. tokenize the words, 2. used the bag-of-words index to represent each word in the dataframe,_ _3. then used these data for training._ Please let me know if I am moving in the right direction. – CodeDezk Apr 11 '18 at 16:06
  • If you are dealing with text this is one standard approach, and a good one to learn first. Text analysis is also done using Recurrent Neural Networks (the last link there has examples), which handle sequences. In that case you convert the word to an embedded representation using a pre-trained word2vec model such as Glove: https://nlp.stanford.edu/projects/glove/ – David Parks Apr 11 '18 at 16:09
  • Preprocessing sequences for RNNs is a little more involved (e.g. there's a learning curve here), but here's a good article on that process: https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html – David Parks Apr 11 '18 at 16:11
  • I have one more query. I have found one answer here https://stackoverflow.com/questions/42302498/converting-tensorflow-tutorial-to-work-with-my-own-data which says there exists a method _tf.string_to_number_ which will convert a string to a number. Will it be helpful in my case? Actually I am confused; I can see several methods available, like one-hot vector, word tokenizer, etc. Please let me know your feedback. – CodeDezk Apr 11 '18 at 16:16
  • That just converts a string `"1"` into a number `1`. Your confusion is well founded. There are many approaches to text. Bag of words has been around a long time and works decently well; it takes a statistical look at the frequency of words. RNNs outperform this approach on tasks such as translation. In those cases representing a word as a 1-hot vector is unreasonable because your dictionary size is 2M+ words. word2vec is a method of reducing this to a meaningful few hundred values, so that the small embedded representation of the words can be treated as a sequence and learned from efficiently. – David Parks Apr 11 '18 at 16:22
  • Learning bag of words models is a good way to introduce yourself. You'll get good results. If you're not familiar with RNNs yet then treat them and word2vec as "knowing what you don't know", and be aware that it'll take some time to fully learn all the details. What you learn from bag of words models will make understanding RNNs and other techniques easier in the future, so time well spent. – David Parks Apr 11 '18 at 16:24
  • Thanks for your valuable information. My ultimate goal is to find anomalies in a network-generated log file (sample shown in my question) using unsupervised machine learning. I have no labeled data as input, so using a pre-trained (unsupervised) model (with the previous day's data) I have to find anomalies in the newly generated log. When I searched the net I found that an autoencoder can be used in such a case, so I am trying to use one example in TensorFlow https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/autoencoder.ipynb – CodeDezk Apr 11 '18 at 16:31
  • I am still having some issues; can you find some time to help me? I can provide the code I am working on right now. – CodeDezk Apr 12 '18 at 14:42
  • Sure, but it is best to open a new question, which you're very welcome to do and welcome to post a reference to here. But it's not good form to continue long comment threads. It's better for the community as a whole to keep each question focused on a particular topic. – David Parks Apr 12 '18 at 15:40
  • Thanks for the response. I will open a new question. – CodeDezk Apr 12 '18 at 15:53
  • I have posted the question https://stackoverflow.com/questions/49802119/tensorflow-auto-encoder-with-multi-feature-text-data – CodeDezk Apr 12 '18 at 17:40