
I am currently working on a machine learning project and am in the process of building the dataset. The dataset will be composed of a number of different textual features, varying in length from one sentence to around 50 sentences (including punctuation). What is the best way to store this data so that I can then pre-process it and use it for machine learning in Python?

Toby

2 Answers


One common way is to create a dictionary (of all the possible words) and then encode each of your examples against this dictionary. For example (this is a very small and limited dictionary, just for illustration), you could have the dictionary: hello, world, from, python. Each word is associated with a position, and each example is encoded as a vector with 0 for absence and 1 for presence. For instance, the example "hello python" would be encoded as: 1, 0, 0, 1
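A minimal sketch of this idea in plain Python, using the toy four-word dictionary from above:

# Map each dictionary word to a fixed position in the vector.
vocabulary = ['hello', 'world', 'from', 'python']
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def encode(sentence):
    """Return a binary vector: 1 if a dictionary word occurs, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for token in sentence.lower().split():
        if token in word_to_index:
            vector[word_to_index[token]] = 1
    return vector

print(encode('hello python'))  # [1, 0, 0, 1]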

Luis Leal

In most cases you can use a method called Bag of Words. However, for more complicated tasks such as similarity extraction, or when you want to compare sentences, you should use Word2Vec.

Bag of Words

You may use the classical Bag-of-Words representation, in which you encode each sample as a long vector holding the counts of all the words seen across all samples. For example, if you have two samples:

"I like apple, and she likes apple and banana.",

"I love dogs but Sara prefer cats.".

Then all the possible words are (order doesn't matter here):

I she Sara like likes love prefer and but apple banana dogs cats , .

Then the two samples will be encoded as:

First:  1 1 0 1 1 0 0 2 0 2 1 0 0 1 1
Second: 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1

If you are using sklearn, the task would be as simple as:

from sklearn.feature_extraction.text import CountVectorizer

# Each document in the corpus is one sample.
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Learn the vocabulary and build the count matrix in one step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Now you can feed X into any other machine learning algorithm.
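To inspect what the vectorizer learned, you can print the vocabulary and the dense matrix (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):

print(vectorizer.get_feature_names_out())
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
print(X.toarray())
# Each row is one document; each column holds the count of one vocabulary word.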

Word2Vec

Word2Vec is a more complicated method, which attempts to find the relationships between words by training an embedding neural network underneath. An embedding, in plain English, can be thought of as the mathematical representation of a word in the context of all the samples provided. The core idea is that words are similar if their contexts are similar.

The result of Word2Vec is a vector representation (embedding) for every word that appears in the samples. The amazing thing is that we can perform arithmetic operations on these vectors. A famous example: King - Man + Woman ≈ Queen.

To use Word2Vec, we can use a package called gensim. Here is a basic setup:

from gensim.models import Word2Vec

# gensim >= 4.0: `size` was renamed to `vector_size`, and `most_similar` lives on `model.wv`.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
# [('queen', 0.50882536), ...]

Here sentences is your data, an iterable of tokenized sentences. vector_size is the dimensionality of the embeddings: the larger it is, the more capacity there is to represent each word, but also the more risk of overfitting. window is the size of the context we care about: the maximum distance between the target word and the surrounding words used to predict it during training. min_count drops words that appear fewer than that many times, and workers is the number of training threads.
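Note that sentences must be a list of token lists, not raw strings. A minimal sketch of preparing raw text with gensim's simple_preprocess tokenizer (the sample texts here are made up, and min_count=1 is used only so this tiny toy corpus isn't filtered away):

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

raw_texts = [
    "I like apple, and she likes apple and banana.",
    "I love dogs but Sara prefer cats.",
]
# simple_preprocess lowercases, strips punctuation, and tokenizes each string.
sentences = [simple_preprocess(text) for text in raw_texts]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(model.wv['apple'])  # the 100-dimensional embedding for "apple"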

xtt
  • I need to download a load of data from PubMed first and plan on doing textual pre-processing, how can I store this first? – Toby Nov 22 '16 at 17:06
  • If the data you are downloading is like [this](https://www.ncbi.nlm.nih.gov/protein/BAC80069.1), I assume you can just download it as text files of any format; then, when you need it, you can [load them up](http://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python) using `file.readline()` to read the text line by line – xtt Nov 22 '16 at 18:49
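For illustration, a minimal sketch of that save/load step (the file name pubmed_docs.txt and the sample strings are hypothetical):

# Save each downloaded record as plain text, one document per line.
docs = ["first abstract ...", "second abstract ..."]
with open("pubmed_docs.txt", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(doc + "\n")

# Later, load the documents back as a list of strings for pre-processing.
with open("pubmed_docs.txt", encoding="utf-8") as f:
    corpus = [line.strip() for line in f]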