
I am trying to calculate document embeddings as the average of word embeddings. I use GloVe word embeddings, and I have already tokenized the documents.

import numpy as np

def avg_word_emb(documents):
    # one entry per document
    docs_array = np.zeros(len(documents))

    for i, doc in enumerate(documents):
        # one entry per word in the document
        array = np.zeros(len(doc))
        for idx, word in enumerate(doc):
            # look up the GloVe vector for this word
            vec = word_emb[word]
            # average the components of the word vector into a single scalar
            mean = np.mean(vec, axis=0)
            array[idx] = mean
        # average the per-word scalars to get one number for the document
        avg_vec = np.sum(array, axis=0) / len(array)
        docs_array[i] = avg_vec
    return docs_array

training_array = avg_word_emb(df_training['final_doc'])
training_y = df_training['label'].to_numpy()


test_array = avg_word_emb(df_test['final_doc'])
test_y = df_test['label'].to_numpy() 

df_training['final_doc'] is a DataFrame column of tokenized sentences (documents); I take each word from it and use it to look up the corresponding GloVe word vector. There are thousands of sentences. Using the word embeddings, the representation of each document is defined as the mean of the vectors of that document's words. In particular, given the document $d$, consisting of words $\left[ v_1, v_2, ..., v_{|d|} \right]$, the document representation $\mathbf{e}_d$ is defined as:

$\mathbf{e}_d = \frac{1}{|d|}\sum_{i=1}^{|d|}{\mathbf{e}_{v_i}}$

where $\mathbf{e}_{v}$ is the vector of the word $v$, and $|d|$ is the length of the document.
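For concreteness, a minimal sketch of this formula in NumPy, assuming word_emb maps each token to a fixed-length vector (e.g. 100-dimensional GloVe) and that out-of-vocabulary words are simply skipped:

import numpy as np

def doc_embedding(doc, word_emb, dim=100):
    # Stack the word vectors into a (|d|, dim) matrix and average over the
    # word axis, so the result keeps the full embedding dimension.
    vecs = np.array([word_emb[word] for word in doc if word in word_emb])
    if len(vecs) == 0:
        return np.zeros(dim)
    return vecs.mean(axis=0)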

When I use this word embedding representation for classification, I get an accuracy of 20%, which is very low. Without this embedding representation I get 70-80% accuracy. Is my word embedding computation wrong? Are there other possible reasons? Any help is appreciated!

Gunners
  • This looks like Python, but you should tag the language that your question is about. Since it seems to be more about the analysis than the code, it might be better suited on [stats.se] or [datascience.se]. Beyond that, reread [ask], since there's currently no example to work from in order to test the accuracy – camille Nov 29 '22 at 20:58

0 Answers