I am trying to compute document embeddings as the average of word embeddings, using GloVe word vectors. I have already tokenized the documents.
import numpy as np

def avg_word_emb(documents):
    docs_array = np.zeros(len(documents))        # one value per document
    for i, doc in enumerate(documents):
        array = np.zeros(len(doc))               # one value per word
        for idx, word in enumerate(doc):
            vec = word_emb[word]                 # GloVe vector for this word
            mean = np.mean(vec, axis=0)          # mean over the vector's components
            array[idx] = mean
        avg_vec = np.sum(array, axis=0) / len(array)
        docs_array[i] = avg_vec
    return docs_array
training_array = avg_word_emb(df_training['final_doc'])
training_y = df_training['label'].to_numpy()
test_array = avg_word_emb(df_test['final_doc'])
test_y = df_test['label'].to_numpy()
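The classification step itself is not shown above; roughly, it looks like this (just a sketch: the actual classifier does not matter here, scikit-learn's LogisticRegression and the names X_train/X_test are used purely for illustration):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# fit() expects X of shape (n_samples, n_features); reshape in case the arrays are 1-D
X_train = training_array.reshape(len(training_array), -1)
X_test = test_array.reshape(len(test_array), -1)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, training_y)
print(accuracy_score(test_y, clf.predict(X_test)))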
df_training['final_doc'] is a column of tokenized sentences (documents); each word in it is used to look up the corresponding GloVe word vector. There are a few thousand sentences. Using the word embeddings, the representation of each document is defined as the mean of the vectors of that document's words. In particular, given a document $d$ consisting of words $\left[ v_1, v_2, ..., v_{|d|} \right]$, the document representation $\mathbf{e}_d$ is defined as:
$\mathbf{e}_d = \frac{1}{|d|}\sum_{i=1}^{|d|}{\mathbf{e}_{v_i}}$
where $\mathbf{e}_{v}$ is the vector of the word $v$, and $|d|$ is the length of the document.
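To make sure I understand the formula, this is what I think it should amount to in code (just a sketch; the function name doc_mean_vectors, emb_dim=300 as the GloVe dimension, and the assumption that every token is present in word_emb are only for illustration):

def doc_mean_vectors(documents, emb_dim=300):
    # one row per document, one column per embedding dimension
    doc_matrix = np.zeros((len(documents), emb_dim))
    for i, doc in enumerate(documents):
        word_vecs = np.array([word_emb[w] for w in doc])  # shape (|d|, emb_dim)
        doc_matrix[i] = word_vecs.mean(axis=0)            # average over the words
    return doc_matrix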
When I use this word embedding representation for classification I get an accuracy of about 20%, which is very low; without the embedding representation I get 70-80% accuracy. Is something wrong with my word embedding code? Are there other possible reasons? Any help is appreciated!