I have a list containing millions of sentences for which I need embeddings, and I am using Flair for this. The task seems embarrassingly parallel, but when I try to parallelize it I either see no speedup at all or the run stalls entirely.
I define my sentences as a simple list of strings:
texts = [
"this is a test",
"to see how well",
"this system works",
"here are alot of words",
"many of them",
"they keep comming",
"many more sentences",
"so many",
"some might even say",
"there are 10 of them",
]
I use Flair to create the embeddings:
from flair.embeddings import SentenceTransformerDocumentEmbeddings
from flair.data import Sentence
sentence_embedding = SentenceTransformerDocumentEmbeddings("bert-base-nli-mean-tokens")
def sentence_to_vector(sentence):
    sentence_tokens = Sentence(sentence)
    sentence_embedding.embed(sentence_tokens)
    return sentence_tokens.get_embedding().tolist()
I tried both joblib and concurrent.futures to run the embedding in parallel:
import time
from joblib import Parallel, delayed
import concurrent.futures
def parallelize(iterable, func):
    return Parallel(n_jobs=4, prefer="threads")(delayed(func)(i) for i in iterable)
print("start embedding sequentially")
tic = time.perf_counter()
embeddings = [sentence_to_vector(text) for text in texts]
toc = time.perf_counter()
print(toc - tic)
print("start embedding parallel, w. joblib")
tic = time.perf_counter()
embeddings = parallelize(texts, sentence_to_vector)
toc = time.perf_counter()
print(toc - tic)
print("start embedding parallel w. concurrent.futures")
tic = time.perf_counter()
with concurrent.futures.ProcessPoolExecutor() as executor:
    embeddings = [executor.submit(sentence_to_vector, text) for text in texts]
toc = time.perf_counter()
print(toc - tic)
The joblib version runs, but it is slower than the plain sequential loop. The concurrent.futures version spins up a number of worker processes but then hangs indefinitely.
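One thing I noticed independently of the hang: executor.submit returns Future objects, so even if the pool finished, embeddings would be a list of futures rather than vectors, and the results would still need to be collected with .result() or executor.map. A stdlib-only sketch with a cheap stand-in function (the hypothetical fake_embed below, not the real model) shows the pattern I believe is intended:

```python
import concurrent.futures


def fake_embed(text):
    # Stand-in for sentence_to_vector: word lengths instead of a real embedding.
    return [len(word) for word in text.split()]


texts = ["this is a test", "to see how well"]

# submit() returns Future objects; .result() blocks until each value is ready.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fake_embed, t) for t in texts]
    embeddings = [f.result() for f in futures]

# executor.map collects the results directly and preserves input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    embeddings_via_map = list(executor.map(fake_embed, texts))

print(embeddings == embeddings_via_map)  # True
```

(I used ThreadPoolExecutor here only so the toy runs anywhere; with ProcessPoolExecutor everything passed to the workers, including the model held by sentence_to_vector, must be picklable, which I suspect is related to the hang.)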
Any solutions or hints in the right direction would be much appreciated.