I have been trying to build TF-IDF embeddings, but my corpus isn't small: roughly 300k~500k documents, and the maximum input length I would set is 450. I found out that I can handle a large sparse matrix with scikit-learn's HashingVectorizer, but I expect that to take quite a lot of time on the CPU. Then I came across the cuML library and thought it was exactly what I wanted! However, as far as I can tell, cuML's TfidfVectorizer has no tokenizer parameter, which means I can't use my custom tokenizer. Because my corpus consists of Korean documents, I need either a custom tokenizer or a way to handle this in the preprocessing step (see the sketch below).
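The workaround I currently have in mind is to tokenize on the CPU first, rejoin the tokens with spaces, and then let cuML's default whitespace splitting reproduce my tokenization. This is just a minimal sketch of that idea, assuming konlpy's Mecab backend is installed (any Korean tokenizer would work the same way) and a working RAPIDS setup:

```python
import cudf
from cuml.feature_extraction.text import TfidfVectorizer
from konlpy.tag import Mecab  # assumption: Mecab stands in for my tokenizer

mecab = Mecab()

def pre_tokenize(doc: str) -> str:
    # Tokenize the Korean text, then rejoin with spaces so that cuML's
    # default whitespace splitting reproduces the custom tokenization.
    return " ".join(mecab.morphs(doc))

docs = ["자연어 처리는 재미있다", "토크나이저가 필요하다"]  # toy corpus
tokenized = cudf.Series([pre_tokenize(d) for d in docs])

vec = TfidfVectorizer()           # no tokenizer parameter needed anymore
X = vec.fit_transform(tokenized)  # sparse TF-IDF matrix on the GPU
```

My worry is that the CPU-side tokenization loop over 300k~500k documents might cancel out the GPU speedup. Is this the right approach, or is there a better way?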
Can you give me some advice?
I have searched Google and looked through the RAPIDS documentation and source code, but I still can't solve the problem ㅜㅜ
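For reference, if the GPU route turns out to be impossible, this is the CPU fallback I would settle for: sklearn's HashingVectorizer does accept a custom tokenizer, and chaining it with TfidfTransformer gives TF-IDF weights without holding a vocabulary in memory. Again only a sketch, with Mecab as a stand-in for my actual tokenizer:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from konlpy.tag import Mecab  # assumption: any callable Korean tokenizer works

mecab = Mecab()

tfidf = make_pipeline(
    HashingVectorizer(
        tokenizer=mecab.morphs,  # custom Korean tokenizer
        token_pattern=None,      # silence the unused token_pattern warning
        n_features=2**20,        # hashing keeps memory bounded
    ),
    TfidfTransformer(),
)

docs = ["자연어 처리는 재미있다", "토크나이저가 필요하다"]  # toy corpus
X = tfidf.fit_transform(docs)  # scipy CSR sparse matrix
```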