
I'm trying to build tf-idf embeddings, but my corpus isn't small: it contains roughly 300k–500k documents, and the maximum input length I would set is 450. I learned that I can handle a large sparse matrix with sklearn's HashingVectorizer, but I expect it to take quite a lot of time. When I discovered the cuML library, I thought it was exactly what I wanted! However, as far as I can tell, the tf-idf vectorizer in cuML has no tokenizer parameter, which means I can't use my custom tokenizer. Because my corpus consists of Korean documents, I need a custom tokenizer, or some way to handle this in a preprocessing step.
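On the CPU side, what I have in mind looks roughly like this (a sketch, not my exact code; the tokenizer below is just a whitespace placeholder for a real Korean morphological analyzer such as Mecab via KoNLPy):

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

def korean_tokenizer(text):
    # placeholder: in real code this would call a Korean morphological
    # analyzer (e.g. Mecab through KoNLPy); whitespace split is a stand-in
    return text.split()

docs = ["안녕하세요 세계", "안녕하세요 파이썬"]

# HashingVectorizer never stores a vocabulary, so memory stays bounded even
# for 300k-500k documents; alternate_sign=False keeps counts non-negative
# so that TfidfTransformer can apply the idf weighting afterwards.
hv = HashingVectorizer(
    tokenizer=korean_tokenizer,
    token_pattern=None,      # silence the "tokenizer overrides token_pattern" warning
    alternate_sign=False,
    n_features=2**18,
)
counts = hv.transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (2, 262144)
```

This works, but it is the CPU path I was hoping to avoid.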

Can you give me some advice?

I have searched Google and looked through the RAPIDS documentation and code, but I still can't solve the problem ㅜㅜ
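One workaround I'm considering (a sketch under the assumption that cuML's vectorizer, like sklearn's default, splits on whitespace-delimited tokens): pre-tokenize the Korean text on the CPU and re-join the morphemes with spaces, so the GPU vectorizer's default splitting recovers exactly my tokens. `korean_tokenize` below is again a placeholder for a real analyzer.

```python
def korean_tokenize(text):
    # placeholder: replace with real morpheme analysis (e.g. Mecab/KoNLPy)
    return text.split()

def pretokenize(docs):
    # join morphemes with single spaces; a downstream whitespace-based
    # vectorizer then sees exactly one feature per morpheme
    return [" ".join(korean_tokenize(doc)) for doc in docs]

docs = ["안녕하세요 세계", "GPU에서 tf-idf 계산"]
prepared = pretokenize(docs)

# then hand the prepared strings to the GPU (assuming cuML is installed;
# exact API per the RAPIDS docs):
# import cudf
# from cuml.feature_extraction.text import TfidfVectorizer
# matrix = TfidfVectorizer().fit_transform(cudf.Series(prepared))
```

Would this kind of preprocessing step be a reasonable way around the missing tokenizer parameter?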

Tae-su
cuML's TfidfVectorizer does not support custom tokenizers. It's much harder to efficiently support arbitrary callable functions that process strings on the GPU. If this is important to you, could you please comment on [this GitHub issue](https://github.com/rapidsai/cuml/issues/5104)? – Nick Becker Feb 24 '23 at 14:53

0 Answers