
I hope you guys are doing great.

I'm training a classifier with Facebook's FastText to determine whether a piece of text (a tweet) is talking about the economy or not. For this task I have about 2,200 tweets tagged as "economy" or "not_economy", but I also have almost a million unlabeled tweets.

Reading FastText's documentation, I know the supervised input file should be a plain-text document with one tweet per line, each line prefixed with `__label__economy` or `__label__not_economy`.
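For example, the first few lines of my input file look something like this (the tweet texts here are just made-up placeholders):

```
__label__economy The central bank raised interest rates again today
__label__not_economy What a match last night, still can't believe that final
__label__economy Inflation data for this month comes out tomorrow
```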

The documentation doesn't mention adding unlabeled documents to the supervised input file, but since FastText is a word embedding model, it's supposed to pick up context information from the distribution of words in the text, so I figured that giving the model all this extra data should help it build a better embedding representation of my vocabulary. For this reason I'm training the model with `fasttext supervised -input tweets_input -output tweets_model`, but also appending the untagged documents at the end of the input file. The thing is, these almost 1M extra tweets don't seem to be enhancing the model at all.

The other way I know I could take advantage of this data is to train an unsupervised model and then use its sentence embeddings to train a separate classifier.
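Here's a minimal sketch of what I mean by that second approach, assuming the official `fasttext` Python bindings and scikit-learn (the file names, the tab-separated label format, and the choice of logistic regression are just placeholders):

```python
import fasttext
from sklearn.linear_model import LogisticRegression

# Train unsupervised embeddings on all ~1M tweets (one tweet per line, no labels).
emb_model = fasttext.train_unsupervised("all_tweets.txt", model="skipgram")

# Build sentence-vector features for the ~2,200 labeled tweets.
texts, labels = [], []
with open("labeled_tweets.tsv") as f:  # hypothetical format: label<TAB>tweet text
    for line in f:
        label, text = line.rstrip("\n").split("\t", 1)
        labels.append(label)
        texts.append(text)

X = [emb_model.get_sentence_vector(t) for t in texts]

# Any external classifier can sit on top of the embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
```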

The question is the one in the title:

Do documents without labels add information to Facebook's FastText supervised classifier? Or is it better to get the document embeddings and train my own classifier with another library?

Thanks for any information that helps me understand this better.

Leandro D.
  • Doesn't it expect a label to be declared at the beginning of each text line? How are you feeding these unlabeled docs to the algorithm? (Are you sure they're not being treated as texts with some default/plug/empty-string label?) – gojomo Oct 08 '21 at 04:46
  • Yes, they expect a `__label__` at the beginning, but I just didn't put any label in them. Then I tried printing all the labels using `model.labels` and I'm getting only two: `economy` and `not_economy` – Leandro D. Oct 10 '21 at 16:02
  • I suspect, then, that the code is ignoring those lines entirely. The `supervised` mode replaces the usual word->word predictive neural network used in training with just word->label. With no labels, each text would just be a training no-op. – gojomo Oct 10 '21 at 17:24

1 Answer


You can't use untagged documents to train the supervised model, because they lack labels.

You can try this idea: