
I am currently trying to construct word clouds comparing two tidy text documents (tweets). My question is rather methodological, although I am using bind_tf_idf in R for the analysis.

The basic problem is that, by definition, idf penalises any word that appears in both documents, producing a tf-idf of 0 for them. This is far too strict for my analysis and makes the word clouds ranked by tf-idf consist only of document-unique terms that are not illustrative at all:

[Word clouds of the two documents, ranked by tf-idf]
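For illustration, here is a minimal example (with hypothetical data, not my actual tweets) showing why this happens with bind_tf_idf: with two documents, any term present in both gets idf = ln(2/2) = 0.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the two tweet corpora
tweets <- tibble(
  doc  = c("A", "A", "A", "B", "B", "B"),
  word = c("economy", "vote", "shared", "climate", "vote", "shared")
)

tweets %>%
  count(doc, word, name = "n") %>%
  bind_tf_idf(word, doc, n)
# "vote" and "shared" appear in both documents, so their idf (and hence tf_idf)
# is 0; only the document-unique words keep a non-zero score.
```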

My question then is whether there is a way to relax the tf-idf method in this aspect. That is, I'd still like to penalise terms that are very common between my two documents but not to eliminate them completely just because they appear in both.

My intuition would be to play around with the idf formula, but there might be another existing method that has considered this problem.
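For example, something along these lines is what I have in mind: a sketch that computes a smoothed idf by hand instead of through bind_tf_idf (this mirrors the add-one smoothing used by scikit-learn's smooth_idf, idf = ln((1 + N) / (1 + df)) + 1, and reuses the hypothetical data from above), so shared terms keep a small positive weight instead of being zeroed out.

```r
library(dplyr)

word_counts <- tweets %>%            # `tweets` as in the toy example above
  count(doc, word, name = "n")

n_docs <- n_distinct(word_counts$doc)

word_counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%        # term frequency within each document
  ungroup() %>%
  group_by(word) %>%
  mutate(df = n_distinct(doc)) %>%   # number of documents containing the word
  ungroup() %>%
  mutate(
    idf_smooth    = log((1 + n_docs) / (1 + df)) + 1,
    tf_idf_smooth = tf * idf_smooth
  ) %>%
  arrange(doc, desc(tf_idf_smooth))
# Shared terms are still down-weighted relative to unique ones,
# but no longer drop out of the word clouds entirely.
```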

Thanks in advance.

Luis
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. If this is not a specific programming question and you are just looking to discuss potential analysis methods, then you should instead ask someplace like [stats.se] or [datascience.se] – MrFlick Jun 19 '21 at 17:33
  • Sure thing! I believe the question is rather methodological so I will post it in there. Sorry for bothering! – Luis Jun 19 '21 at 21:02

0 Answers