I am trying to build word clouds that compare two tidy text documents (tweets). My question is mainly methodological, although I am using bind_tf_idf in R for the analysis.
The basic problem is that, by definition, idf penalises any word that appears in both documents, producing a tf-idf of 0 for it. This is far too strict for my analysis: the word clouds ranked by tf-idf end up containing only terms unique to one document, and those are not illustrative at all.
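To make the arithmetic concrete: as far as I understand, bind_tf_idf computes idf as the natural log of the number of documents N divided by the number of documents containing the term, df(t). With only two documents, that leaves just two possible values:

$$
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\ln\frac{2}{2} = 0 \quad (t \text{ in both documents}), \qquad
\ln\frac{2}{1} \approx 0.693 \quad (t \text{ in one document})
$$

Since tf-idf is tf multiplied by idf, every shared term gets a score of exactly 0 no matter how frequent it is.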
My question, then, is whether there is a way to relax the tf-idf method in this respect. That is, I'd still like to penalise terms that are very common to both documents, but not eliminate them completely just because they appear in both.
My intuition would be to play around with the idf formula, but there might be an existing method that has already addressed this problem.
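As an illustration of the kind of relaxation I mean (an assumption on my part, not something bind_tf_idf offers as far as I can tell), a smoothed idf such as the one scikit-learn uses by default adds 1 to both counts and to the result, so a shared term is down-weighted but never zeroed:

$$
\mathrm{idf}_{\text{smooth}}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1, \qquad
\ln\frac{3}{3} + 1 = 1 \quad (t \text{ in both}), \qquad
\ln\frac{3}{2} + 1 \approx 1.405 \quad (t \text{ in one})
$$

With N = 2 this keeps the ordering (terms unique to one document still score higher) while leaving shared terms ranked by their term frequency. Whether this weighting is sensible for only two documents is exactly what I am unsure about.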
Thanks in advance.