Does stemming harm precision in text classification?

Question

I have read that stemming harms precision but improves recall in text classification. How does that happen? When you stem, you increase the number of matches between the query and the sample documents, right?

I am unsure whether it will make a difference for the answers, but at least to me it is not clear whether you refer to information retrieval (given that you mention _queries_), or text classification (given that that is mentioned in the title). — jogojapan, Apr 29 '12 at 11:02

score 13 · Accepted Answer · edited Jan 22 '13 at 02:25

It's always the same: if you raise recall, you're generalising, and because of that you lose precision. Stemming merges different word forms together.

On the one hand, words which ought to be merged together (such as "adhere" and "adhesion") may remain distinct after stemming; on the other, words which are really distinct may be wrongly conflated (e.g., "experiment" and "experience"). These are known as understemming errors and overstemming errors respectively.
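
The two error types can be reproduced with a deliberately crude suffix stripper. This is only a sketch: `toy_stem` and its suffix list are made up for illustration and are not the Porter algorithm or any library's stemmer.

```python
# Hypothetical toy stemmer, for illustration only (NOT a real stemming algorithm).
SUFFIXES = ("ence", "ment", "ion", "e")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Overstemming: two unrelated words collapse to the same stem.
print(toy_stem("experiment"), toy_stem("experience"))

# Understemming: two related words keep different stems.
print(toy_stem("adhere"), toy_stem("adhesion"))
```

With this suffix list, "experiment" and "experience" end up with the same stem, while "adhere" and "adhesion" do not.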

Overstemming lowers precision and understemming lowers recall. With no stemming at all there are no overstemming errors but the maximum number of understemming errors, so recall is low and precision is high.

By the way, precision is the fraction of the 'documents' you found that are ones you were looking for. Recall is the fraction of all correct 'documents' that you actually found.
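
As a minimal sketch of these two definitions, the snippet below computes both measures for two result sets. The document IDs and the result sets are invented for illustration; they only show the kind of shift that overstemming can cause, not measured behaviour.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for a set of retrieved document IDs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical corpus: documents 1, 2 and 3 are the relevant ones.
relevant = {1, 2, 3}

# Without stemming, only the exact word form matches (assumed result set).
exact_match = {1}

# With stemming, more word forms match, including two conflated
# but irrelevant documents (assumed result set).
stemmed_match = {1, 2, 3, 4, 5}

print("no stemming:", precision_recall(exact_match, relevant))
print("stemming:   ", precision_recall(stemmed_match, relevant))
```

In this made-up case the unstemmed search has precision 1 and recall 1/3, while the stemmed search has precision 3/5 and recall 1.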

So if someone is doing stemming what are the things he will be expecting other than reducing the size of the dictionary? — samsamara, Jun 25 '12 at 06:44

score -1 · Answer 2 · answered Apr 29 '12 at 03:38

From the wikipedia entry on Query_expansion:

By stemming a user-entered term, more documents are matched, as the alternate word forms for a user entered term are matched as well, increasing the total recall. This comes at the expense of reducing the precision. By expanding a search query to search for the synonyms of a user entered term, the recall is also increased at the expense of precision. This is due to the nature of the equation of how precision is calculated, in that a larger recall implicitly causes a decrease in precision, given that factors of recall are part of the denominator. It is also inferred that a larger recall negatively impacts overall search result quality, given that many users do not want more results to comb through, regardless of the precision.

ditkin



I can't see how Wikipedia's reasoning is correct here. If there were b documents in the search result _before_ stemming, and a of them were relevant, precision was a/b. Now if by virtue of stemming c documents are added to the result set, and assume that all of them are actually relevant, then precision becomes (a+c)/(b+c). Since a<=b, this is at least as large as the original precision a/b (and strictly larger when a<b and c>0). – jogojapan Apr 29 '12 at 10:58

I think Wikipedia is right. I'm focusing on IR in particular; I'm not sure whether this applies to text classification. Consider the query "news": if the stemmer overstems it to "new", recall is maintained or even increased, but precision will certainly suffer (since 'news' and 'new' then share the same stem, assuming the stemmer behaves that way). Such cases involve overstemming or incorrect stemming, but also genuinely ambiguous words. With stemming, recall is likely increased or maintained, while precision is likely decreased or maintained. – Kenston Choi Apr 30 '12 at 04:03

@Kenston What you say is right, but what Wikipedia says is still wrong. You are talking about an instance of stemming that introduces ambiguity. That certainly reduces precision. But Wikipedia claims that any increase in recall _must_ imply a decrease in precision due to the way it is defined (_"the nature of the equation"_). That is wrong. If, as a result of stemming, only (or mostly) relevant documents are added to the result set, precision will not decrease. It can even increase. – jogojapan May 01 '12 at 06:31
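
The condition under which precision survives an enlarged result set can be worked out directly. In the sketch below, a and b are the relevant and total documents before stemming, c is the number of documents stemming adds, and r (a symbol introduced here, not used in the thread) is the number of those added documents that are relevant, with b > 0 and c > 0.

```latex
\[
\frac{a + r}{b + c} \;\ge\; \frac{a}{b}
\;\iff\; b\,(a + r) \;\ge\; a\,(b + c)
\;\iff\; b\,r \;\ge\; a\,c
\;\iff\; \frac{r}{c} \;\ge\; \frac{a}{b}
\]
```

So precision does not drop as long as the added documents are relevant at a rate at least as high as the original precision. The special case r = c (all added documents relevant) gives (a+c)/(b+c), which can never be below a/b.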

@jogojapan, I agree! It seems the Wikipedia article implies that an increase in recall necessarily drags precision down. While that is something to expect and watch out for, it only happens in certain cases, and precision can even increase. – Kenston Choi May 01 '12 at 10:22
