
In using the Watson Personality Insights API, I've already noticed some odd trends, including many profiles scored at a mean value across dimensions (e.g., agreeableness with many around .27), making me think the service is imputing a default for something.

Upon review I've noticed a language-misalignment issue (i.e., if the service thinks the text is English, you can get odd results when it's actually, say, Spanish), which has led me to ask, but not find an answer to, the following:

How does Watson handle:

1) URLs in the message (e.g., many Twitter posts contain URLs)?
2) Repeated posts (many channels re-post the same thing many times)?
3) Special characters (many posts contain a lot of random special characters)?

My goal is to determine how much pre-processing I need to do to make Watson most effective.

nerdlyfe

1 Answer


You are correct that if the language is mis-aligned then you will get incorrect results.

The PI API determines the language first from the Content-Language header. If that is missing and the content-type is JSON, it looks at the language fields in the JSON content, selecting the language with the highest number of occurrences. Finally, if that is also missing, it falls back to the default language, namely English.

So in short, the recommendation (which will become a requirement in a future update) is to always send in the Content-Language header.
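To illustrate, here is a minimal sketch of a request that sets the header explicitly. The endpoint URL, version date, and credentials are placeholders, and the contentItems field names should be checked against the current v3 docs for your SDK version:

```python
import requests

# All values below are placeholders -- substitute your own instance URL,
# version date, and credentials.
PI_URL = "https://gateway.watsonplatform.net/personality-insights/api/v3/profile"
USERNAME = "your-service-username"
PASSWORD = "your-service-password"

body = {
    "contentItems": [
        {
            "id": "tweet-001",
            "language": "es",             # per-item language (used as a fallback signal)
            "contenttype": "text/plain",
            "content": "Hola amigos, espero que tengan un buen dia.",
        },
        # ... more items ...
    ]
}

response = requests.post(
    PI_URL,
    params={"version": "2016-10-20"},     # placeholder version date
    json=body,
    auth=(USERNAME, PASSWORD),
    headers={
        "Content-Type": "application/json",
        "Content-Language": "es",         # explicit language: the recommended path
        "Accept": "application/json",
    },
)
response.raise_for_status()
profile = response.json()
print(list(profile.keys()))               # e.g. personality / needs / values
```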

Secondly, to your questions on the content:

- URLs: the service will attempt to remove these. I won't guarantee that it removes every possible option, as the URL spec has some very esoteric variants, but we will remove the common formats.
- Repeated posts: if you send in the same post twice, it will be counted twice. We do no de-duplication in the text that is sent into the service.
- Special characters: I'm assuming you are referring to emojis here. These are included in our processing, as the underlying models were trained on data that included them as well, and thus they are one of the many signals the service uses.
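Since the service does no de-duplication, a minimal client-side pre-processing sketch might look like the following. The regex and normalization choices here are illustrative assumptions, not something the service requires; URL stripping is belt-and-braces since the service already attempts it, while the duplicate check is the part the caller has to handle:

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def preprocess_posts(posts):
    """Strip URLs and drop repeated posts before sending text to the service."""
    seen = set()
    cleaned = []
    for post in posts:
        # Remove URLs, then collapse the leftover whitespace.
        text = re.sub(r"\s+", " ", URL_PATTERN.sub("", post)).strip()
        key = text.lower()                 # case-insensitive duplicate check
        if key and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned

posts = [
    "Check this out https://example.com/abc #promo",
    "Check this out https://example.com/xyz #promo",   # repeat with a different URL
    "Totally different post :)",
]
print(preprocess_posts(posts))
# ['Check this out #promo', 'Totally different post :)']
```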

Neil Boyette
  • Question: given that you have specified the language, how does Watson handle slang? I assume that slang gets "conditionally" parameterized given the language you specify (i.e., "dude" or "homie" will be parameterized differently, and correctly, depending on whether you say the text is Spanish or English)? More curious about the philosophy of needing language identification from the provider. – nerdlyfe Mar 31 '17 at 18:29
  • Regarding the emojis, is it the UTF-8 form or the raw generation via punctuation, i.e., :) vs ☹? Overall, the question is how we should pre-process our data to optimally use Watson. – nerdlyfe May 19 '17 at 23:26
  • The Watson Personality Insights service is trained on a wide variety of Twitter data, and thus you are correct that slang will be treated according to the language specified. At a high level, each trait in each language has its own model, and thus if the training data included some slang, that would get picked up by the models for that language. – Neil Boyette May 22 '17 at 14:28
  • Regarding emojis, you do not need to do any pre-processing. Both forms may be used by the service if it finds a relevant signal (aka you can send in either, and they may get used if the model finds a correlation between it and the trait). – Neil Boyette May 22 '17 at 14:30
  • This is great info. Thank you for the added clarity. This is a very interesting field, we should chat sometime! – nerdlyfe May 22 '17 at 20:55