
I want to try and determine the characteristics of a user's personality based on the words they input into a search box. Here's an example:

Search term: "computers"

Personality/descriptors detected: analytical, logical, systematic, methodical


I understand that this task is extremely non-trivial. I have used WordNet before, but I'm not sure whether it includes adjective clouds for each noun node. Part-of-speech tagging is a beast of its own, so I'm not sure that building my own corpus and searching for adjective term frequencies that co-occur with keywords is the best idea, but I'll explain it below.

I am currently working with a Wikipedia dump, processing each article for term frequency after removing stop words (and, or, of, to, a, etc.). My thought was to search for the co-occurrence of adjectives (using WordNet for POS tagging) and nouns throughout the corpus (e.g. the adjective "logical" often co-occurs with the noun "computer"), and, based on the relative stemmed-adjective frequency, judge whether or not the adjective is semantically related to the noun. The potential applications are immense.
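
Roughly, the counting step I have in mind looks like this (a sketch only - I'm assuming NLTK here for tokenization and tagging rather than WordNet's own data, and iter_articles() is just a placeholder for however the dump is streamed):

```python
# Sketch: count adjective/noun co-occurrence within a small window.
from collections import Counter, defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()
WINDOW = 5  # max token distance for an adjective/noun pair to count

cooccur = defaultdict(Counter)  # noun stem -> Counter of adjective stems

def iter_articles():
    """Placeholder: yield raw article text from the Wikipedia dump."""
    raise NotImplementedError

for text in iter_articles():
    tokens = [t.lower() for t in nltk.word_tokenize(text)
              if t.isalpha() and t.lower() not in STOP]
    tagged = nltk.pos_tag(tokens)
    for i, (word, tag) in enumerate(tagged):
        if not tag.startswith("NN"):        # nouns only
            continue
        noun = stemmer.stem(word)
        for other, other_tag in tagged[max(0, i - WINDOW):i + WINDOW + 1]:
            if other_tag.startswith("JJ"):  # adjectives only
                cooccur[noun][stemmer.stem(other)] += 1

# cooccur["comput"].most_common(10) -> adjectives seen most often near "computer(s)"
```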


Another idea is to stem the noun, search for adjectives that begin with that stem, then search for synonyms of that adjective. Example:

Search term: "computers"

Stem: "comput-"

Adjectives with stem: computational

Synonyms: ???
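
In code, the lookup might go something like this (a sketch using NLTK's WordNet interface; scanning every adjective synset is slow, so a precomputed index would be better in practice):

```python
# Sketch: stem the noun, find adjectives sharing the stem, then pull their synonyms.
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def adjective_synonyms(noun):
    stem = stemmer.stem(noun)  # "computers" -> "comput"
    adjectives = {lemma.name()
                  for synset in wn.all_synsets(pos=wn.ADJ)   # slow full scan
                  for lemma in synset.lemmas()
                  if lemma.name().startswith(stem)}
    synonyms = {other.name()
                for adj in adjectives
                for synset in wn.synsets(adj, pos=wn.ADJ)
                for other in synset.lemmas()}
    return adjectives, synonyms

print(adjective_synonyms("computers"))  # e.g. ({'computational', ...}, {...})
```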


The problem is that nouns don't always have adjective forms, and some noun stems will match horribly wrong adjectives. A *bad* example:

Search term: "running" (technically a gerund, but still a noun)

Stem: "run-"

Adjectives with stem: runny

Synonyms: NOT THE WORDS I WANT. I would like to find words like "athletic", "motivated", "disciplined".


Is this something that has been done before? Do you have suggestions regarding how I might approach this? It's almost as if I'm seeking to generate adjective clouds for the "important" words in a document.

EDIT: I realize that there is no "correct" answer to this problem. I will award the bounty to whoever presents the method with the best theoretical potential.

Jon

2 Answers


WordNet doesn't have what you need - it contains (almost) no information about relations between words that aren't synonyms or aren't linked hierarchically (chair -> furniture), etc.

Just use OpenNLP (http://opennlp.apache.org) and parse large amounts of text - the OpenNLP parser will detect verb-adjective / noun-adjective pairs in sentences, allowing you to build a relations database. All that is left at that point is to filter the database against a predefined list of adjectives.
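
For illustration, the filtering step could be as simple as the following (the counts below are placeholders - in practice they come from the OpenNLP pass, and the trait list is one you build by hand):

```python
# Sketch: filter a noun -> adjective co-occurrence table down to personality traits.
from collections import Counter

personality_adjectives = {"analytical", "logical", "systematic", "methodical",
                          "athletic", "motivated", "disciplined"}  # hand-built list

# Placeholder data; the real table is built from the OpenNLP noun-adjective pairs.
relations = {"computer": Counter({"fast": 120, "logical": 41, "analytical": 17, "grey": 3})}

MIN_COUNT = 10  # acceptance threshold to drop the noisy long tail

def personality_profile(noun):
    counts = relations.get(noun, Counter())
    return [(adj, n) for adj, n in counts.most_common()
            if adj in personality_adjectives and n >= MIN_COUNT]

print(personality_profile("computer"))  # [('logical', 41), ('analytical', 17)]
```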

c2h5oh
  • Interesting. I hadn't come across this. Could you clarify what you mean regarding filtering the database against a list of adjectives? – Jon Jul 19 '12 at 23:00
  • Building a noun-adjective database from large quantities of text will leave you with a very large number of adjectives per noun. Granted, in many cases there will be a number of high-frequency ones, but the long tail will be largely unusable unless you match it against a predefined list of adjectives you are interested in: using your example - keep personality traits, drop physical descriptions ("tall", "athletic", etc.). – c2h5oh Jul 19 '12 at 23:18
  • Well, that's kind of what I was alluding to in my original post - using an acceptance threshold for adjective frequencies (i.e. only accept adjectives that co-occur with a noun more than X times). Perhaps I should start by building a general personality-trait adjective list (a couple hundred words) and then searching for those. I'm going to leave this open for a little while longer to see what other people might suggest, but this is a good starting point. – Jon Jul 20 '12 at 00:07
  • The main advantage of using OpenNLP is that it "understands" sentences - it is able to match adjectives to the nouns they refer to, not just detect all adjectives in the sentence. This will significantly reduce noise in the relations database - you should still use an acceptance threshold to get rid of false (or coming from bizarre text ;-) ) matches. – c2h5oh Jul 20 '12 at 10:59

Assuming you have some hefty computational resources to throw at this, I would suggest using something simple like Hyperspace Analog of Language (HAL) to build up a Term X Term matrix for your dump of Wikipedia. Then, your algorithm could be something like:

  • Given a query word/term, find its (HAL) vector.
  • For the vector, find the adjective components with the highest weights.
    • To do this efficiently, you would probably want to use a dictionary (like WordNet) to preprocess your list of terms (i.e., those extracted by HAL) so that you know (prior to processing queries) which ones could be used as adjectives.
  • For each adjective, find the N most similar vectors in your HAL space.
    • Optional: You could narrow this list down by looking for words that co-occur across your search terms.

This approach basically trades off memory and computational efficiency for simplicity in terms of code and data structures. Yet, it should do pretty well for what I think you want. The first step will give you adjectives that are most commonly associated with the query term, while the vector similarity in the HAL space (step 3) will give words that are paradigmatically related (roughly, can be substituted for one another, so if you start with an adjective of a certain sort, you should get more adjectives "like it" in terms of its relationship with the query term), which should be a fairly good proxy for the "cloud" you are looking for.
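
For illustration, a stripped-down version of that pipeline might look like the following (dense and symmetric to keep it short - a real HAL implementation keeps left/right context separate and uses sparse storage - with WordNet standing in as the adjective dictionary):

```python
# Sketch: HAL-style term x term co-occurrence matrix, then the three query steps.
from collections import defaultdict

import numpy as np
from nltk.corpus import wordnet as wn

WINDOW = 10  # HAL window size; weight falls off with distance

def build_hal(tokenized_docs):
    """tokenized_docs: iterable of token lists. Returns (matrix, vocab)."""
    vocab, pairs = {}, defaultdict(float)
    for tokens in tokenized_docs:
        ids = [vocab.setdefault(t, len(vocab)) for t in tokens]
        for i, w in enumerate(ids):
            for dist, c in enumerate(ids[i + 1:i + 1 + WINDOW], start=1):
                weight = WINDOW - dist + 1       # closer words count more
                pairs[(w, c)] += weight
                pairs[(c, w)] += weight
    M = np.zeros((len(vocab), len(vocab)))
    for (r, c), v in pairs.items():
        M[r, c] = v
    return M, vocab

def is_adjective(word):
    return any(s.pos() in ("a", "s") for s in wn.synsets(word))

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def adjective_cloud(query, M, vocab, top_adj=5, top_similar=5):
    words = list(vocab)
    # Steps 1-2: adjective components of the query vector with the highest weights.
    adjs = sorted((w for w in words if is_adjective(w) and M[vocab[query], vocab[w]] > 0),
                  key=lambda w: M[vocab[query], vocab[w]], reverse=True)[:top_adj]
    # Step 3: for each adjective, the most similar vectors in the HAL space.
    return {adj: [w for _, w in sorted(((cosine(M[vocab[adj]], M[vocab[w]]), w)
                                        for w in words if w != adj),
                                       reverse=True)[:top_similar]]
            for adj in adjs}
```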

Turix
  • This sounds pretty good. I'm working with a couple of other guys, and our computational resources aren't as impressive as they would need to be to accomplish something like this, but we have toyed with the notion of using some of Amazon's high-performance computer clusters for more intensive processing like this. Our machine just finished a 16-hour run to process an entire Wikipedia dump and compute term frequencies on each word in the raw data. 20GB dump -> 100GB in the database. – Jon Jul 23 '12 at 18:13
  • @Jon Thanks for the bounty! Well 16 hours still beats the speed from when I was attempting to do something similar ~10 years ago. Good luck! – Turix Jul 24 '12 at 00:07