20

This is a real issue that applies to tagging items in general (and yes, this applies to StackOverflow too, and no, it is not a question about StackOverflow).

Tagging helps cluster similar items, whatever those items may be (jokes, blog posts, SO questions etc). However, there usually (but not strictly) is a hierarchy of tags, meaning that some tags imply other tags. To use a familiar example, the "c#" SO tag also implies ".net"; as another example, in a jokes database, a "blondes" tag implies the "derisive" tag, as do "irish" or "belge" or "canadian" etc depending on the joke's country of origin.

How have you handled this, if you have, in your projects? I will supply an answer describing two different methods I have used in two separate cases (actually, the same mechanism implemented in two different environments), but I am interested not only in similar mechanisms, but also in your opinion on the hierarchy issue.

tzot

3 Answers

7

This is a tough question. The two extremes are an ontology (everything is hierarchical) and a folksonomy (tags have no hierarchy). I have answered this on WikiAnswers, with a reference to Clay Shirky's "Ontology is Overrated" article, which argues that you should impose no hierarchy.

Yuval F
  • Clay Shirky's article was very interesting. Obviously, the proximity factor (in the database example) was introduced to soften the relation between terms (as in the article's example of 'gay' and 'queer'). – tzot Sep 23 '08 at 13:16
  • For some reason I couldn't find the link to Clay Shirky's article in the WikiAnswers page. Here it is: http://www.shirky.com/writings/ontology_overrated.html. I liked it too. – Peter V. Mørch Jan 03 '13 at 21:44
4

Actually I would say that it is not so much a hierarchical system as a semantic net with perceived distances between the meanings of tags. What I mean: mathematics is closer to experimental physics than to gardening.

One way to build such a net: form pairs of tags and let people judge the perceived distance (using a measure like 1-10, meaning something like [synonyms, alike,...,antonyms], ...) and, when searching, search for all tags within a certain distance.

Does the measure have to be the same when coming from the opposite direction ([a,b] close -> [b,a] close)? Or is proximity transitive ([a,b] close and [b,c] close -> [a,c] close)?

Maybe the first word will by default trigger a different semantic field? If you start at "social worker", "analyst" is near. If you start at "programmer", "analyst" is near as well. But starting at either of these points, you probably would not count the other as near ("social worker" is by no means close to "programmer").

You would therefore have only pairs judged, each in both directions (in random order).

[TagRelations]
tagId integer
closeTagId integer
proximity integer

Example for selection of similar tags:

select closeTagId from TagRelations where tagId = :tagID and proximity < 3
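
A minimal sketch of that lookup, assuming Python's built-in sqlite3 module and an in-memory database. The function name close_tags, the tag IDs (1 = mathematics, 2 = experimental physics, 3 = gardening) and the proximity values are made up for illustration. Note that each direction of a pair is stored as its own row, so the distance from a to b need not equal the distance from b to a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TagRelations ("
    "tagId INTEGER, closeTagId INTEGER, proximity INTEGER)"
)

# Made-up judgments: 1 = mathematics, 2 = experimental physics, 3 = gardening.
# Each direction is a separate row, so proximity can be asymmetric.
conn.executemany(
    "INSERT INTO TagRelations VALUES (?, ?, ?)",
    [(1, 2, 2), (2, 1, 4), (1, 3, 9)],
)


def close_tags(conn, tag_id, max_proximity=3):
    """Return IDs of tags judged closer than max_proximity to tag_id."""
    rows = conn.execute(
        "SELECT closeTagId FROM TagRelations "
        "WHERE tagId = ? AND proximity < ? "
        "ORDER BY proximity",
        (tag_id, max_proximity),
    )
    return [row[0] for row in rows]
```

With this sample data, starting from mathematics finds experimental physics, while starting from experimental physics finds nothing under the same threshold, because the reverse pair was judged as more distant.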
Ralph M. Rickenbach
  • The proximity is one-way; if it should be two-way, then a different record with a different proximity would be inserted. – tzot Sep 23 '08 at 13:00
  • @malach: As a UX issue, regarding the use of hierarchy, I would say that (i) the software should use the semantic net approach you described based on mathematics, but (ii) users who want to do "gardening" on their personal tag collections should be *allowed to*, but not *forced to*, arrange tags into hierarchies, because *some* users will find that more comfortable than a flat list. In software systems where "personalization" of tags is out of the question, a flat list could be used unless domain experts ruled otherwise. – rwong May 10 '11 at 08:08
2

The mechanism I implemented does not use the given tags directly; instead it goes through an indirect lookup table (not strictly in the DBMS sense) that links each tag to its implied tags (obviously, each tag is also linked to itself for this to work).

In a Python project, the lookup table is a dictionary keyed on tags, whose values are sets of tags (where tags are plain strings).

In a database project (which RDBMS engine it was is irrelevant), there were the following tables:
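
A minimal sketch of such a dictionary, using the tags from the question as sample data. The names IMPLIED_TAGS and expand_tags are hypothetical, and the expansion here goes only one level deep, which is an assumption rather than a description of the original project.

```python
# Hypothetical lookup table: each tag maps to the set of tags it implies.
# Every tag is linked to itself so that a plain lookup also returns it.
IMPLIED_TAGS = {
    "c#": {"c#", ".net"},
    ".net": {".net"},
    "blondes": {"blondes", "derisive"},
    "derisive": {"derisive"},
}


def expand_tags(tags):
    """Return the given tags plus every tag they imply (one level deep)."""
    expanded = set()
    for tag in tags:
        # Tags missing from the table fall back to implying only themselves.
        expanded |= IMPLIED_TAGS.get(tag, {tag})
    return expanded
```

For example, `expand_tags(["c#"])` yields both "c#" and ".net", so an item tagged only "c#" would also match a search for ".net".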

[Tags]
tagID integer primary key
tagName text

[TagRelations]
tagID integer # first part of two-field key
tagID_parent integer # second part of key
trlValue float

where trlValue was a value in the (0, 1] range, used to give a weight to each linked tag; a self-to-self tag relation always carries 1.0 in trlValue, while the rest are calculated algorithmically (it's not important how exactly). Think of the example jokes database I gave; a ['blonde', 'derisive', 0.5] record would correlate with a ['pondian', 'derisive', 0.5] record and therefore suggest all derisive jokes given another.
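
A rough sketch of how these two tables could be queried, assuming Python's built-in sqlite3 module and an in-memory database. The function name related_tags is hypothetical, and multiplying the two trlValue weights to rank tags that share a parent is an assumption made for illustration; the original weighting algorithm is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tags (tagID INTEGER PRIMARY KEY, tagName TEXT)")
conn.execute(
    "CREATE TABLE TagRelations ("
    "tagID INTEGER, tagID_parent INTEGER, trlValue REAL, "
    "PRIMARY KEY (tagID, tagID_parent))"
)
conn.executemany(
    "INSERT INTO Tags VALUES (?, ?)",
    [(1, "blonde"), (2, "pondian"), (3, "derisive")],
)
conn.executemany(
    "INSERT INTO TagRelations VALUES (?, ?, ?)",
    [
        (1, 1, 1.0), (2, 2, 1.0), (3, 3, 1.0),  # self-to-self rows carry 1.0
        (1, 3, 0.5), (2, 3, 0.5),               # both tags imply 'derisive'
    ],
)


def related_tags(conn, tag_name):
    """Return (tagName, weight) pairs for tags sharing a parent with tag_name.

    The weight is the product of the two trlValue entries along the path,
    which is an illustrative assumption, not the original algorithm.
    """
    rows = conn.execute(
        "SELECT t.tagName, MAX(a.trlValue * b.trlValue) AS weight "
        "FROM Tags AS src "
        "JOIN TagRelations AS a ON a.tagID = src.tagID "
        "JOIN TagRelations AS b ON b.tagID_parent = a.tagID_parent "
        "JOIN Tags AS t ON t.tagID = b.tagID "
        "WHERE src.tagName = ? "
        "GROUP BY t.tagName "
        "ORDER BY weight DESC, t.tagName",
        (tag_name,),
    )
    return rows.fetchall()
```

Starting from 'blonde', this returns 'blonde' itself first, then 'derisive', then 'pondian' with a lower weight, which is the sense in which one derisive joke can suggest the others.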

tzot