
I'm aware that this is a fairly general, open-ended question. I'm essentially looking for help in deciding on a way forward, and perhaps for some reading material.

I'm working on an algorithm that does unstructured text mining, and I'm trying to extract something specific from that text: the names of musical acts (solo artists, bands, etc.). The text itself has no predictable structure, but it is relatively short (one or two lines).

Some examples may be (not real events):

Concert Green Day At Wembley Stadium
Extraordinary representation - Norah Jones in Poland - at the Polish Opera

Now, I'm thinking of trying out a classifier, but the text seems too small to provide any real training information for it. There are probably several other text mining techniques, heuristics, or algorithms that may yield good results for this kind of problem (or perhaps no algorithm will).

Has QUIT--Anony-Mousse
Eugen
  • You were right: as-is, this question is probably too open-ended for SO. I suggest you search SO, and the web at large, with keywords like `Named entity recognition/extraction`, `NER`, etc., as this will give you more precise ideas about the practices and challenges in this domain. Although not a duplicate, this SO question may be a good place to start: http://stackoverflow.com/questions/1643616/algorithms-to-detect-phrases-and-keywords-from-text – mjv Jul 12 '11 at 20:33
  • Let me get this straight: do you have a list of bands you're looking for, or are you looking for band names in general? – Fred Foo Jul 13 '11 at 19:55

2 Answers


Because of the structure of your data, a pre-trained model will probably perform poorly. Besides, the generic organization, location, and person categories will probably not be useful to you.

I don't think the texts themselves are too small; most NER systems work on one sentence at a time. So providing your own training set to a NER library, such as http://nlp.stanford.edu/ner/index.shtml, will probably work well.
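As a rough illustration of what "your own training set" could look like, here is a minimal sketch that turns hand-annotated lines into one-token-per-line, tab-separated rows. It assumes the trainer accepts a token column and a label column with a blank line between examples; the `BAND` label, the `O` background label, and the helper names `to_token_rows` and `to_tsv` are hypothetical choices made for this example, so check the library's documentation for the exact format it expects.

```python
def to_token_rows(text, band):
    """Label each whitespace token BAND if it is part of the band name, else O.

    Assumption: the band name appears in the text as a contiguous token run.
    """
    tokens = text.split()
    band_tokens = band.split()
    labels = ["O"] * len(tokens)
    n = len(band_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == band_tokens:
            labels[i:i + n] = ["BAND"] * n
            break  # only the first occurrence is labeled in this sketch
    return list(zip(tokens, labels))


def to_tsv(examples):
    """Render (text, band) pairs as tab-separated rows, blank line between texts."""
    blocks = []
    for text, band in examples:
        rows = to_token_rows(text, band)
        blocks.append("\n".join(f"{tok}\t{lab}" for tok, lab in rows))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    examples = [
        ("Concert Green Day At Wembley Stadium", "Green Day"),
        ("Extraordinary representation - Norah Jones in Poland - at the Polish Opera",
         "Norah Jones"),
    ]
    print(to_tsv(examples), end="")
```

Annotating a few hundred such lines by hand, in the same capitalization style as the real data, is the part that takes the effort; the conversion itself is trivial.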

If you don't want to create a training set, you will need a dictionary of all the bands/artists. In that case, obviously, you can't find unknown bands/artists.
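The dictionary approach could be sketched roughly as follows: a case-insensitive, longest-match-first lookup of token runs against a set of known names. The function name `find_known_bands` and the tiny `known` list are made up for illustration; a real list would have to come from some external source.

```python
def find_known_bands(text, known_bands):
    """Return known band names found in text, preferring the longest match.

    Matching is case-insensitive and ignores punctuation at token edges.
    """
    tokens = text.split()
    lowered = [t.lower().strip(".,;:!?-") for t in tokens]
    known = {tuple(b.lower().split()) for b in known_bands}
    max_len = max((len(k) for k in known), default=0)

    found = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate run first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in known:
                matched = n
                break
        if matched:
            found.append(" ".join(tokens[i:i + matched]))
            i += matched
        else:
            i += 1
    return found


if __name__ == "__main__":
    known = ["Green Day", "Norah Jones"]  # illustrative list only
    print(find_known_bands("Concert Green Day At Wembley Stadium", known))
```

As the answer says, anything missing from the list is invisible to this approach, and names that are also ordinary words will produce false matches.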

Rasmus
  • I haven't tried, but I suspect NER taggers may run into trouble on sentences like `Concert Green Day At Wembley Stadium` due to the number of capitals. But if they do, then their output can be fed to a classifier (or checked against a simple list of bands extracted from Wikipedia). – Fred Foo Jul 13 '11 at 19:57
  • Yes, I think they will too. But not if they're trained on a custom data set with capitals like those present. – Rasmus Jul 14 '11 at 10:51

There is a simple NER algorithm that could simplify the task a bit: take the words that may or may not form a named entity and search for them on Google or Yahoo (via their APIs) twice, once as separate words and once as an exact phrase (i.e. in quotation marks). Divide the two result counts. A threshold on that ratio (<30) determines whether the words form a named entity.
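A minimal sketch of that check, with the search engine abstracted away: `hit_count` is a hypothetical callable that returns the number of results for a query string, so no real search API is assumed here. The direction of the division (separate-words count over exact-phrase count) is also an assumption, since the description above does not spell it out.

```python
def looks_like_entity(words, hit_count, threshold=30.0):
    """Guess whether a word sequence is a named entity from search hit counts.

    hit_count: hypothetical callable, query string -> number of results.
    Assumption: ratio = hits(separate words) / hits(exact phrase), and a
    ratio below the threshold suggests the words belong together.
    """
    phrase = " ".join(words)
    separate_hits = hit_count(phrase)            # words searched separately
    exact_hits = hit_count('"' + phrase + '"')   # exact phrase, in quotes
    if not exact_hits:
        return False  # the phrase never occurs as a unit
    return separate_hits / exact_hits < threshold
```

In practice `hit_count` would wrap whatever search API is available; for experiments it can be any function or a `dict.get` over cached counts. The threshold would likely need tuning for the search engine and language in use.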

jaboja