
I'm looking for a way to find specific patterns in text. For example, I would like to find all references in a text that are formatted like this:

  • Baron, Naomi (2000) Alphabet to Email: How Written English Evolved and Where It's Heading, Routledge: London and New York.

Anything in the input text that is similar to this should be returned. Is there an algorithm that is good at this? All I have found so far are algorithms for finding similar strings in text.

I was thinking about using regular expressions, but I don't know if that is the best way to do it, because I need something that calculates a similarity score for each candidate and then returns the hits with the best scores.

MaticDiba
  • What language? Sounds like you need something like [Sphinx](http://sphinxsearch.com/) – Christian Jun 19 '12 at 08:40
  • Describe what you are looking for more precisely: try to describe the [grammar](http://en.wikipedia.org/wiki/Formal_grammar) with more than one example. Once you do, it will be clear whether a regex is enough or whether you need a [context-free](http://en.wikipedia.org/wiki/Context-free_language) parser (and which one: [LR? SLR?](http://en.wikipedia.org/wiki/LR_parser) or maybe [LL](http://en.wikipedia.org/wiki/LR_parser)?) – amit Jun 19 '12 at 08:59
  • Parsing these strings and computing similarity are two distinct tasks. – Fred Foo Jun 19 '12 at 09:07

1 Answer


The technique you are looking for is called Information Extraction.

Here is my answer to a similar question:

How does Apple find dates, times and addresses in emails?

You might also need to combine it with some Named Entity Recognition: http://en.wikipedia.org/wiki/Named-entity_recognition
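To make the idea concrete, here is a minimal rule-based sketch in Python. It is only one simple way to approach extraction with a score, not a full information extraction system. It assumes one candidate reference per line and treats three features of the example format as independent signals (a "Surname, Given" opening, a parenthesised year, and a trailing "Publisher: Place."). The names `FEATURES`, `score`, and `find_references`, the patterns themselves, and the default threshold are all hypothetical choices made for illustration.

```python
import re

# Hypothetical feature patterns, modelled on the single example reference
# from the question; real bibliographies will need more (and looser) rules.
FEATURES = {
    # "Baron, Naomi" at the start of the line
    "author": re.compile(r"^[A-Z][\w'-]+, [A-Z][\w.'-]+"),
    # "(2000)" or "(1999a)"
    "year": re.compile(r"\((?:1[5-9]|20)\d{2}[a-z]?\)"),
    # ", Routledge: London and New York." at the end of the line
    "publisher": re.compile(r", [A-Z][\w&' -]+: [A-Z][\w ,-]+\.$"),
}


def score(line):
    """Return the fraction of feature patterns found in the line (0.0 to 1.0)."""
    line = line.strip()
    hits = sum(1 for pattern in FEATURES.values() if pattern.search(line))
    return hits / len(FEATURES)


def find_references(text, threshold=0.6):
    """Return (score, line) pairs at or above the threshold, best score first."""
    scored = [(score(line), line.strip())
              for line in text.splitlines() if line.strip()]
    hits = [pair for pair in scored if pair[0] >= threshold]
    return sorted(hits, key=lambda pair: -pair[0])


if __name__ == "__main__":
    sample = (
        "I was thinking about using regular expressions.\n"
        "Baron, Naomi (2000) Alphabet to Email: How Written English Evolved "
        "and Where It's Heading, Routledge: London and New York.\n"
        "Smith, John (1999) Some Title.\n"
    )
    for value, line in find_references(sample):
        print(round(value, 2), line)
```

Because each feature is checked separately, a line that matches only some of them still gets a partial score instead of being rejected outright, which is the ranking behaviour the question asks for. A trained Named Entity Recognition model could replace the hand-written author and publisher patterns if the formats vary a lot.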

Neil McGuigan