I have a task ahead of me which relies on interpreting the structure of a text – to be precise, a monolingual dictionary. The dictionary has quite complex entries: up to 29 unique elements, some of which are nested within others. I am designing my own XML schema for the dictionary, but I would like to write a program that automatically parses the plain text I have.

I have some basic skills in Ruby and I am a rather experienced RegEx user, but I think creating lots of if-trees and extremely long RegEx formulas is probably not the best idea. I have found some information on Parsing Expression Grammar, Backus Normal Form and W-grammar, but it is unclear to me what each of them is best suited for.

My question is: what is the best way to interpret the structure of a text written in a natural language? I don't want to interpret the language itself, but rather to divide each entry into segments based on the characters and keywords used, as well as their neighborhood. What gems and resources would you suggest?


Edit: here's an example of a moderately simple entry from the dictionary (in Polish). What I want to do is to tag each element (senses, explanations, collocations, label markers etc.). As you can see, I am looking for an efficient way to encompass a large number of cases in a tree-like form. Another problem is that I need lots of captures, because I want to tag the segments in XML from bigger to smaller.

MrVocabulary
  • Can you post (or link to) those 29 entries? I think I could write a regex for them. The idea is to write _all_ the permutations as strings. Then create a _ternary-tree to regex trie_. See this example http://www.regexformat.com/default_files/Rx5_ScrnSht01.jpg. It is a dictionary, but works pretty good with normal strings. –  Jul 11 '15 at 17:52
  • Well, I have more entries – thousands, to be exact – it's that they are composed of up to 29 distinct elements. The problem is that not all of them are always present, they sometimes shift their order, and there can be an entry embedded within another entry. As much as I appreciate the offer, I am convinced there are too many possibilities for any single regex to be efficient. I posted some additional info in the original post. Could you explain, however, what you meant by "write all the permutations as strings"? To generate every possible combination, even if it's kilometers long? – MrVocabulary Jul 11 '15 at 18:05
  • I am not sure I understand your question – what is my goal? I am digitizing a historic dictionary of Polish and I want to be able to get to specific types of information (I intend to convert the text into XML and then XML into a database with very specific queries for linguistic research). – MrVocabulary Jul 11 '15 at 18:18
  • `digitizing a historic dictionary` Like words as used in sentences? –  Jul 11 '15 at 18:24
  • Okay, looks interesting, I will try it out – thanks for the suggestion. – MrVocabulary Jul 11 '15 at 18:27
  • Well, I want to tag a headword and the words given as synonyms, I want to tag the grammar part and the endings listed, each sense, and what elements/explanations/collocations are used in the dictionary. I will certainly need to extract the list of headwords, the list of collocations, the list of grammatical info etc., and I do plan to create a front-end, probably a website, for people to just browse it. So far even segmenting the post-OCR text into separate entries has proven difficult – it's a wall of text at the moment. – MrVocabulary Jul 11 '15 at 18:36
  • Looks like a daunting task.. –  Jul 11 '15 at 18:46
  • Thanks. Not my first dictionary, but the first of such size and complexity. I will try to see how this program could aid my task. – MrVocabulary Jul 11 '15 at 19:25
  • Yes, it did – trying to see how it works. Looks complicated. – MrVocabulary Jul 11 '15 at 20:20
  • Take it slow, you'll get used to it. –  Jul 11 '15 at 20:26
  • You sound like you need a parser; doesn't sound terribly complex, assuming you can write a grammar for what you have. See http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 – Ira Baxter Jul 12 '15 at 04:37
  • Thanks, will definitely give it a try! – MrVocabulary Jul 12 '15 at 09:12

1 Answer

This looks like a problem that would be well suited to Treetop, a Ruby parsing library based on Parsing Expression Grammars. I don't think I have enough information to be sure that it will work, but it lets you combine regular expressions into a larger structure in which each of the 29 elements is managed separately and its information is extracted or represented using any of Ruby's features as appropriate. That seems like the sort of feature set you need.
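To illustrate the general idea of one small rule per element instead of one giant regex, here is a minimal sketch. It does not use Treetop; it is a hand-written recursive-descent parser built on Ruby's standard-library `StringScanner`, so it runs without installing any gem. The entry layout (`headword, grammar-label` followed by numbered senses with an optional label in parentheses), the sample entry, and the names `EntryParser`, `Entry` and `Sense` are all assumptions invented for this example, not the real dictionary's format.

```ruby
require "strscan"

# Hypothetical entry layout, invented for this sketch:
#   entry := headword "," grammar sense+
#   sense := number "." [ "(" label ")" ] definition
Sense = Struct.new(:number, :label, :definition)
Entry = Struct.new(:headword, :grammar, :senses)

class EntryParser
  def initialize(text)
    @s = StringScanner.new(text)
  end

  # entry := headword "," grammar sense+
  def parse_entry
    headword = expect(/[[:alpha:]]+/, "headword")
    expect(/,\s*/, "comma")
    grammar = expect(/[[:alpha:]]+\./, "grammar label")
    senses = []
    # Keep parsing senses while the upcoming text looks like "1.", "2.", ...
    senses << parse_sense while @s.check(/\s*\d+\./)
    raise "expected at least one sense at offset #{@s.pos}" if senses.empty?
    Entry.new(headword, grammar, senses)
  end

  private

  # sense := number "." [ "(" label ")" ] definition
  def parse_sense
    @s.skip(/\s*/)
    number = expect(/\d+/, "sense number").to_i
    expect(/\.\s*/, "period")
    # The label is optional, so a failed scan is not an error here.
    label = @s.scan(/\(([^)]*)\)\s*/) ? @s[1] : nil
    # The definition runs up to the next sense number or the end of input.
    definition = @s.scan_until(/(?=\s+\d+\.|\z)/).strip
    Sense.new(number, label, definition)
  end

  # Consume the pattern at the current position or fail with its offset.
  def expect(pattern, what)
    @s.scan(pattern) or raise "expected #{what} at offset #{@s.pos}"
  end
end

entry = EntryParser.new("dom, m. 1. budynek mieszkalny 2. (przen.) rodzina").parse_entry
entry.senses.each { |sense| p sense.to_a }
```

Because each element has its own method, optional or reordered elements become local decisions inside one rule, and the nested `Entry`/`Sense` result maps naturally onto nested XML tags. A PEG tool such as Treetop expresses the same rules declaratively in a grammar file. Note that this sketch would split a definition wrongly if the definition itself contained a number followed by a period.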

gymbrall
  • Can't say for sure either, but it looks like it might be just the way to do it. Will definitely try it out, thanks! – MrVocabulary Jul 12 '15 at 09:14