I am trying to build a kind of data miner with Python. The data I want to examine is a dictionary of the Greek language. The dictionary was originally in PDF format, and I converted it to roughly corresponding HTML so that I could parse it more easily. I have done some further formatting on it, since the data structure was heavily distorted.

My current task is to find and separately store the individual words, along with their descriptions. My first thought was to identify the words first, apart from their descriptions. The header of each word's entry has a very specific syntax, and I use that to build a regular expression that matches every one of them.

There is one problem, though. Despite the formatting I have done on the HTML so far, there are still many points where a run of logically connected data is interrupted by the sequence < /br> followed by a newline, in random order. Is there any way to direct my regular expression to "ignore" that sequence, that is, to treat it as non-existent whenever it is met, so that matches interrupted by it are still found?

That is, without putting a (< br/>\n)? in every part of my RE, to cover every possible case.

The regular expression I use is the following:

(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>  

and does a fine job with the matching, when the data is not interrupted by the sequence given above.

To restate the problem in case it was unclear: the interrupting sequence can occur anywhere within the match. I am therefore looking for a way to have the sequence ignored when deciding whether to return a match, other than covering every single spot where it might occur, as I explained earlier.

Noob Doob
  • Have you tried removing the br before doing the regex search? `myDocument = myDocument.replace("<br/>", "")`? – Kevin Dec 29 '14 at 15:28
  • That is a solution. Still, if there is an answer to what I am asking, I would have the solution ready right away, and moreover I imagine there will (generally) be cases where one would like to ignore some specific sequence without altering the text given, so I believe it's worth researching that possibility. – Noob Doob Dec 29 '14 at 15:33
  • seems similar to this http://stackoverflow.com/questions/2078915/a-regular-expression-to-exclude-a-word-string – aberna Dec 29 '14 at 15:45
  • I don't want to drop the match, if it is interrupted. I don't want to "exclude" a string. I want to accept the match, whether it contains the interrupting sequence or not. That is, to direct the RE to treat the sequence, as if it did not exist, and not to use it to decide whether to return a match or not. The problem lies in that the interrupting sequence can be placed anywhere within the match, and not in a specific position. – Noob Doob Dec 29 '14 at 15:50
  • It would be best to give four or five lines with this kind of interruption. But if I understood correctly, you would have to add `[<\/br>\n]` to your matching character classes which are not a `.`. I'm pretty sure removing the `\n` as @Kevin said is the best option to get proper matches as output. – Tensibai Dec 29 '14 at 16:35
  • And judging from the answer and what I searched, that seems to be the case. – Noob Doob Dec 29 '14 at 17:10

1 Answer

What you're asking for is a different regular expression.

The new regular expression would be the old one, with (<br\s*?/>\n?)? or the like after every non-quantifier character.

You could write something to transmute a regular expression into the form you're looking for. It would take in your existing regex and produce a br-tolerant regex. No construct in the regular expression grammar exists to do this for you automatically.
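A rough sketch of such a transformer in Python follows. It is an illustration under stated assumptions, not a general regex rewriter: the helper name `make_tolerant` and the `BR` pattern are made up for this example, the spellings of the interruption that `BR` accepts are a guess, and only literal characters, backslash escapes, character classes, plain groups, `|`, and the quantifiers `*`, `+`, `?` are handled (no `(?...)` constructs, no `{m,n}`). Each atom `X` is wrapped as `(?:X(?:...)?)`, so a quantifier that follows still applies to the atom and the capture group numbers stay the same.

```python
import re

# Assumed spellings of the interruption: <br/>, </br>, < br/>, < /br>,
# with an optional newline on either side. The whole thing is optional.
BR = r"(?:\n?<\s*/?\s*br\s*/?\s*>\n?)?"


def make_tolerant(pattern, optional=BR):
    """Return `pattern` with every atom X rewritten as (?:X<optional>)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in "()|*+?":
            # Grouping, alternation, and quantifiers are copied unchanged.
            out.append(ch)
            i += 1
            continue
        if ch == "\\":
            # A backslash escape such as \( or \. is one atom.
            atom = pattern[i:i + 2]
            i += 2
        elif ch == "[":
            # A character class is copied whole as one atom.
            j = i + 1
            if pattern[j] == "^":
                j += 1
            if pattern[j] == "]":
                j += 1
            while pattern[j] != "]":
                j += 2 if pattern[j] == "\\" else 1
            atom = pattern[i:j + 1]
            i = j + 1
        else:
            atom = ch
            i += 1
        out.append("(?:" + atom + optional + ")")
    return "".join(out)


# The header pattern from the question.
HEADER_SRC = (
    r"(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>"
    r"(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*"
    r"( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>"
)

tolerant = re.compile(make_tolerant(HEADER_SRC))

# Hypothetical sample: a header interrupted in the middle of the word.
sample = "η <b>γά<br/>\nτα</b> ΟΥΣ.<b>:</b> description"
match = tolerant.search(sample)
```

Note that the matched text still contains the interrupting sequence, so it would have to be stripped from the result afterwards anyway.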

I think the easier thing to do is to preprocess the source document so that it no longer contains the sequences you wish to ignore. This should be an easy text substitution.
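A minimal sketch of that substitution in Python, using the header pattern from the question unchanged. The `INTERRUPT` pattern, the helper name `find_headers`, and the sample string are assumptions made for illustration; the exact spelling of the interrupting sequence in the real document may differ.

```python
import re

# Assumed spellings of the interruption: <br/>, </br>, < br/>, < /br>,
# with an optional newline on either side.
INTERRUPT = re.compile(r"\n?<\s*/?\s*br\s*/?\s*>\n?")

# The header pattern from the question.
HEADER = re.compile(
    r"(ο|η|το)?( )?<b>([α-ωάέήίόύώϊϋΐΰ])*</b>"
    r"(, ((ο|η|το)? <b>([α-ωάέήίόύώϊϋΐΰ])*</b>))*"
    r"( \(.*\))? ([Α-Ω])*\.( \(.*\))?<b>:</b>"
)


def find_headers(html):
    """Strip the interrupting sequence, then return all header matches."""
    cleaned = INTERRUPT.sub("", html)
    return [m.group(0) for m in HEADER.finditer(cleaned)]


# Hypothetical sample: a header interrupted in the middle of the word.
sample = "η <b>γά<br/>\nτα</b> ΟΥΣ.<b>:</b> description"
headers = find_headers(sample)
```

The original string is left untouched; only the cleaned copy is searched.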

If it weren't for your explicit use of the <b> tags for meaning, an alternative would be to just take the plain-text document content instead of the HTML content.

Borealid