I'm editing text taken directly from an OCR engine, and in some paragraphs the engine drops an opening or closing quote. I prefer editing in HTML mode, so I end up with text like:
<p>“Wait a moment,” Jacey said. The street light lit up his aged, rat face. Who’s on the move?”</p>
Notice the missing “ before Who’s.
Another sentence:
<p>“He said he’ coming afer you,” Harry said, and he’ bringing the boys too!”</p>
I use this regex: ([>\.\,])(.*?)”
which seems to do the job for the second sentence but not for the first. Because the regex matches from left to right, the lazy group starts at the earliest delimiter it can and so also captures the extra sentence (The street light lit up his aged, rat face.), which should not be within the quotes.
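To make the behaviour easy to reproduce, here is a minimal sketch that runs the same regex over both sample paragraphs. It assumes Python's re module as a stand-in for the editor's regex engine, which may differ in details; the names FIRST, SECOND, PATTERN and unquoted_spans are hypothetical and exist only for this illustration.

```python
import re

# The two sample paragraphs from above, with the OCR errors left untouched.
FIRST = "<p>“Wait a moment,” Jacey said. The street light lit up his aged, rat face. Who’s on the move?”</p>"
SECOND = "<p>“He said he’ coming afer you,” Harry said, and he’ bringing the boys too!”</p>"

# Same regex as above. Python's re stands in for the editor's engine (assumption).
PATTERN = re.compile(r"([>\.\,])(.*?)”")


def unquoted_spans(paragraph):
    """Return group 2 of every match that does not already start with “."""
    return [
        m.group(2)
        for m in PATTERN.finditer(paragraph)
        if not m.group(2).startswith("“")
    ]


for paragraph in (FIRST, SECOND):
    print(unquoted_spans(paragraph))
```

For the second paragraph the only span returned is " and he’ bringing the boys too!", which is the text that needs the opening quote. For the first paragraph the span starts right after "Jacey said." and runs through "Who’s on the move?", so it includes the street light sentence as well.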
I was thinking that the problem could be solved if the matching were done from right to left. I know this is an option available in C#, but I'm using the regex engine of a text editor to edit a simple text file. Is there a way to locate just the last sentence before the closing ”, which here is the sentence Who’s on the move?
[EDIT]
I have been trying the lookbehind regex: (?<=(?:\. |, |>)(\w)(.*?))(”)
which seems to find all sentences with a missing opening quote (“). The problem is that I cannot replace the contents of the (?<=) construct using \3“\1\2\3, because a lookbehind is zero-width: only the ” is actually matched and replaced, so the captured text is duplicated instead. For example, with the above regex the sentence Who’s on the move?” becomes Who’s on the move?”“Who’s on the move?”
Any ideas would be appreciated. Thanks.