
I'm editing text that comes straight from an OCR engine, and in some paragraphs the engine drops the opening or closing quotes. I prefer editing in HTML mode, so I end up with text like:

<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. Who&rsquo;s on the move?&rdquo;</p>

Notice the missing &ldquo;.

Another sentence:

<p>&ldquo;He said he&rsquo; coming afer you,&rdquo; Harry said, and he&rsquo; bringing the boys too!&rdquo;</p>

I use this regex: `([>\.\,])(.*?)&rdquo;`, which seems to do the job for the second sentence but not for the first. Because the regex engine matches from left to right, the match also swallows the extra sentence `The street light lit up his aged, rat face.`, which should not be inside the quotes. I think the problem could be solved if the matching were done from right to left. I know this option is available in C#, but I'm using the regex engine of a text editor to edit a plain text file. Is there a way to locate just the last sentence before the `&rdquo;`, which here is `Who&rsquo;s on the move?`

[EDIT] I have been trying the lookbehind regex `(?<=(?:\. |, |>)(\w)(.*?))(&rdquo;)`, which seems to find all sentences with a missing opening quote (`&ldquo;`). The problem is that I cannot replace the contents of the `(?<=)` construct using `\3&ldquo;\1\2\3`, because a lookbehind is zero-width: the text it matched is not consumed, so the replacement duplicates that text instead of replacing it. For example, with the above regex the sentence `Who&rsquo;s on the move?&rdquo;` becomes `Who&rsquo;s on the move?&rdquo;&ldquo;Who&rsquo;s on the move?&rdquo;`

Any ideas will be appreciated. Thanks

zx81
medwatt

1 Answer


Recursion and Defined Subroutines

The following regex checks that the quotes in a string are balanced. The code below (see its output in the online demo) runs it against several strings. The explanations are in the regex comments.

$balanced_string_regex = "~(?sx)                  # Free-Spacing
(?(DEFINE)            # Define a few subroutines
   (?<double>&ldquo;(?:(?!&[lr]squo;).)*&rdquo;)  # full set of doubles (no quotes inside)
   (?<single>&lsquo;(?:(?!&[lr]dquo;).)*&rsquo;)  # full set of singles (no quotes inside)
   (?<notquotes>(?:(?!&[lr][sd]quo;).)*)          # chars that are not quotes
)                     # end DEFINE

^                       # Start of string
(?:                     # Start non-capture group
   (?&notquotes)        # Any non-quote chars
   &l(?<type>[sd])quo;  # Opening quote, capture single or double type
   # any full singles, doubles, not quotes or recursion
   (?:(?&single)|(?&double)|(?&notquotes)|(?R))*
   &r\k<type>quo;       # Closing quote of the correct type
   (?&notquotes)        # Any trailing non-quote chars
)++                   # Repeat non-capture group
$                     # End of string
~";

$string = "&ldquo;He said  &rdquo; &lsquo;He said  &rsquo;";
check_string($string);
$string = "<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. Who&rsquo;s on the move?&rdquo;</p>";
check_string($string);
$string = "<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. &lsquo;Whos on the &ldquo;move?&rdquo; &rsquo;</p>";
check_string($string);
$string = "<p>&ldquo;He said he&rsquo; coming afer you,&rdquo; Harry said, and he&rsquo; bringing the boys too!&rdquo;</p>";
check_string($string);
$string = "<p>&ldquo;He &lsquo;said he&rsquo; coming afer you,&rdquo; Harry said, and he&ldquo; bringing the boys too!&rdquo;</p>";
check_string($string);


function check_string($string) {
    global $balanced_string_regex;
    echo (preg_match($balanced_string_regex, $string)) ?
        "Balanced!\n" :
        " Nah... Not Balanced.\n" ;
}

Output

Balanced!
 Nah... Not Balanced.
Balanced!
 Nah... Not Balanced.
Balanced!

Replacing Missing Quotes

As I've indicated in the comments, IMO replacing missing quotes is hazardous: before or after what word should the missing quote fall? If there was any kind of nesting, can we be sure that we've correctly identified the missing quote? For that reason, if you're going to do anything, my inclination would be to match the balanced portion (hoping it is correct) and remove any extra quotes.

The pattern above lends itself to all kinds of tweaks. For instance, on this regex demo, we match and replace an unbalanced quote. Since this was requested, I'll offer a second potential tweak with some reluctance—this one inserts a missing left quote at the beginning of the phrase preceding the unmatched right quote.
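
For readers who cannot open the demos, here is a rough sketch of the left-quote insertion idea. It is not the pattern from the demo: it is a simpler token walk, and the function name `insert_missing_left_quotes` is made up for illustration. It assumes only double quotes matter, that they are not nested, and that a phrase starts after the last `. `, `, `, `? `, `! ` or `>` before the unmatched `&rdquo;` (the same kind of boundaries the question uses). As said above, that guess can easily be wrong, so treat the result as a suggestion to review.

```php
<?php
// Hypothetical helper, not part of the original answer or its demos.
function insert_missing_left_quotes(string $html): string
{
    // Split on double-quote entities and keep them as tokens:
    // even indexes hold text, odd indexes hold &ldquo; or &rdquo;.
    $tokens = preg_split('~(&[lr]dquo;)~', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $open = 0; // opening quotes still waiting for their closing quote

    for ($i = 1; $i < count($tokens); $i += 2) {
        if ($tokens[$i] === '&ldquo;') {
            $open++;
        } elseif ($open > 0) {
            $open--; // this &rdquo; closes an earlier &ldquo;
        } else {
            // Unmatched &rdquo;: insert &ldquo; after the last assumed
            // phrase boundary in the text just before it (greedy .*),
            // or at the start of that text if no boundary is found.
            $tokens[$i - 1] = preg_replace(
                '~^(.*(?:[.,?!]\s|>))?~s',
                '$1&ldquo;',
                $tokens[$i - 1],
                1
            );
        }
    }
    return implode('', $tokens);
}

$string = "<p>&ldquo;Wait a moment,&rdquo; Jacey said. The street light lit up his aged, rat face. Who&rsquo;s on the move?&rdquo;</p>";
echo insert_missing_left_quotes($string), "\n";
```

On the two sample paragraphs from the question, this should place the `&ldquo;` right before `Who&rsquo;s` and right before `and he&rsquo;` respectively. Because a comma followed by a space counts as a boundary, a sentence such as `Who&rsquo;s on the move, Jack?&rdquo;` would get its quote before `Jack`, which is exactly the kind of hazard mentioned above.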

zx81
  • Thanks. This only checks whether there isn't a complementary opening or closing bracket. It does not attempt to provide a suitable position to put the missing quote. I think I'd have to make myself more clear. During editing, to find all missing quotes, I first eliminate all correctly quoted statements by converting them to ". This is what I do: Search: `([>\s])\&ldquo\;(.*?[^\s])\&rdquo\;([<\s])` Replace: `\1"\2"\3` What is left are those statements with one of the two quotes but not both. Fixing closing quotes is easy because I can search from right to left. – medwatt Jul 24 '14 at 02:46
  • `This only checks` You're funny. That's a really intricate regex. Finding a suitable position is a small tweak from where we are, I'll look at it a bit later. – zx81 Jul 24 '14 at 02:56
  • Hey bro, you don't have to accept the answer if it doesn't work for you. At that stage, what people use when they find an answer helpful is the "UP" arrow just above the checkmark (it upvotes). I suggest doing that until you're satisfied (you can unaccept), then accept when you're happy. Will look at the insertion problem. :) – zx81 Jul 24 '14 at 03:39
  • Next question: How do you want to balance the unbalanced quote? I can either close it: `&ldquo;&rdquo;` or simply remove it. Guessing a position is very hazardous IMO, the closing quote could be anywhere. – zx81 Jul 24 '14 at 03:57
  • I've already added some new information to my original post. What I want is to select the last sentence before the closing quote. `(?<=(?:\. |, |>)(\w)(.*?))(”)` will do it. The only problem is there's no way to replace the contents in the lookbehind group. – medwatt Jul 24 '14 at 06:56
  • Matching from right to left is not a problem btw, when I need to do it in PHP (which is rare) I reverse the string. – zx81 Jul 24 '14 at 07:00
  • We don't know if what's missing is the left quote or the right quote, right? That would depend on the string? (That's why initially I was thinking why just not delete the extra quote?) Anyhow, to get the ball rolling, [on this demo](http://regex101.com/r/gM4pI1/1) have a look at the SUBSTITUTIONS at the bottom where I've inserted a missing left quote (variation on the answer above) – zx81 Jul 24 '14 at 07:12
  • Quick note: added the `s` modifier in `(?sx)` in case we're going across lines. – zx81 Jul 24 '14 at 11:53
  • +1 this is not just an answer but a full article on advanced regex :) – anubhava Jul 24 '14 at 13:38