Parsing textfile with citations

Question

I'm trying to segment a text file into sentences using PHP. I am currently using the following call:

$results = preg_split('/(?<=[.?!])\s+/', $stringtest, -1, PREG_SPLIT_NO_EMPTY);

The problem is that a sentence like this one:

In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd. in Isay 11).

gets split like this:

[0] In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd. 
[1] in Isay 11).

Another example is:

Dr. Evelyn Hooker, a heterosexual psychologist...

Here the period after "Dr." would trigger an unwanted split as well.

These texts are all from the MASC corpus for NLP.

@JayBlanchard: I guess OP wishes to split on punctuation. But since they are also present elsewhere it's causing trouble. — Rahul, Apr 24 '17 at 19:16
@WiktorStribiżew One of the issues is that the answer to the duplicate question does not account for instances such as: Dobbens reasoned that most parents would not raise their children to be homosexual; “They’re not like ‘My child’s going to be gay!”’ (Dobbens). It splits it into ".....gay!'" and (Dobbens). And I can't ask follow up questions there because I don't have enough points. — 39fredy, Apr 24 '17 at 19:53
[It does not split the sentence at all.](http://ideone.com/hhasAk) What do you expect? I just added `qtd` to the list of abbrevs - http://ideone.com/cIBdr0 - and all works well. — Wiktor Stribiżew, Apr 24 '17 at 20:02
You should explain what you expect for each case in the question itself. — Wiktor Stribiżew, Apr 24 '17 at 20:07
Sorry, the problem only arises when there is another sentence after the "(Dobbens)." example: http://ideone.com/qJ1BfC It's the third example. The second element in the array should be part of the first (i.e. the citation (Dobbens) should be part of the first sentence). — 39fredy, Apr 24 '17 at 20:16
But what is the rule? How can you define it verbally? Any `(` followed after a closing citation symbol (or even after a straight double quote) should be part of the preceding sentence? — Wiktor Stribiżew, Apr 24 '17 at 20:18
The citation should be part of the sentence where it is used ( the sentence immediately before the citation). This is particularly important when the sentence before the citation ( the citation being (Dobbens). ) is a quote with punctuation ( in this case '!' ) — 39fredy, Apr 24 '17 at 20:22
Ok, so the rule is: a final punctuation (`[.!?]`) followed with `”` or `’` (or maybe we need to also add `'` and `"`/`»`) and then having `(` should not be split, right? See [this demo](http://ideone.com/QSgPxs). — Wiktor Stribiżew, Apr 24 '17 at 20:25
The issue is that in scenarios like "!”’(Dobbens)." it works because there is no space between the " and the (. But when you put a space it breaks. Is there a way for us to add a regex to fix this? — 39fredy, Apr 24 '17 at 20:28
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/142546/discussion-between-39fredy-and-wiktor-stribizew). — 39fredy, Apr 24 '17 at 20:31
As for the latest regex remark, I tried almost the same code and [got expected results in both cases, with and without spaces](http://ideone.com/wO0Fra) — Wiktor Stribiżew, Apr 24 '17 at 20:43

score 1 · Accepted Answer · edited May 23 '17 at 11:46

You may extend @ndn's solution to achieve what you need. Note that $before_regexes contains a list of known abbreviations; add any that occur in your corpus. I added qtd there.
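
If the abbreviation list grows, it may be easier to keep it as a plain array and build the pattern from it. The following is only a sketch of that idea, not part of the original solution: the names `$abbreviations` and `$abbreviation_regex` and the sample entries are assumptions made for illustration.

```php
<?php
// Hypothetical list; extend it with the abbreviations found in your own corpus.
$abbreviations = array('pp', 'vs', 'qtd', 'approx');

// Escape each entry so that regex metacharacters are matched literally.
$escaped = array_map(function ($abbr) {
    return preg_quote($abbr, '/');
}, $abbreviations);

// Matches when the text before a candidate split point ends with
// "<abbreviation>." followed by one whitespace character.
$abbreviation_regex = '/\b(?:' . implode('|', $escaped) . ')\.\s\Z/su';

var_dump(preg_match($abbreviation_regex, 'with men” (qtd. ')); // int(1)
var_dump(preg_match($abbreviation_regex, 'with men. '));       // int(0)
```

A pattern built this way could be used as one more entry in $before_regexes, paired with an $after_regexes entry and a false in $is_sentence_boundary.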

Also note that $before_regexes and $after_regexes are paired by index. I added the pair '/(?:[”’"\'»])\s*\Z/u' / '/\A(?:\(\p{L})/u' and marked it as a non-sentence boundary (the first false in the $is_sentence_boundary array). The pair means: if the text before the candidate split point ends with a quotation mark (”, ’, ", ' or ») followed by zero or more whitespace characters, and the text after it starts with ( (matched with \() and any Unicode letter (\p{L}), then there is no split at that point.
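
To see the pair in isolation, it can be tested against the text on either side of a candidate split point. This is a minimal sketch; the helper name `citation_pair_matches` is hypothetical and only wraps the two `preg_match` calls the function performs for each pair. The file must be saved as UTF-8 for the curly quotes to match.

```php
<?php
// The citation pair, copied from the rule lists in the function.
$before_regex = '/(?:[”’"\'»])\s*\Z/u';
$after_regex  = '/\A(?:\(\p{L})/u';

// Hypothetical helper: true when the pair applies at the point
// between $before and $after, meaning no split happens there.
function citation_pair_matches($before, $after, $before_regex, $after_regex)
{
    return preg_match($before_regex, $before) === 1
        && preg_match($after_regex, $after) === 1;
}

// Closing quotes, a space, then "(D": the citation stays with its sentence.
var_dump(citation_pair_matches('going to be gay!”’ ', '(Dobbens).', $before_regex, $after_regex)); // bool(true)
// Closing quotes followed by ordinary text: this pair does not apply.
var_dump(citation_pair_matches('going to be gay!”’ ', 'The next', $before_regex, $after_regex));   // bool(false)
// No closing quote before the split point: this pair does not apply.
var_dump(citation_pair_matches('(Dobbens). ', 'The next', $before_regex, $after_regex));           // bool(false)
```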

function sentence_split($text) {
    $before_regexes = array('/(?:[”’"\'»])\s*\Z/u',
        '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|qtd)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:\(\p{L})/u',
        '/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = count($before_regexes); // check every rule pair, including the last one

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

See the PHP demo.
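
The mechanism is easier to follow on a reduced rule table. The sketch below is not the function above: `split_with_rules`, the three-entry `$rules` table and the sample sentence are assumptions made for illustration, and it uses the mb_* functions so that the look-ahead window never cuts a multibyte character in half. It keeps the same idea: walk through every candidate split point, test the rule pairs in order, and let the first matching pair decide.

```php
<?php
// Hypothetical reduced rule table: [before regex, after regex, is sentence boundary].
// The first matching pair wins, so exceptions come before the general rule.
$rules = array(
    // Closing quote (+ optional whitespace) followed by "(Letter": citation, no split.
    array('/[”’"\'»]\s*\z/u', '/\A\(\p{L}/u', false),
    // Known abbreviation followed by whitespace: no split.
    array('/\b(?:Dr|Mr|Mrs|Ms|Prof|qtd)\.\s\z/u', '/\A/u', false),
    // Terminal punctuation, optional closing quotes or brackets, whitespace: split.
    array('/[.!?…][”’"\'»\p{Pe}]*\s\z/u', '/\A/u', true),
);

function split_with_rules($text, array $rules)
{
    $sentences = array();
    $start = 0;                            // character offset where the current sentence begins
    $length = mb_strlen($text, 'UTF-8');

    for ($i = 1; $i < $length; $i++) {
        $before = mb_substr($text, 0, $i, 'UTF-8');  // everything left of the candidate split point
        $after  = mb_substr($text, $i, 10, 'UTF-8'); // short look-ahead window
        foreach ($rules as $rule) {
            list($before_regex, $after_regex, $is_boundary) = $rule;
            if (preg_match($before_regex, $before) && preg_match($after_regex, $after)) {
                if ($is_boundary) {
                    $sentences[] = mb_substr($text, $start, $i - $start, 'UTF-8');
                    $start = $i;
                }
                break; // the first matching pair decides; later pairs are not consulted
            }
        }
    }
    if ($start < $length) {
        $sentences[] = mb_substr($text, $start, null, 'UTF-8'); // remaining text is the last sentence
    }
    return $sentences;
}

$sample = 'Dr. Hooker said “No!” (qtd. in Isay 11). Next one.';
// Each sentence keeps its trailing whitespace, so trim before printing.
print_r(array_map('trim', split_with_rules($sample, $rules)));
```

With this table the sample should come back as two sentences, with the citation attached to the first one. Recomputing `$before` at every position makes the sketch quadratic in the text length, which is fine for a demonstration but not for a large corpus.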
