I have a large block of text. For example:

I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this.

Output:

I want to split a paragraph into sentences.

But, there is a problem.

My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2.

How do i split this.

This is the output I want. Can anybody guide me on how to do this in Swift?

Thanks.

Uzair

6 Answers

9

Use NSLinguisticTagger. It gets the sentences right for your given input, because it analyzes in actual linguistic terms.

Here's a rough draft (Swift 1.2, this won't compile in Swift 2.0):

let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTagsInRange(
    indices(s), scheme: NSLinguisticTagSchemeLexicalClass,
    options: nil, tokenRanges: &r)
var result = [String]()
let ixs = Array(enumerate(t)).filter {
    $0.1 == "SentenceTerminator"
    }.map {r[$0.0].startIndex}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].stringByTrimmingCharactersInSet(
             NSCharacterSet.whitespaceCharacterSet()))
    prev = advance(ix,1)
}

Here is a Swift 2.0 version (updated to Xcode 7 beta 6):

let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTagsInRange(
    s.characters.indices, scheme: NSLinguisticTagSchemeLexicalClass,
    tokenRanges: &r)
var result = [String]()
let ixs = t.enumerate().filter {
    $0.1 == "SentenceTerminator"
}.map {r[$0.0].startIndex}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].stringByTrimmingCharactersInSet(
            NSCharacterSet.whitespaceCharacterSet()))
    prev = ix.advancedBy(1)
}

And here it is updated for Swift 3:

let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTags(
    in: s.startIndex..<s.endIndex,
    scheme: NSLinguisticTagSchemeLexicalClass,
    tokenRanges: &r)
var result = [String]()
let ixs = t.enumerated().filter {
    $0.1 == "SentenceTerminator"
    }.map {r[$0.0].lowerBound}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].trimmingCharacters(
            in: NSCharacterSet.whitespaces))
    prev = s.index(after: ix)
}

result is an array of four strings, one sentence per string:

["I want to split a paragraph into sentences.", 
 "But, there is a problem.", 
 "My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2.", 
 "How do i split this."]
matt
  • I received a compiler error in the third line ("s.characters.indices"): String does not have a member named characters. Any hints? – ixany Sep 08 '15 at 20:40
  • @mrtn.lxo Did you notice that one version is for Swift 1.2 and one version is for Swift 2.0? – matt Sep 08 '15 at 20:55
  • just updated to xcode 7 beta 6 and it worked just perfectly fine. thank you so much, matt! great solution. – ixany Sep 10 '15 at 15:50
  • @mrtn.lxo Yeah, that's why I specifically said Xcode 7 beta 6 in my answer. They changed the language quite considerably from beta to beta...! – matt Sep 10 '15 at 16:43
  • Could you please provide an working Swift 3 solution? Thanks in advance! :) – ixany Jan 09 '17 at 16:39
  • @ixany Here you go. – matt Jan 09 '17 at 17:11
  • How can I convert this to swift 4? Thanks – dscrown Apr 17 '18 at 20:55
  • @dscrown Works fine for me. The only change is `NSLinguisticTagSchemeLexicalClass` to `NSLinguisticTagScheme.lexicalClass.rawValue`. – matt Apr 17 '18 at 21:50
  • One edge case to keep in mind with this solution is that strings that end without a sentence terminator will quietly be dropped from the result. e.g. `"This is a sentence. This is another sentence with no terminator"` outputs `["This is a sentence."]`. This may or may not be the behaviour you need. – Rob MacEachern Dec 19 '19 at 21:57
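
The edge case in the last comment can be handled by appending whatever text follows the final terminator. This is a minimal sketch, assuming Swift 4 or later on an Apple platform; `splitKeepingRemainder` is a hypothetical name that does not appear in the answer, and the scheme is passed as `NSLinguisticTagScheme.lexicalClass.rawValue`, as matt's comment suggests.

```swift
import Foundation

/// Hypothetical helper (not from the answer): the same tag-based splitting,
/// plus a final step that keeps any text after the last sentence terminator.
func splitKeepingRemainder(_ s: String) -> [String] {
    var ranges = [Range<String.Index>]()
    let tags = s.linguisticTags(
        in: s.startIndex..<s.endIndex,
        scheme: NSLinguisticTagScheme.lexicalClass.rawValue,
        tokenRanges: &ranges)

    var result = [String]()
    var prev = s.startIndex
    for (i, tag) in tags.enumerated() where tag == "SentenceTerminator" {
        // Cut just past the terminator token and start the next sentence there.
        let end = ranges[i].upperBound
        result.append(s[prev..<end].trimmingCharacters(in: .whitespacesAndNewlines))
        prev = end
    }

    // Anything left after the last terminator is kept as a final, unterminated sentence.
    let rest = s[prev...].trimmingCharacters(in: .whitespacesAndNewlines)
    if !rest.isEmpty {
        result.append(rest)
    }
    return result
}
```

With the input from the comment, the trailing "This is another sentence with no terminator" should now be kept rather than dropped.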
2

NSLinguisticTagger is deprecated. Use NLTagger instead (iOS 12.0+, macOS 10.14+).

import NaturalLanguage

var str = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."

func splitSentenceFrom(text: String) -> [String] {
    var result: [String] = []
    let tagger = NLTagger(tagSchemes: [.lexicalClass])
    tagger.string = text
    tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .sentence, scheme: .lexicalClass) { (tag, tokenRange) -> Bool in
        result.append(String(text[tokenRange]))
        return true
    }
    return result
}

let sentences = splitSentenceFrom(text: str)

sentences.forEach {
    print($0)
}

Output:

I want to split a paragraph into sentences. 
But, there is a problem. 
My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. 
How do i split this.

To exclude empty sentences and trim whitespace, replace the `result.append(...)` line inside the closure with this:

let sentence = String(text[tokenRange]).trimmingCharacters(in: .whitespacesAndNewlines)
if !sentence.isEmpty {
    result.append(sentence)
}
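
For reference, here is how the function might look with the trimming step folded in. This is only a sketch under the same assumptions as the answer (iOS 12.0+ or macOS 10.14+); the name `splitTrimmedSentences(from:)` is hypothetical.

```swift
import Foundation
import NaturalLanguage

/// Hypothetical name: the answer's function with the trimming snippet folded in.
func splitTrimmedSentences(from text: String) -> [String] {
    var result: [String] = []
    let tagger = NLTagger(tagSchemes: [.lexicalClass])
    tagger.string = text
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .sentence,
                         scheme: .lexicalClass) { _, tokenRange in
        // Trim the trailing space that each sentence range carries, and skip blanks.
        let sentence = text[tokenRange].trimmingCharacters(in: .whitespacesAndNewlines)
        if !sentence.isEmpty {
            result.append(sentence)
        }
        return true  // keep enumerating
    }
    return result
}
```

The tag itself is ignored here; only the sentence-sized `tokenRange` values are used.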
wye
1

Here is matt's answer in Swift 4:

 func splitsentance(string: String) -> [String]{
    let s = string
    var r = [Range<String.Index>]()
    let t = s.linguisticTags(
        in: s.startIndex..<s.endIndex,
        scheme: NSLinguisticTagScheme.lexicalClass.rawValue,
        options: [], tokenRanges: &r)
    var result = [String]()

    let ixs = t.enumerated().filter{
         $0.1 == "SentenceTerminator"
    }.map {r[$0.0].lowerBound}
    var prev = s.startIndex
    for ix in ixs {
        let r = prev...ix
        result.append(
            s[r].trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
        prev = s.index(after: ix)  // start the next sentence after the terminator
    }
    return result
}
Afsar edrisy
0

This is a rough version of what I believe you were looking for. I am running a loop through the characters, looking for the combination ". " (a period followed by a space).

As the loop runs, each character is added to currentSentence (a String?). When the combination is found, currentSentence is appended to sentences and sentenceNumber is incremented.

In addition, two special cases have to be handled. The first is when the loop is on iteration 2 (index 1): period starts at 0, so period == index-1 holds there even though no period has been seen. The second is the last sentence, as there is no space after its period.

var paragraph = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do I split this."

var sentences = [String]()
var sentenceNumber = 0
var currentSentence: String? = ""

let charArray = Array(paragraph)  // Swift 3+; Swift 2 used paragraph.characters
var period = 0

// Swift 3+ spells this enumerated(); in Swift 2 it was enumerate().
for (index, char) in charArray.enumerated() {
    currentSentence! += "\(char)"
    if (char == ".") {
        period = index

        if (period == charArray.count-1) {
            sentences.append(currentSentence!)
        }
    } else if ((char == " " && period == index-1 && index != 1) || period == (charArray.count-1)) {

        sentences.append(currentSentence!)
        print(period)
        currentSentence = ""
        sentenceNumber += 1  // the ++ operator was removed in Swift 3
    }
}
  • Welcome to Stack Overflow! Please consider editing your answer to include an explanation of how your code works. – Matt Aug 25 '15 at 15:18
  • Thanks, Happy to be here. My apologies for not entering a description. When I entered the code, I entered a description. What happens with that description? Or is it meant for something else? – GoGoGreenGiant Aug 26 '15 at 02:30
  • Can you tell what is "enumerate" in this line... for (index, char) in charArray.enumerate() It is a function but what is it doing... – Uzair Aug 28 '15 at 06:11
  • .enumerate() yields an (index, character) pair for each element of charArray on every iteration of the loop. These values are bound to index and char, respectively. – GoGoGreenGiant Aug 29 '15 at 16:43
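
To illustrate the explanation in the last comment, here is a tiny sketch of what the loop receives on each iteration. It uses the Swift 3+ spelling `enumerated()` (the method was called `enumerate()` in Swift 2); the names `letters` and `pairs` are made up for this example.

```swift
// A three-character sample, just to show the (index, character) pairs.
let letters = Array("a.b")

// enumerated() pairs each element with its zero-based offset.
let pairs = letters.enumerated().map { (index: $0.offset, char: $0.element) }

for (index, char) in letters.enumerated() {
    print(index, char)
}
// prints:
// 0 a
// 1 .
// 2 b
```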
0

Enumerating by linguistic tags feels like an efficient way of handling this task. It avoids the overhead of storing superfluous strings.

var paragraph = """
    I want to split a paragraph into sentences. But, there is a problem.
    My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. And emojis like 👨‍👩‍👧‍👦! How do I split this?
"""
var sentences = [String]()
var wordsInSentences = [(sentence: String, words: [String])]()
private var currentSentence = ""
private var wordsInCurrentSentence = [String]()
paragraph.enumerateLinguisticTags(in: paragraph.startIndex...,
                                  scheme: NSLinguisticTagScheme.nameTypeOrLexicalClass.rawValue,
                                  options: [.omitWhitespace, .omitPunctuation],
                                 invoking: { (tag, wordRange, sentenceRange, stop) in
                                    let word = String(paragraph[wordRange])
                                    let sentence = String(paragraph[sentenceRange])
                                    if currentSentence != sentence {
                                        wordsInSentences.append((currentSentence, wordsInCurrentSentence))
                                        currentSentence = sentence
                                        wordsInCurrentSentence = [word]
                                    } else {
                                        wordsInCurrentSentence.append(word)
                                    }
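})
// The closure only stores a sentence when the next one begins, so store the last one here.
wordsInSentences.append((currentSentence, wordsInCurrentSentence))
// Drop the empty placeholder entry that was stored before the first sentence.
wordsInSentences.removeFirst()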
print(wordsInSentences)
// If you don't want emojis as words add `.omitOther` to the option set
// `options: [.omitWhitespace, .omitPunctuation, .omitOther]`
Paul B
0
func splitSentencesIn(_ string: String) {
    
    var sentences = [String]()
    var unknowns = [String]()
    
    string.enumerateSubstrings(in: string.startIndex ..< string.endIndex,
                               options: .bySentences) { string, _, _, _ in
        if let sentence = string?.trimmingCharacters(in: .whitespacesAndNewlines), let lastCharacter = sentence.last {
            switch lastCharacter {
            case ".", "?", "!":
                sentences.append(sentence)
            default:
                unknowns.append(sentence)
            }
        }
    }
    
    print("sentences:  ")
    for sentence in sentences {
        print("    \(sentence)")
    }
    print("unknown: ")
    for unknown in unknowns {
        print("    \(unknown)")
    }

}

splitSentencesIn("so this~ some thing! how about this: as story; no idea. Let's go!")
splitSentencesIn("My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. so this~ some thing! how about this: as story; no idea. Let's go! Who tree")
splitSentencesIn("look out")

Printed output:

sentences:  
    so this~ some thing!
    how about this: as story; no idea.
    Let's go!
unknown: 
sentences:  
    My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. so this~ some thing!
    how about this: as story; no idea.
    Let's go!
unknown: 
    Who tree
sentences:  
unknown: 
    look out
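
If you need the sentences as an array rather than printed, the same enumeration can collect them. This is a sketch only: `sentences(in:)` is a hypothetical name, and unlike the function above it keeps fragments that lack terminal punctuation instead of sorting them into a separate list.

```swift
import Foundation

/// Hypothetical variant: returns the trimmed sentences instead of printing them.
func sentences(in text: String) -> [String] {
    var result = [String]()
    text.enumerateSubstrings(in: text.startIndex..<text.endIndex,
                             options: .bySentences) { substring, _, _, _ in
        // Each substring may carry trailing whitespace, so trim it and skip blanks.
        guard let sentence = substring?.trimmingCharacters(in: .whitespacesAndNewlines),
              !sentence.isEmpty else { return }
        result.append(sentence)
    }
    return result
}

let parts = sentences(in: "so this~ some thing! how about this: as story; no idea. Let's go!")
```

As the printed output above shows, a period followed by a lowercase word (as in "2.2. so this~") may not be treated as a sentence boundary.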

I was inspired by another question and answer: Function that separates sentences and questions in swift

Cable W