Split long string with spaces but without punctuation

Question

I have a long string that I need to split on spaces, so I did this on iOS:

let str = """
يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا ۚ وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ ۗ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا
"""
let count = str.components(separatedBy: " ").count
        
print(count) // 49

It gives 49, but the same thing in Kotlin gives 51:

val str = getString(R.string.valueHere)

val count = str.split(" ").count()

Log.d("count is " , count.toString()) // 51

with this string resource:

<string name="valueHere">يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا ۚ وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ ۗ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا</string>

I need the word count to be 49 on Android. It seems that Android counts the decorative characters that sit between spaces as separate words. How can I fix this and produce the same result in Kotlin?

Edit:

fun getColorRange(): Range<Int> { 
    
    val text =  // my long string here
    val all = text.split (" ")
    val sub = (wordFrom..wordTo).map { all[it] }.joinToString(" ")
    val lower = text.indexOf(sub)
    val upper = lower + sub.length
    return Range<Int>(lower, upper)
}

If the length of the array is different in Kotlin, sub will be a different substring.

Any help is greatly appreciated; I have been stuck on this problem for weeks. — sheko, Sep 24 '21 at 16:27
Are there any double spaces? One implementation could decide to put a "" element in between them and the other might not (I know Java would with a split). Look at the arrays of elements each puts out and find where the differences are, that would tell you the most. — Gabe Sechan, Sep 24 '21 at 16:40
@GabeSechan thanks for the reply. It seems that split in Kotlin also splits on characters I did not supply; I only supplied the space `" "`. How can I prevent this in Kotlin? Is there another way to make it split only on the space? Can we use a pattern or StringTokenizer? — sheko, Sep 24 '21 at 16:45
@GabeSechan as you can see in the code, there are no double spaces; it's the same string. — sheko, Sep 24 '21 at 16:47
it seems like for some reason kotlin thinks there are spaces between the decorations and the words, like you said — alonkh2, Sep 24 '21 at 17:18
After a bit of testing, it seems like there are actual spaces between the decorative characters and the words. — alonkh2, Sep 24 '21 at 17:32
A regex on checking for non-white spaces (\S) gave me 50 matches, meaning 51 enclosing. That being said, regex is interpreting ۚ as white space. I would ask, what is your need for this 49 as a solution, and we can likely figure out a solution that doesn't count on 49 necessarily. — Benjamin Charais, Sep 24 '21 at 17:47
In fact, Kotlin splits on spaces only, but there really are exactly 50 space chars (0x20) in this string. So the question isn't why Kotlin splits on them, but why Swift doesn't ;-) Maybe additional spaces were accidentally added during text copying for some reason? — broot, Sep 24 '21 at 18:12
@broot look here: https://stackoverflow.com/a/69300528/5820010. When someone posted an answer in Swift that included the punctuation characters, it gave 51, meaning that we can do it in Kotlin without including the punctuation characters, but I don't know how. — sheko, Sep 24 '21 at 18:31
@BenjaminCharais I need the same output as Swift. Let me clarify: I need to color some words, from one word index to another, with a different color, so I split the string into an array on spaces and then take the strings in that range to color them. The range was calculated beforehand against the 49-word split, so having 51 instead of 49 makes the coloring wrong. — sheko, Sep 24 '21 at 18:38
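The suggestion in the comments to look at the elements each split produces can be tried with a small Kotlin sketch like the one below. It is only an illustration: `describeTokens` is a hypothetical helper, and the sample string is a short stand-in (two ASCII words around a standalone U+06DA) rather than the full verse.

```kotlin
// Hypothetical helper: list each space-separated token together with its code points.
fun describeTokens(text: String): List<String> =
    text.split(" ").mapIndexed { index, token ->
        val codePoints = token.map {
            "U+" + it.code.toString(16).uppercase().padStart(4, '0')
        }
        "$index: $codePoints"
    }

fun main() {
    // Stand-in sample: a word, a standalone mark (U+06DA), another word.
    val sample = "ab \u06DA cd"
    describeTokens(sample).forEach(::println)
    // 0: [U+0061, U+0062]
    // 1: [U+06DA]
    // 2: [U+0063, U+0064]
    println("spaces = ${sample.count { it == ' ' }}") // spaces = 2
}
```

Running the same helper on the real string should show which elements consist of a single mark.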

Zain · Accepted Answer · 2021-09-25T00:05:43.797

Logging the split String shows where the issues are:

يَا
أَيُّهَا
الَّذِينَ
آمَنُوا
لَا
تَقْرَبُوا
الصَّلَاةَ
وَأَنْتُمْ
سُكَارَىٰ
حَتَّىٰ
تَعْلَمُوا
مَا
تَقُولُونَ
وَلَا
جُنُبًا
إِلَّا
عَابِرِي
سَبِيلٍ
حَتَّىٰ
تَغْتَسِلُوا
ۚ     >>>>>>>>>>>>>>>>>>>>> Problem here
وَإِنْ
كُنْتُمْ
مَرْضَىٰ
أَوْ
عَلَىٰ
سَفَرٍ
أَوْ
جَاءَ
أَحَدٌ
مِنْكُمْ
مِنَ
الْغَائِطِ
أَوْ
لَامَسْتُمُ
النِّسَاءَ
فَلَمْ
تَجِدُوا
مَاءً
فَتَيَمَّمُوا
صَعِيدًا
طَيِّبًا
فَامْسَحُوا
بِوُجُوهِكُمْ
وَأَيْدِيكُمْ
ۗ    >>>>>>>>>>>>>>>>>>>>> Problem here
إِنَّ
اللَّهَ
كَانَ
عَفُوًّا
غَفُورًا

So, apparently, the problem is the upper diacritics (or annotation marks, to be precise) like ۚ or ۗ: each one stands between two spaces, so Kotlin returns it as an element of its own.

I believe that the Kotlin version is more accurate than the Swift one, because what you need is:

Separate this String on SPACE as a delimiter (FULL STOP)

Swift, by contrast, does not produce a separate element for the upper diacritics/markers, so they are not counted when the string is split. There is probably another Swift function that does split on them; I am not sure, and it is not part of your question.

And since you have two of those markers, the Kotlin count is higher than the Swift one by two (i.e. 51 instead of 49).
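A possible explanation for the difference, offered as an assumption rather than something confirmed in this thread: both markers are Unicode nonspacing marks, and Swift compares strings by grapheme cluster, so a space followed by such a mark may no longer match a plain " ". The Kotlin sketch below only checks the Kotlin side of that idea. `isMarkOnly` is a hypothetical helper, and the sample is a shortened stand-in for the verse.

```kotlin
// Hypothetical helper: true when a token consists only of nonspacing marks.
fun isMarkOnly(token: String): Boolean =
    token.isNotEmpty() && token.all { it.category == CharCategory.NON_SPACING_MARK }

fun main() {
    // Stand-in sample: word, standalone U+06DA, word, standalone U+06D7, word.
    val sample = "ab \u06DA cd \u06D7 ef"
    val tokens = sample.split(" ")
    println(tokens.size)                       // 5 (every element, marks included)
    println(tokens.count { !isMarkOnly(it) })  // 3 (elements that are not a lone mark)
}
```

On the full string, the same two counts would be expected to differ by two, matching 51 versus 49.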

So, the question would be: How to remove the upper diacritics/markers from a string before splitting it?

Thanks to this answer, which lists those types of markers, you can use the String replace() method in Kotlin to replace them with nothing.

Here is a snippet to fix your example:

var str = getString(R.string.valueHere)
str = str
    .replace("\u0615", "") //ARABIC SMALL HIGH TAH
    .replace("\u0616", "") //ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
    .replace("\u0617", "") //ARABIC SMALL HIGH ZAIN
    .replace("\u0618", "") //ARABIC SMALL FATHA
    .replace("\u0619", "") //ARABIC SMALL DAMMA
    .replace("\u061A", "") //ARABIC SMALL KASRA
    .replace("\u06D6", "") //ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
    .replace("\u06D7", "") //ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
    .replace("\u06D8", "") //ARABIC SMALL HIGH MEEM INITIAL FORM
    .replace("\u06D9", "") //ARABIC SMALL HIGH LAM ALEF
    .replace("\u06DA", "") //ARABIC SMALL HIGH JEEM
    .replace("\u06DB", "") //ARABIC SMALL HIGH THREE DOTS
    .replace("\u06DC", "") //ARABIC SMALL HIGH SEEN
    .replace("\u06DD", "") //ARABIC END OF AYAH
    .replace("\u06DE", "") //ARABIC START OF RUB EL HIZB
    .replace("\u06DF", "") //ARABIC SMALL HIGH ROUNDED ZERO
    .replace("\u06E0", "") //ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
    .replace("\u06E1", "") //ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
    .replace("\u06E2", "") //ARABIC SMALL HIGH MEEM ISOLATED FORM
    .replace("\u06E3", "") //ARABIC SMALL LOW SEEN
    .replace("\u06E4", "") //ARABIC SMALL HIGH MADDA
    .replace("\u06E5", "") //ARABIC SMALL WAW
    .replace("\u06E6", "") //ARABIC SMALL YEH
    .replace("\u06E7", "") //ARABIC SMALL HIGH YEH
    .replace("\u06E8", "") //ARABIC SMALL HIGH NOON
    .replace("\u06E9", "") //ARABIC PLACE OF SAJDAH
    .replace("\u06EA", "") //ARABIC EMPTY CENTRE LOW STOP
    .replace("\u06EB", "") //ARABIC EMPTY CENTRE HIGH STOP
    .replace("\u06EC", "") //ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
    .replace("\u06ED", "") //ARABIC SMALL LOW MEEM

// Drop the empty strings left where the markers used to be
val words = str.split(" ").filter { it.isNotBlank() }

val count = words.size
Log.d("count is ", "$count")
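The code points listed in the replace() chain form two contiguous ranges (U+0615 to U+061A and U+06D6 to U+06ED), so the same stripping can be sketched with a single regex character class. This is a compact alternative under that observation, not a change to the approach; `quranicMarks` and `countWords` are names made up for the example.

```kotlin
// The two contiguous ranges covered by the replace() chain above.
val quranicMarks = Regex("[\u0615-\u061A\u06D6-\u06ED]")

// Hypothetical helper: strip the marks, split on spaces, ignore empty elements.
fun countWords(text: String): Int =
    text.replace(quranicMarks, "")
        .split(" ")
        .count { it.isNotBlank() }

fun main() {
    // Stand-in sample with two standalone marks between three words.
    println(countWords("ab \u06DA cd \u06D7 ef")) // 3
}
```

Like the chain of replace() calls, this removes the marks from the string, so it suits counting but not cases where the original text must stay intact.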

This is the test verification result on a Kotlin compiler

UPDATE:

I have a long string in which I need to color a range with a different color inside a TextView, so I split it on spaces, take the needed words by lower and upper word index, then join them into one string to color their range inside the long string. The answer above did give 49, but it removed the important characters listed in the replace calls, so could you tweak your code to account for this?

So, if you follow the approach above, you just need to remove the blanks from the split String; for this you can use filter {} after replacing all the markers with empty strings:

fun getColorRange(input: String, wordFrom: Int, wordTo: Int): Range<Int> {
    val text = input
        .replace("\u0615", "") //ARABIC SMALL HIGH TAH
        .replace("\u0616", "") //ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
        .replace("\u0617", "") //ARABIC SMALL HIGH ZAIN
        .replace("\u0618", "") //ARABIC SMALL FATHA
        .replace("\u0619", "") //ARABIC SMALL DAMMA
        .replace("\u061A", "") //ARABIC SMALL KASRA
        .replace("\u06D6", "") //ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
        .replace("\u06D7", "") //ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
        .replace("\u06D8", "") //ARABIC SMALL HIGH MEEM INITIAL FORM
        .replace("\u06D9", "") //ARABIC SMALL HIGH LAM ALEF
        .replace("\u06DA", "") //ARABIC SMALL HIGH JEEM
        .replace("\u06DB", "") //ARABIC SMALL HIGH THREE DOTS
        .replace("\u06DC", "") //ARABIC SMALL HIGH SEEN
        .replace("\u06DD", "") //ARABIC END OF AYAH
        .replace("\u06DE", "") //ARABIC START OF RUB EL HIZB
        .replace("\u06DF", "") //ARABIC SMALL HIGH ROUNDED ZERO
        .replace("\u06E0", "") //ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
        .replace("\u06E1", "") //ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
        .replace("\u06E2", "") //ARABIC SMALL HIGH MEEM ISOLATED FORM
        .replace("\u06E3", "") //ARABIC SMALL LOW SEEN
        .replace("\u06E4", "") //ARABIC SMALL HIGH MADDA
        .replace("\u06E5", "") //ARABIC SMALL WAW
        .replace("\u06E6", "") //ARABIC SMALL YEH
        .replace("\u06E7", "") //ARABIC SMALL HIGH YEH
        .replace("\u06E8", "") //ARABIC SMALL HIGH NOON
        .replace("\u06E9", "") //ARABIC PLACE OF SAJDAH
        .replace("\u06EA", "") //ARABIC EMPTY CENTRE LOW STOP
        .replace("\u06EB", "") //ARABIC EMPTY CENTRE HIGH STOP
        .replace("\u06EC", "") //ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
        .replace("\u06ED", "") //ARABIC SMALL LOW MEEM

    val all = text.split(" ").filter { it.isNotBlank() } // Remove the blanks (i.e. the markers)
    val sub = (wordFrom..wordTo).map { all[it] }.joinToString(" ")

    Log.d("LOG_TAG", "getColorRange: $sub")
    val range = text.indexOf(sub[0], wordFrom)
    return Range<Int>(range, range + sub.length)
}

Sample usage:

getColorRange(str, 18, 22)

// Output:
//  حَتَّىٰ تَغْتَسِلُوا وَإِنْ كُنْتُمْ مَرْضَىٰ

getColorRange(str, 0, 48) // Should cover the entire string, since 48 is the index of the last of the 49 words

// Output:
// يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا

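Also notice that there is an issue with the range value. After the markers are removed, text contains two consecutive spaces wherever a marker used to be, while sub is joined with single spaces, so searching for the whole substring can return -1 when the range spans one of those places:

val range = text.indexOf(sub)

Instead, the function above looks up the first character of sub, starting from wordFrom rather than from the beginning of the string. Be aware that the second argument of indexOf is a character offset, not a word index, so this finds the right position only if that character does not occur between offset wordFrom and the start of the word:

val range = text.indexOf(sub[0], wordFrom)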
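If the markers must stay in the string, one alternative is to avoid searching altogether and take character offsets directly from the original text. The sketch below is an assumption-laden illustration, not the method used above: `wordRanges` and `colorRange` are hypothetical helpers, a token made only of nonspacing marks is attached to the word before it (which is assumed to reproduce the 49-word indexing), and a plain Pair stands in for android.util.Range so the example runs outside Android.

```kotlin
// Hypothetical helper: character range of each word in the ORIGINAL text.
// A token made only of nonspacing marks is attached to the preceding word.
fun wordRanges(text: String): List<IntRange> {
    val ranges = mutableListOf<IntRange>()
    for (match in Regex("[^ ]+").findAll(text)) {
        val markOnly = match.value.all { it.category == CharCategory.NON_SPACING_MARK }
        if (markOnly && ranges.isNotEmpty()) {
            // Extend the previous word so that it also covers the mark.
            ranges[ranges.lastIndex] = ranges.last().first..match.range.last
        } else {
            ranges += match.range
        }
    }
    return ranges
}

// Start (inclusive) and end (exclusive) offsets covering words wordFrom..wordTo.
fun colorRange(text: String, wordFrom: Int, wordTo: Int): Pair<Int, Int> {
    val ranges = wordRanges(text)
    return ranges[wordFrom].first to ranges[wordTo].last + 1
}

fun main() {
    // Stand-in sample: the standalone U+06DA is merged into the second word.
    val sample = "aa bb \u06DA cc"
    println(wordRanges(sample).size)   // 3
    println(colorRange(sample, 1, 2))  // (3, 10)
}
```

The two offsets can then be passed to whatever applies the color, for example as the lower and upper bounds of a Range<Int>.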

Hmm... did you verify that this solution works? I ask because I think the result does not really depend on any additional chars. Only the number of spaces matter and there are 50 spaces in this string, so even if we would remove everything else and only keep these 50 spaces, we would still get 51 as a result. — broot, Sep 24 '21 at 18:46
@broot yes tested it, please check that on the updated answer; couldn't paste it here as the link is long — Zain, Sep 24 '21 at 18:52
Ahh, ok, I missed the fact that you count only the non-blank items. — broot, Sep 24 '21 at 18:56
Hi @Zain, very grateful for the effort, but let me clarify my case. I have a long string in which I need to color a range with a different color inside a TextView, so I split it on spaces, take the needed words by lower and upper word index, then join them into one string to color their range inside the long string. The answer above did give 49, but it removed the important characters listed in the replace calls, so could you tweak your code to account for this? — sheko, Sep 24 '21 at 20:33
@sheko This means that you already know the lower and upper word indices within the long string right? — Zain, Sep 24 '21 at 20:59
@Zain yes, I know them. They are zero-based according to the Swift split, which is the correct one, not the Kotlin one. If they are 0 and 5, for example, they produce a different substring with the 49 elements in Swift than with the 51 in Kotlin. That is the problem: I need the split in Kotlin to behave the same as in Swift. — sheko, Sep 24 '21 at 22:54
I am confused a bit; so you have a string that has 49 words, is that the entire String? or that a part of it? .. And which part do you want to colorize in either case.. Can you give some example please as this is not illustrated more in the question — Zain, Sep 24 '21 at 22:59
OK, let's clear it up. Imagine I have `me is fruit this day or the other`, which is 8 words, and my word indices are 0 to 2; then the colored part will be `me is fruit`. This works in Swift. In Kotlin, the decorative characters may be counted too, so the string above might come out as 10 elements, and the coloring will be wrong, because a decorative character in the first part would give `me is [decorate]`. — sheko, Sep 24 '21 at 23:08
I have the word indices, but they follow the Swift counting mechanism, which ignores decorative characters, and unfortunately this doesn't happen in Kotlin. — sheko, Sep 24 '21 at 23:12
@sheko I got you now; Please have a look at the UPDATE section in the answer — Zain, Sep 24 '21 at 23:54
@Zain don't you find that the replaced characters will be lost from the string? — sheko, Sep 25 '21 at 10:19
@sheko right, because the original question was to count only the 49 words like Swift, and Kotlin does what is expected. Swift doesn't remove the markers either; it just silently discards them when counting. What if you wanted to include those markers in Swift? I think you'd need a different approach, because they are already dismissed. As this goes well beyond the original question, which is against SO guidelines, I'd suggest you open another question, and feel free to tag me if you get stuck. — Zain, Sep 25 '21 at 10:45
