String string = "రాజులకు రాజైన యీ మన విభుని పూజ సేయుటకు రండి"; // 28 characters expected, counting spaces

I need to traverse these letters one at a time, without breaking any letter apart.

Instead I get the output below, because I traverse the string with the charAt function, which separates each vowel sign from its base letter: ర ర ా జ ు ల క ు ర ా జ ై న య ీ మ న వ ి భ ు న ి ప ూ జ స ే య ు ట క ు ర ం డ ి

Any help appreciated.

Anand Kiran
  • Use `string.codePoints()`/`string.codePointAt()`. – Andy Turner Feb 16 '22 at 14:15
  • `BreakIterator` is the general way to iterate over chunks of strings in various Unicode-specified units. Which one exactly you need I can't say (I don't know Telugu and don't know your precise requirements). – Joachim Sauer Feb 16 '22 at 14:15
  • @AndyTurner How do I use codePoints? It returns 43 for this string when there are only 28 characters. How should I interpret this? – Anand Kiran Feb 16 '22 at 14:30
  • [This answer](https://stackoverflow.com/a/15949292/2985643) to the question [Java Unicode String length](https://stackoverflow.com/q/15947992/2985643) provides Java code which uses a regex to extract each letter from a String. It was written for processing Tamil, but it works fine for your Telugu string as well. I just ran it, and it extracted 21 Telugu letters `రా, జు, ల, కు, రా, జై, న, యీ, మ, న, వి, భు, ని, పూ, జ, సే, యు, ట, కు, రం, డి` (plus 7 spaces = 28, which is what you expect). – skomisa Feb 16 '22 at 15:31
  • For instance the first _Telugu "letter"_ `రా` (user-perceived) is `ర` (U+0C30, *Telugu Letter Ra*) plus `ా` (U+0C3E, *Telugu Vowel Sign Aa*), i.e. _two_ code points `\u0C30\u0C3E`. Read more at [UNICODE TEXT SEGMENTATION](https://www.unicode.org/reports/tr29/). – JosefZ Feb 16 '22 at 18:04
  • @skomisa This works and I can see the letters in the console, but when I put the same text into a Word document, the rendering is somehow lost. Could you help? `List characters = new ArrayList(); Pattern pat = Pattern.compile("\\p{L}\\p{M}*"); Matcher matcher = pat.matcher(arr[i]); while (matcher.find()) { characters.add(matcher.group()); } for (Iterator iterator = characters.iterator(); iterator.hasNext();) { String charat = (String) iterator.next(); tableRowBlank.addNewTableCell().setText(" "); tableRowOne.addNewTableCell().setText(" "+charat+" "); }` – Anand Kiran Feb 17 '22 at 01:12
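
The 43 versus 28 discrepancy raised in the comments can be reproduced with a short sketch. `codePoints()` counts every base letter and every vowel sign separately, so one way to read its result is to attach each mark to the code point before it. The class name `CodePointDemo` and the helpers `isMark` and `groupMarks` are hypothetical names chosen for this illustration; the grouping rule (a base code point plus any following Unicode marks) is a simplifying assumption rather than full Unicode text segmentation.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDemo {

    // True for the three Unicode "Mark" general categories (Mn, Mc, Me).
    static boolean isMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Groups each non-mark code point with the marks that directly follow it.
    static List<String> groupMarks(String text) {
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int cp : text.codePoints().toArray()) {
            // A non-mark code point starts a new unit, so flush the previous one.
            if (!isMark(cp) && current.length() > 0) {
                units.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
        }
        if (current.length() > 0) {
            units.add(current.toString());
        }
        return units;
    }

    public static void main(String[] args) {
        String string = "రాజులకు రాజైన యీ మన విభుని పూజ సేయుటకు రండి";
        // Counts base letters, vowel signs, and spaces individually.
        System.out.println(string.codePoints().count());
        // Counts base letters together with their marks; spaces are units too.
        List<String> units = groupMarks(string);
        System.out.println(units.size());
        System.out.println(units);
    }
}
```

For the string in the question, the first count covers 36 Telugu code points plus 7 spaces, and the second covers the 21 letters plus 7 spaces mentioned above.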

1 Answer

[This accepted answer](https://stackoverflow.com/a/15949292/2985643) by halex to the question [Java Unicode String length](https://stackoverflow.com/q/15947992/2985643) provides Java code which uses a regex to extract each letter from a String:

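import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LetterSplit {  // class name is arbitrary
  public static void main(String[] args) {
    String s = "ஈஉஐ రాజైన";  // Tamil/Telugu text input
    List<String> characters = new ArrayList<String>();
    // Either the Tamil conjunct க்ஷ, or any letter, each with an optional combining mark
    Pattern pat = Pattern.compile("\u0B95\u0BCD\u0BB7\\p{M}?|\\p{L}\\p{M}?");
    Matcher matcher = pat.matcher(s);
    while (matcher.find()) {
      characters.add(matcher.group());
    }

    System.out.println(characters);
    System.out.println(characters.size());  // Length
  }
}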

It was written for processing Tamil, but it works fine for your Telugu string as well. I just ran it, and it extracted 21 Telugu letters రా, జు, ల, కు, రా, జై, న, యీ, మ, న, వి, భు, ని, పూ, జ, సే, యు, ట, కు, రం, డి (plus 7 spaces = 28, which is what you expect).

(skomisa's comment to the question, 2022-02-16 15:31:00Z) This solves it.

AmigoJack
Anand Kiran
  • [1] You should post the code you ran in your answer, with your own explanation. As it stands your answer is not very helpful to future readers. [2] Since your code was primarily from an answer to another question, it is **very important** that you also include a link to it, and credit the original author. [3] To also extract spaces from the Telugu string, just change the regex passed into `Pattern.compile()` to `"\\x20|\\p{L}\\p{M}*"`, where `x20` is the space character, and `|` (vertical bar) is the OR operator. But there are many ways to do this, and there may be better approaches. – skomisa Feb 17 '22 at 04:42
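
The regex change suggested in point [3] could look like the sketch below, applied to the string from the question. The class name `TeluguLetters` and the method `split` are hypothetical names used only for this illustration; the pattern itself is the one quoted in the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TeluguLetters {

    // A single space, or a letter followed by any number of combining marks.
    private static final Pattern UNIT = Pattern.compile("\\x20|\\p{L}\\p{M}*");

    // Returns every match of UNIT in order; anything the pattern does not match is skipped.
    static List<String> split(String text) {
        List<String> units = new ArrayList<>();
        Matcher matcher = UNIT.matcher(text);
        while (matcher.find()) {
            units.add(matcher.group());
        }
        return units;
    }

    public static void main(String[] args) {
        String string = "రాజులకు రాజైన యీ మన విభుని పూజ సేయుటకు రండి";
        List<String> units = split(string);
        System.out.println(units);
        System.out.println(units.size()); // letters plus spaces
    }
}
```

One limitation to keep in mind: this pattern does not keep conjuncts together. A consonant followed by a virama (U+0C4D) and another consonant is returned as two units, which is why the Tamil version above lists the conjunct க்ஷ explicitly.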