Need to split Unicode string

Question

I am using the Moses toolkit for my translation system. I trained it on an Assamese-English parallel corpus, but some proper nouns are not translated because the corpus (parallel data set) is very small. So I want to add a transliteration step to my translation system.

I am using this command for my translation: echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini

This gave me the output "কানাদা is a vast country".

This is because the word "কানাদা" is not in my parallel corpus.

So I took a parallel list of words in Assamese and English and broke each word up character-wise. Each line of the two files thus holds a single word with a space between each character (or each syllable). I used these two files to train a second system as a normal translation task.

Then I used the following command: echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl

This gave me the output "ক া ন া দ া is a vast country"

I had to split the word because the transliteration system was trained character-wise.

Then I ran the transliteration system that I had trained, using the command:

echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini

This gave me the output "c a n a d a is a vast country"

The characters are transliterated, but the remaining problem is the spaces between the letters of the word. So I want a Perl script that joins the word back together. My final command will be:

echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl

Please help me write this "join.pl" file.

It might help if you told us _why_ you want to split the Assamese words. I suspect you may have an [X/Y problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) here. In one of the comments below, you mention that you want to transliterate the characters: if so, you're probably better off asking (in a separate question) for a way to do _that_. — Ilmari Karonen, Dec 24 '13 at 16:31

Toto · Accepted Answer · 2013-12-24T17:57:41.567

4

How about:

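use utf8;
use feature 'say';                    # needed for say
binmode(STDOUT, ':encoding(UTF-8)');  # print characters as UTF-8
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
# insert a space after each Bengali-block character that is followed by another one
$str =~ s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
say $str;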

output:

ভ া ৰ ত is a famous country. দ ি ল ্ ল ী is the capital of ভ া ৰ ত

You can use it in your program, provided STDIN and STDOUT are set to UTF-8 with binmode; just change the while loop to:

while(<>) {
    s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
    print $_;
}

But I think you wish to do:

my %corresp = (
    'ভ' => 'Bh',
    'া' => 'a',
    'ৰ' => 'ra',
    'ত' => 't',
);
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
$str =~ s/([\x{0980}-\x{09FF}])/exists($corresp{$1}) ? $corresp{$1} : $1/eg;
say $str;

Output:

Bharat is a famous country. দিল্লী is the capital of Bharat

NB: It's up to you to build the complete correspondence hash. I don't know anything about Assamese characters.
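If some multi-character sequences (conjuncts) need their own mapping, one possible approach is to build a single alternation from the hash keys and try longer keys first. The following is only a sketch under assumptions: the conjunct entry and its Latin value are hypothetical, and the names `$pattern` and `transliterate` are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                               # this source file is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');    # encode characters on output

# Illustrative table only: the first four entries come from the example
# above; the conjunct entry and its Latin value are hypothetical.
my %corresp = (
    'ভ' => 'Bh',
    'া' => 'a',
    'ৰ' => 'ra',
    'ত' => 't',
    "\x{09B2}\x{09CD}\x{09B2}" => 'll',  # hypothetical: LA + virama + LA
);

# Try longer keys first so a conjunct wins over its component letters.
my $alternatives = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %corresp;
my $pattern = qr/($alternatives)/;

# Replace every mapped sequence; unmapped characters are left unchanged.
sub transliterate {
    my ($text) = @_;
    $text =~ s/$pattern/$corresp{$1}/g;
    return $text;
}

print transliterate("ভাৰত is a famous country"), "\n";
# prints: Bharat is a famous country
```

To run this on arbitrary input instead of a fixed string, call transliterate on each line read in a while (<STDIN>) loop, after setting binmode(STDIN, ':encoding(UTF-8)').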

edited Dec 24 '13 at 17:57

answered Dec 20 '13 at 15:39

Toto


It worked..but i want for a an arbitrary string..Please help..and after translating i get "bh a r a t is a famous country". i want to rejoin the characters that were split i.e. i want the output as " bharat is a famous country".. please help me..thanks in advance – user3064729 Dec 23 '13 at 13:08
@user3064729:I can't do the translation, but the output is what you want, isn't it? – Toto Dec 24 '13 at 14:11
i want to use the perl script as an executable file...so i have to take the sentence from any text file and execute from the terminal...so i can't input the sentence in the program like this "my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত"; "... – user3064729 Dec 24 '13 at 15:17
it worked..this is what i wanted..can you help me with another program..i now want a program that will join only the english words that were split. for eg. " d i l l i is the capital of bh a r a t"..i should have the output " dilli is the capital of bharat". – user3064729 Dec 24 '13 at 16:54
@user3064729: You can't, because you don't know whether a space separates translated letters or English words. I wonder why you want to put spaces between Assamese letters and then remove them. Why not translate directly? – Toto Dec 24 '13 at 17:08
i am building a machine translation system.. for that i have to break the characters in assamese, then translate(actually it is transliteration) in english and then rejoin the characters..you don't have to modify the above program... i just want a separate program that will join the english characters..is it possible ?? kindly help..if it is done my problem will be solved – user3064729 Dec 24 '13 at 17:17
@user3064729: I think a better way is to take each character and transliterate it directly. There is no need to add spaces and then remove them afterwards. I guess you have a correspondence hash linking Assamese characters to Latin characters. See my edit. – Toto Dec 24 '13 at 17:47
You are right..but i cannot use "my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";"..it has to work for all sentences – user3064729 Dec 24 '13 at 18:14
@user3064729: Of course, do it in a while loop as mentioned before. – Toto Dec 24 '13 at 18:16
it gave me the output..but for every different sentence i have to change the program..that would be cumbersome.. – user3064729 Dec 25 '13 at 03:45
hi..i am implementing your suggestion..and it worked great..but some conjunct letters such as 'ঙ্খ' are not transliterated. can you suggest any idea.. – user3064729 Dec 25 '13 at 16:08

score 4 · Answer 2 · answered Dec 24 '13 at 15:47

You can use \p{...} and \P{...}, which match or exclude the character classes listed in perluniprops.

I'm using the negated class [^\p{Latin}\s], which selects characters that are neither Latin nor whitespace, so spaces are not matched:

#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);

use utf8;                  # the source code itself is UTF-8
binmode(STDOUT, ':utf8');  # needed as well: "use utf8" does not set the output encoding

my $string = "ভাৰত is a famous country";
$string =~ s/([^\p{Latin}\s])/$1 /g;  # Put a space after all non-latin chars
say $string;

This will print out:

ভ া ৰ ত  is a famous country

The only problem is that double space after ত.
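One way to avoid that trailing space might be to insert a space only when the next character also belongs to the script being split, using a look-ahead. The sketch below assumes the text to split is in the Bengali script (which Assamese uses) and matches it with \p{Bengali}; the function name `split_bengali` is made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                               # the source file itself is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');    # encode characters on output

# Insert a space only between two adjacent Bengali-script characters,
# so no extra space is added after the last character of a word.
sub split_bengali {
    my ($text) = @_;
    $text =~ s/(\p{Bengali})(?=\p{Bengali})/$1 /g;
    return $text;
}

print split_bengali("ভাৰত is a famous country"), "\n";
```

Because the class is positive rather than negated, punctuation and digits in the English part of the sentence are left alone as well.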

score 1 · Answer 3 · answered Dec 19 '13 at 17:20

It's doing exactly what you tell it to: @a=split('') splits the entire line, because nothing tells it to split only the first word. You first need to identify the substring you want to split, and then split it:

#!/usr/bin/perl
use utf8;
use Getopt::Std;
use IO::Handle;

binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

while(<>)
{
    chomp;
    ## find the first word, capture it as $1 and delete it from the line
    s/(.+?)\s//;
    @a=split('',$1);
    ## Print your joined string and the rest of the line
    print join(" ",@a) . " $_\n";
}

answered Dec 19 '13 at 17:20

terdon


Actually i am using this for building a machine translation system...i am now getting the output "bh a r a t is a famous country". Can you please help me to get the output like this : "bharat is famous country". Actually after splitting the word, it translates it and then again i have to join the letters. Please help – user3064729 Dec 23 '13 at 11:56
@user3064729 I have no idea how you are implementing this or how your translation works, how could I help you? You asked how to split the first word of a line and my answer tells you. If you have another question, please post a separate question and explain what exactly you are trying to do and how it fails. – terdon Dec 23 '13 at 18:10
My work is not to split the first word. Actually the word which is not in English may be anywhere in the sentence. Its not necessary that it is in the first position and it may be more than one word that i need to split.For eg.the sentence might be "দিল্লী is the capital of ভাৰত" Can you help me with a program that will split only the unicode strings placed anywhere in the sentence. – user3064729 Dec 24 '13 at 02:46
Latin _is_ unicode. Do you mean the non-latin characters? Please edit your question to add a more complete example. – terdon Dec 24 '13 at 10:23
Yes, i want to split the characters of the words which are not in English. Here, my sentence consists of Assamese(an Indian Language similar to Bengali language) and English words. I want to split the Assamese words. For eg i have a sentence "দিল্লী is the capital of ভাৰত", i want an output like this "দ ি ল্ ল ী is the capital of ভ া ৰ ত". That is, only the words which are in Assamese are split, and the English words remain as it is. – user3064729 Dec 24 '13 at 11:35
@user3064729 I ask again, please EDIT your question and add the extra info there. It is hard to read and easy to miss in the comments. – terdon Dec 24 '13 at 12:12

score 0 · Answer 4 · answered Dec 20 '13 at 14:53

Add something like

$str =~ s/(\w) (?=[\w.,;:!?])/$1/g;

which is intended to remove the space between Latin word characters, using a look-ahead. It is not 100% reliable: it also removes the spaces between ordinary words.
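A more conservative heuristic for a join.pl might be to glue together only runs of one-letter tokens and leave longer tokens alone. This is a sketch under that assumption, not a complete solution: the name `join_single_letters` is made up, a multi-letter piece such as "bh" stays separate, and a real one-letter word directly next to a split word would be absorbed into it.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Heuristic: glue each maximal run of one-letter ASCII tokens into one word.
sub join_single_letters {
    my ($line) = @_;
    my (@out, @run);

    # Emit the letters collected so far as a single word.
    my $flush = sub {
        push @out, join('', @run) if @run;
        @run = ();
    };

    for my $tok (split ' ', $line) {
        if ($tok =~ /^[A-Za-z]$/) {
            push @run, $tok;        # single letter: extend the current run
        } else {
            $flush->();             # longer token: close the run first
            push @out, $tok;
        }
    }
    $flush->();
    return join ' ', @out;
}

print join_single_letters("c a n a d a is a vast country"), "\n";
# prints: canada is a vast country
```

To use it as the last stage of the pipeline, apply join_single_letters to each chomped line read in a while (<STDIN>) loop. A more robust design would mark word boundaries before splitting, so that joining does not have to guess.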

answered Dec 20 '13 at 14:53

Joop Eggen

