I have a document containing special characters such as the non-breaking space and the non-breaking hyphen. I want to normalize this document by replacing these special characters with a regular space. In addition, since the content of this document was gathered from different sources, it contains different forms of "Yeh" (ی), and I want to normalize those as well.

Is it possible to find and replace Unicode characters in a document using the sed command? Can I use Unicode code points instead of the literal form of the character? For example, can I use x00a0 instead of a non-breaking space in a sed command? How?


Sorry for the poor explanation. My documents are encoded in UTF-8 and contain non-English characters. For example, I have one document in Arabic, one in Urdu, and one in Persian (Farsi). I want to replace some of the characters in these files with other characters. By normalizing, I mean that I want to replace all forms of "Yeh" with a single form. (As you might know, there are many forms of this character in use in Arabic script, but for simplicity and to avoid some processing issues I want to unify all of them.)

ThiefMaster
Hakim
  • You could, using GNU sed, but you should consider using `ed` or `ex` for modifying files. Also, I don't know what you mean by "normalize". – ormaaj Jun 30 '12 at 07:45
  • see http://stackoverflow.com/questions/8562354/remove-unicode-characters-from-textfiles-sed-other-bash-shell-methods, this can also be done in perl – Nahuel Fouilleul Jun 30 '12 at 08:36

2 Answers

To process UTF-8 files, you have to parse each character from beginning to end. If you need to do it efficiently, you should write a real program rather than trying to script a solution.

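If you just want to script it, it is easier to convert the text to UTF-16 first and then process the characters, because every character in the Basic Multilingual Plane (which includes the Arabic block) is then a single fixed-width 16-bit unit.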

A fairly inefficient way would be:

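#!/bin/bash
# Print the bytes encoded by a string of hex digits (two digits per byte).
function px {
 local a="$*"
 local i=0
 while [ "$i" -lt "${#a}" ]
  do
   printf "\\x${a:$i:2}"
   i=$((i+2))
  done
}
# Dump the input as one 16-bit hex word per line. -v keeps od from collapsing
# repeated lines into "*". Assumes od's words equal the UTF-16 code units
# (true when iconv writes UTF-16 in the host byte order, as glibc does).
(iconv -f UTF8 -t UTF16 | od -v -x | cut -b 9- | xargs -n 1) |
if read -r utf16header
then
 px "$utf16header"   # re-emit the byte-order mark
 out=''
 while read -r line
  do
   if [ "$line" == "000a" ]
    then
     # end of line: flush the accumulated code units
     out=$out$line
     px "$out"
     out=''
    else
     # put your conversion logic here.
     # e.g.
     # if [ "$line" == "0031" ] ;  then
     #    line="0041"
     # fi
     out=$out$line
   fi
  done
 px "$out"   # flush a final line that has no trailing newline
fi | iconv -f UTF16 -t UTF8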
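
As a hedged illustration of what the conversion-logic placeholder in the script above might contain for the question's case, the sketch below maps one 16-bit hex word to another. The function name map_unit and the chosen mappings (non-breaking space U+00A0 and non-breaking hyphen U+2011 to a space, Arabic Yeh U+064A and Alef Maksura U+0649 to Farsi Yeh U+06CC) are assumptions made for illustration; they are not part of the original answer, and the right set of "Yeh" variants depends on your data.

```shell
#!/bin/bash
# Hypothetical helper: takes one UTF-16 code unit as four lowercase hex
# digits (the form od -x prints) and echoes its normalized replacement.
map_unit() {
  case "$1" in
    00a0|2011) echo 0020 ;;   # NBSP, non-breaking hyphen -> space
    064a|0649) echo 06cc ;;   # Arabic Yeh, Alef Maksura -> Farsi Yeh (assumed target)
    *)         echo "$1" ;;   # everything else passes through unchanged
  esac
}
```

Inside the loop it could be called as `line=$(map_unit "$line")` just before `out=$out$line`. Each call runs in a subshell, so this makes an already slow script slower.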
pizza

This might work for you (GNU sed):

echo abcd | sed 'p;y/\x61\x62\x63/ABC/'
abcd
ABCd
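
The example above uses single-byte escapes. GNU sed's \xHH escape denotes one byte, so a non-ASCII character has to be written as its UTF-8 byte sequence (for example \xc2\xa0 for U+00A0) and replaced with s rather than y. The following is a minimal sketch under those assumptions: the function name normalize_text and the choice of Farsi Yeh (U+06CC) as the target form are illustrative, and LC_ALL=C is set so that sed matches raw bytes regardless of the current locale.

```shell
#!/bin/bash
# Hypothetical helper: reads UTF-8 text on stdin, writes normalized text to stdout.
# Requires GNU sed (for the \xHH escapes in both pattern and replacement).
normalize_text() {
  # U+00A0 non-breaking space   = c2 a0     -> space
  # U+2011 non-breaking hyphen  = e2 80 91  -> space
  # U+064A Arabic Yeh           = d9 8a     -> U+06CC Farsi Yeh (db 8c)
  # U+0649 Alef Maksura         = d9 89     -> U+06CC Farsi Yeh (db 8c)
  LC_ALL=C sed \
    -e 's/\xc2\xa0/ /g' \
    -e 's/\xe2\x80\x91/ /g' \
    -e 's/\xd9\x8a/\xdb\x8c/g' \
    -e 's/\xd9\x89/\xdb\x8c/g'
}
```

It could be used as `normalize_text < input.txt > output.txt`, where the file names are placeholders. Byte-wise matching is safe here because UTF-8 lead bytes never occur as continuation bytes, so a pattern cannot match across a character boundary.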
potong