
I need help with my schoolwork. I got this script (from Stack Overflow, of course), which capitalizes the first character of a string.

sed -r "s/(^|\.\s+)./\U&/g" <$temp_file_2

But the output is in ANSI encoding or something like it: file -bi reports the encoding as unknown-8bit.

Is there any way to get the output written to a file in UTF-8?

P.S.: This sed command is used to capitalize the first character of a line (with support for special Slovak characters like ščťžýáíéď, etc.).

P.S.: The file has to be UTF-8 because its content is inserted into a MySQL database. Converting the file loses information.

Tommy
  • `sed` cannot convert between character encodings; you'll need a separate program like `iconv` to convert the input file first. – chepner Mar 26 '14 at 17:12
  • But afterwards I will put the file content into a MySQL database. The file contains special characters like ď ľ š č ť ž ý á í é, which can't be inserted into MySQL. I only had luck when the input file was in UTF-8. – Tommy Mar 26 '14 at 17:19
  • I have seen some solutions that use perl -pe together with the sed command, but I don't know how to use them. – Tommy Mar 26 '14 at 17:20
  • Are you trying to convert `temp_file2` *to* UTF-8, or convert it *from* UTF-8 to ASCII (losing some information in the process)? – chepner Mar 26 '14 at 17:21
  • I tried to convert it, but I lost characters. I need them all. – Tommy Mar 26 '14 at 17:59

2 Answers


The problem is that sed can have trouble with non-ASCII characters, especially when the system locale is not a UTF-8 locale.

$ bash -c "echo 'abc,ščťžýáíéď' | LANG= LC_CTYPE= sed -E --debug 's/./\U&/g'"
SED PROGRAM:
  s/./\U&/g
INPUT:   'STDIN' line 1
PATTERN: abc,\o37777777705\o37777777641\o37777777704\o37777777615\o37777777705\o37777777645\o37777777705\o37777777676\o37777777703\o37777777675\o37777777703\o37777777641\o37777777703\o37777777655\o37777777703\o37777777651\o37777777704\o37777777617
COMMAND: s/./\U&/g
MATCHED REGEX REGISTERS
  regex[0] = 0-1 'a'
PATTERN: ABC,\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777\o37777777777
END-OF-CYCLE:
ABC,

As you can see, without a UTF-8 locale sed treats each non-ASCII character as several separate bytes, so it cannot uppercase them correctly and the output is corrupted. One solution is to set LANG and LC_CTYPE to a UTF-8 locale.

$ bash -c "echo 'abc,ščťžýáíéď' | LANG=C.UTF8 LC_CTYPE=C.UTF8 sed -E --debug 's/./\U&/g'"
SED PROGRAM:
  s/./\U&/g
INPUT:   'STDIN' line 1
PATTERN: abc,\o37777777705\o37777777641\o37777777704\o37777777615\o37777777705\o37777777645\o37777777705\o37777777676\o37777777703\o37777777675\o37777777703\o37777777641\o37777777703\o37777777655\o37777777703\o37777777651\o37777777704\o37777777617
COMMAND: s/./\U&/g
MATCHED REGEX REGISTERS
  regex[0] = 0-1 'a'
PATTERN: ABC,\o37777777705\o37777777640\o37777777704\o37777777614\o37777777705\o37777777644\o37777777705\o37777777675\o37777777703\o37777777635\o37777777703\o37777777601\o37777777703\o37777777615\o37777777703\o37777777611\o37777777704\o37777777616
END-OF-CYCLE:
ABC,ŠČŤŽÝÁÍÉĎ
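
Applied to the command from the question, the same idea might look like the sketch below. It assumes GNU sed and that a C.UTF-8 locale is installed (`locale -a` lists the available ones); `capitalize_sentences` is a hypothetical helper name, not something from the original script.

```shell
#!/usr/bin/env bash
# Sketch: run the question's substitution under a UTF-8 locale (GNU sed assumed).
# capitalize_sentences is a hypothetical helper name used only for illustration.
capitalize_sentences() {
  # LC_ALL overrides LANG and LC_CTYPE, and only for this one sed invocation.
  LC_ALL=C.UTF-8 sed -r 's/(^|\.\s+)./\U&/g'
}

# Example usage, reading from the question's temp file:
#   capitalize_sentences < "$temp_file_2" > output_utf8.txt
```

Setting the locale on the command itself, rather than exporting it, leaves the rest of the script's environment unchanged.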


davidhcefx

Try this:

  cat <src> | iconv -f <srcenc> | sed .... | iconv -t <targetenc> > target

To see list of encodings:

  iconv -l

To check whether you guessed the encoding of your input file correctly, look at the output of:

  cat <src> | iconv -f <srcenc>
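
A concrete variant of the pipeline above could be sketched as follows. It differs slightly from the original: it decodes straight to UTF-8 and runs sed under a UTF-8 locale, so no second iconv call is needed. The helper name `convert_and_capitalize` is hypothetical, ISO-8859-2 in the usage comment is only a guess at the source encoding, and GNU sed plus an installed C.UTF-8 locale are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert FILE from SRC_ENC to UTF-8, then capitalize.
# Usage: convert_and_capitalize SRC_ENC FILE
convert_and_capitalize() {
  local src_enc="$1"
  local file="$2"
  # Decode to UTF-8 first so that sed sees whole characters, not raw bytes.
  iconv -f "$src_enc" -t UTF-8 "$file" |
    LC_ALL=C.UTF-8 sed -r 's/(^|\.\s+)./\U&/g'
}

# Example (ISO-8859-2 is only a guess at the source encoding):
#   convert_and_capitalize ISO-8859-2 "$temp_file_2" > output_utf8.txt
#   file -bi output_utf8.txt   # check what encoding is reported
```

If the guessed source encoding is wrong, the accented characters will come out as the wrong letters, which is why checking the iconv output by eye first is worthwhile.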
Vytenis Bivainis