
We have files in which some characters are represented by their decimal (!) ASCII values wrapped in (cid:#), e.g. (cid:104) for h. The string hello is thus represented as (cid:104)(cid:101)(cid:108)(cid:108)(cid:111).

How can I replace these with the corresponding ASCII characters using sed?

Here is an example file:

$ cat input.txt
first line
pre (cid:104)(cid:101)(cid:108)(cid:108)(cid:111) post
last line

What I've tried so far is:

$ x="(cid:104)(cid:101)(cid:108)(cid:108)(cid:111)"
$ echo $x | sed 's/(cid:\([^\)]*\))/\1/g'
104101108108111

But we need the output to be hello:

$ cat output.txt
first line
pre hello post
last line

I'm trying to use printf inside sed, but I cannot figure out how to pass the backreference \1 to printf:

sed 's/(cid:\([^\)]*\))/'`printf "\x$(printf %x \1)"`'/g'
wolfrevo
  • given your updated question, what is the exact, desired output? Note it is important to provide a [mcve] from the very beginning, since your update invalidates our current answers. – fedorqui Jul 25 '16 at 09:28
  • You might need to explain why 'using sed' is a requirement. That is much, much more difficult than using a more suitable tool such as awk or perl... – Toby Speight Jul 25 '16 at 09:42

2 Answers

$ cat input.txt 
first line
pre (cid:104)(cid:101)(cid:108)(cid:108)(cid:111) post
last line

$ perl -pe 's/\(cid:(\d+)\)/chr($1)/ge' input.txt > output.txt
$ cat output.txt
first line
pre hello post
last line

Thanks to @123 for suggesting chr($1) instead of sprintf "%c", $1. See the chr documentation.

Reference: Integer ASCII value to character in BASH using printf

Sundeep
  • in our special case there are also "normal" characters, i.e. not all characters are represented as `(cid:#)`, only some of them. I've edited my original question to show an example file – wolfrevo Jul 25 '16 at 09:21
  • can you also explicitly post how you want the output file to look with your sample input file? I will edit answer accordingly or delete – Sundeep Jul 25 '16 at 09:27
  • You can use `chr` instead of `sprintf`, i.e. `perl -pe 's/\(cid:(\d+)\)/chr($1)/ge'` – 123 Jul 25 '16 at 09:35
  • @123 thanks :) ... didn't know about that function.. will edit the answer after OP clarifies his requirement – Sundeep Jul 25 '16 at 09:38
  • This does pretty much what I need. Thus +1. But I'd like to find a sed-only solution. – wolfrevo Jul 25 '16 at 09:48
  • @wolfrevo That isn't going to happen. – 123 Jul 25 '16 at 09:50
  • @123 I'm trying to use printf in sed. But cannot find out how to pass the backreference `\1` to printf. sed 's/(cid:\([^\)]*\))/'`printf "\x$(printf %x \1)"`'/g'. See updated question. – wolfrevo Jul 25 '16 at 10:21
  • @wolfrevo , I don't think that would be possible.. see http://stackoverflow.com/questions/22544044/passing-sed-backreference-to-base64-command – Sundeep Jul 25 '16 at 10:37

Using %c you can convert an ASCII code into its corresponding character:

$ awk 'BEGIN {printf "%c", 104}'
h

So it is a matter of extracting the numbers from within (cid:XX). I do this by setting FS to ( and looping through the fields:

awk -v FS='(' '{for (i=2; i<=NF; i++) {
                  r=gensub(/cid:([0-9]+)\)/, "\\1", "g", $i);
                  printf "%c", r+0
                  }
               }' file

This uses gensub() to access the captured group, as described in "GNU awk: accessing captured groups in replacement text", so it depends on GNU awk.

For your given input it returns:

$ awk -v FS='(' '{for (i=2; i<=NF; i++) {r=gensub(/cid:([0-9]+)\)/, "\\1", "g", $i); printf "%c", r+0}}' file
hello
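
Note that the loop above prints only the decoded characters, so for the sample file in the updated question the surrounding text (first line, pre, post) is dropped. Here is a minimal sketch that keeps it, using only match(), substr() and sprintf(), so it should not need GNU awk. The function name decode_cid_awk and the variable names are mine:

```shell
# Hypothetical helper: replace every (cid:N) token with the character for code N,
# leaving all other text on the line as it is.
decode_cid_awk() {
    awk '{
        rest = $0; out = ""
        # match() sets RSTART/RLENGTH to the position and length of the next token
        while (match(rest, /\(cid:[0-9]+\)/)) {
            # skip the 5 chars of "(cid:" and drop the closing ")"
            code = substr(rest, RSTART + 5, RLENGTH - 6) + 0
            out  = out substr(rest, 1, RSTART - 1) sprintf("%c", code)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$@"
}

# Recreate the sample file from the question and decode it.
printf '%s\n' 'first line' 'pre (cid:104)(cid:101)(cid:108)(cid:108)(cid:111) post' 'last line' > input.txt
decode_cid_awk input.txt > output.txt
cat output.txt
```

For the sample file this should give the three lines of the desired output.txt, with "pre hello post" in the middle.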
fedorqui