I'm running Cygwin under Windows 10.

I have a dictionary file (1-dictionary.txt) that looks like this:

labelling   labeling
flavour flavor
colour  color
organisations   organizations
végétales   v&#x00E9;g&#x00E9;tales
contrôlée   contr&#x00F4;l&#x00E9;e
"   &#x0022;

The columns are separated by TABs (\t).

The dictionary file is encoded as UTF-8.

I want to replace words and symbols in the first column with the words and HTML entities in the second column.

My source file (2-source.txt) contains the target UTF-8 and ASCII symbols. It is also encoded as UTF-8.

Sample text looks like this:

Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system

I run the following sed one-liner in a shell script (./3-script.sh):

sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt

The substitution of English (en-GB) words with American (en-US) words in 3-translation.txt is successful.

However, the substitution of ASCII symbols, such as the quote symbol, and of UTF-8 words produces this result:

vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)

If I put only the specific symbol (not the full word) in the dictionary, I get results like this:

vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e

The ASCII quote symbol is appended with &#x0022; - it is not replaced.

Similarly, the UTF-8 symbol is appended with its HTML entity - not replaced with the HTML entity.

The expected output would look like this:

v&#x00E9;g&#x00E9;tales
&#x0022;cultivated&#x0022;
contr&#x00F4;l&#x00E9;e

How can I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with their HTML entity equivalents as defined in the dictionary file?

davidhcefx
Jay Gray
    Possible duplicate of [Unexpected substitution for & with sed](https://stackoverflow.com/questions/6200249/unexpected-substitution-for-with-sed) – tripleee Mar 08 '19 at 17:47
    Possible duplicate of https://stackoverflow.com/questions/407523/escape-a-string-for-a-sed-replace-pattern – tripleee Mar 08 '19 at 17:48
    I tried it, just replace all `&` with `\&` in your `1-dictionary.txt` will solve your problem. Try it, see if it's working. – Til Mar 08 '19 at 18:09

1 Answer

I tried it: replacing every & with \& in your 1-dictionary.txt solves the problem.
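A quick way to see the difference, as a minimal sketch using a word from the question (the variable names are only for illustration):

```shell
# In a sed replacement, an unescaped & stands for the whole matched text.
unescaped=$(printf '%s\n' 'végétales' | sed 's/é/&#x00E9;/g')

# Escaping it as \& inserts a literal ampersand instead.
escaped=$(printf '%s\n' 'végétales' | sed 's/é/\&#x00E9;/g')

printf '%s\n' "$unescaped"   # vé#x00E9;gé#x00E9;tales (the symptom from the question)
printf '%s\n' "$escaped"     # v&#x00E9;g&#x00E9;tales
```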

sed's s command treats the "from" part as a regex, so when you generate commands like that, watch for regex metacharacters in the first column and put a \ in front of any that should match literally.

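The "to" part has special characters too, mainly \ and &, so put an extra \ in front of those as well. An unescaped & in the replacement stands for the whole matched text, which is why the matched word or symbol showed up in place of each & in your output.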
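If you would rather not edit the dictionary by hand, the escaping can be done while the sed script is generated. The following is only a sketch under some assumptions: the file names (dict.tsv, text.txt, rules.sed) and the sample entries are made up for the demo, every dictionary line is assumed to have exactly two TAB-separated columns, and it uses temporary files instead of process substitution so that it also runs in a plain sh.

```shell
#!/bin/sh
# Sketch: build a sed script from a TAB-separated dictionary,
# escaping both columns. All file names here are hypothetical.
tmp=$(mktemp -d)

# Sample dictionary: a plain word pair, an entity replacement,
# and a pattern containing dots.
printf 'colour\tcolor\n"\t&#x0022;\ne.g.\tfor example\n' > "$tmp/dict.tsv"
printf '%s\n' 'The "colour" of, e.g., eggs' > "$tmp/text.txt"

# Column 1 becomes a regex: escape ] [ \ . * ^ $ and the / delimiter.
cut -f1 "$tmp/dict.tsv" | sed 's|[][\\.*^$/]|\\&|g' > "$tmp/pat"

# Column 2 becomes a replacement: escape \ and & and the / delimiter.
cut -f2 "$tmp/dict.tsv" | sed 's|[\\&/]|\\&|g' > "$tmp/rep"

# Join the two columns as s/pattern/replacement/g, one command per line.
paste -d/ "$tmp/pat" "$tmp/rep" | sed 's|^|s/|; s|$|/g|' > "$tmp/rules.sed"

# Apply the generated rules to the sample text.
result=$(sed -f "$tmp/rules.sed" "$tmp/text.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

With the dots escaped, the e.g. entry only matches the literal abbreviation; left unescaped, the same pattern would also match "eggs" in the sample line.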

The details are in GNU sed's documentation; for other sed versions, you can also check man sed.

Til