convert html entities to unicode(utf-8) strings in c?

Question

Possible Duplicate:
How to decode HTML Entities in C?

This question is very similar to that one, but I need to do the same thing in C, not Python. Here are some examples of what the function should do:

input    output

&lt;     <
&gt;     >
&auml;   ä
&#x00DF; ß

The function should have the signature char *html2str(char *html) or similar. I'm not reading byte by byte from a stream.

Is there a library function I can use?

Please be more specific. Do you have html as one string or you are reading it one by one from stream? — qrdl, Sep 12 '09 at 15:44

score 2 · Answer 1 · answered Nov 03 '09 at 15:55

There isn't a standard library function to do the job. There must be a large number of implementations available in the open-source world; just about any program that has to deal with HTML will have one.

There are two aspects to the problem:

Finding the HTML entities in the source string.
Inserting the appropriate replacement text in its place.

An entity reference is at least 4 bytes long: the shortest conceivable form is '&x;', but, AFAIK, they all have at least 2 characters between the ampersand and the semicolon (e.g. '&lt;'). The longest possible UTF-8 representation of a character is also 4 bytes, so, as long as each entity decodes to a single character, the replacement is never longer than the entity it replaces. Hence, it is possible to edit in situ safely.
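The in-place idea can be sketched roughly as follows. This is only an illustration, not a complete decoder: `lookup_entity` and its five-entry table are hypothetical stand-ins for a real entity table, numeric references are not handled, and the code assumes every replacement is no longer than the reference it replaces.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: maps a handful of entity names to UTF-8 text.
 * Returns NULL for names it does not know. */
static const char *lookup_entity(const char *name, size_t len)
{
    static const struct { const char *name; const char *utf8; } table[] = {
        { "amp",  "&" },
        { "gt",   ">" },
        { "lt",   "<" },
        { "quot", "\"" },
        { "auml", "\xC3\xA4" },   /* U+00E4 encoded as UTF-8 */
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strlen(table[i].name) == len &&
            memcmp(table[i].name, name, len) == 0)
            return table[i].utf8;
    }
    return NULL;
}

/* Decode the known named entities in place and return html.
 * src is the read position, dst the write position; dst never
 * overtakes src because no replacement is longer than its entity. */
char *html2str(char *html)
{
    assert(html != NULL);
    char *src = html;
    char *dst = html;

    while (*src != '\0') {
        if (*src == '&') {
            char *semi = strchr(src + 1, ';');
            if (semi != NULL) {
                const char *rep =
                    lookup_entity(src + 1, (size_t)(semi - (src + 1)));
                if (rep != NULL) {
                    size_t n = strlen(rep);
                    memcpy(dst, rep, n);  /* rep lives outside the buffer */
                    dst += n;
                    src = semi + 1;       /* skip past the ';' */
                    continue;
                }
            }
        }
        *dst++ = *src++;                  /* ordinary byte or unknown entity */
    }
    *dst = '\0';
    return html;
}
```

Called on a writable buffer holding `a &lt; b`, this leaves `a < b` in the same buffer. Unknown references such as `&unknown;` are copied through unchanged.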

There's an illustration of HTML entity decoding in 'The Practice of Programming' by Kernighan and Pike, though it is done somewhat 'in passing'. They use a tokenizer to recognize the entity, and a sorted table of entity names plus the replacement value so that they can use a binary search to identify the replacements. This is only needed for the non-algorithmic entity names. For numeric entities such as '&#x00DF;' (ß), you use an algorithmic technique to decode them.
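A minimal sketch of those two lookup paths, written from scratch rather than taken from the book: a sorted table searched with the standard `bsearch`, and an algorithmic decoder for numeric references. The names `named_entity`, `numeric_entity` and `utf8_encode` are made up for this example, and the six-entry table is only a sample of the full entity list.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entity { const char *name; unsigned long codepoint; };

/* Illustrative subset; must stay sorted by name (strcmp order). */
static const struct entity entities[] = {
    { "amp",   0x26 },
    { "auml",  0xE4 },
    { "gt",    0x3E },
    { "lt",    0x3C },
    { "quot",  0x22 },
    { "szlig", 0xDF },
};

static int cmp_entity(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct entity *)elem)->name);
}

/* Binary search for a NUL-terminated entity name; 0 if unknown. */
unsigned long named_entity(const char *name)
{
    assert(name != NULL);
    const struct entity *e = bsearch(name, entities,
                                     sizeof entities / sizeof entities[0],
                                     sizeof entities[0], cmp_entity);
    return e ? e->codepoint : 0;
}

/* Parse the text between "&#" and ";": decimal digits, or hex digits
 * after a leading 'x'/'X'. Returns 0 if malformed or above U+10FFFF. */
unsigned long numeric_entity(const char *digits)
{
    assert(digits != NULL);
    unsigned long base = 10, cp = 0;
    if (*digits == 'x' || *digits == 'X') { base = 16; digits++; }
    if (*digits == '\0')
        return 0;
    for (; *digits != '\0'; digits++) {
        unsigned char c = (unsigned char)*digits;
        unsigned long d;
        if (c >= '0' && c <= '9') d = c - '0';
        else if (base == 16 && c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (base == 16 && c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return 0;
        cp = cp * base + d;
        if (cp > 0x10FFFF)
            return 0;
    }
    return cp;
}

/* Write cp as UTF-8 into out (room for 4 bytes); returns the byte
 * count, or 0 for surrogates and values above U+10FFFF. */
size_t utf8_encode(unsigned long cp, char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With these pieces, `&auml;` would go through `named_entity("auml")` and `&#x00DF;` through `numeric_entity("x00DF")`; either code point is then handed to `utf8_encode` to produce the replacement bytes.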

score 0 · Answer 2 · edited Nov 20 '17 at 03:55

This sounds like a job for flex. Granted, flex is usually stream-based, but you can change that using the flex function yy_scan_string (or its relatives). For details, see The flex Manual: Scanning Strings.

Flex's basic Unicode support is pretty bad, but if you don't mind coding in the bytes by hand, it could be a workaround. There are probably other tools that can do what you want, as well.
