Encode HTML Characters in C

Question

I am reading data over a socket and parsing it with LibXML.

The problem that I am currently having is that sometimes there is a web link in the data which breaks the parser.

http://example.com/?key=value&key2=value

Is there any way to convert that to HTML character entities?

Something like converting the above to

http://example.com/?key=value&amp;key2=value

Example of socket data:

<node link="http://example.com/?key=value&key2=value" />

EDIT: Found a solution that works for my problem here

Am I blind or aren't there any differences between the two examples? — bolov, Nov 14 '15 at 20:43
the markup translates the html codes. Made them into code blocks — bolov, Nov 14 '15 at 20:44
it is relatively easy to cook yourself a function. Basically what you need is a search and replace. — bolov, Nov 14 '15 at 21:01
HTML character entities start with '&' and end with ';'. You either have to parse the string, or iterate over the whole HTML entity table and do a find and replace. Here is [the html table](http://www.ascii.cl/htmlcodes.htm). Pay attention: some characters may be in Unicode. — milevyo, Nov 14 '15 at 21:05
@milevyo: You know that. I know that. OP probably knows that. The problem is his source failed to encode URLs correctly. — Joshua, Nov 14 '15 at 21:32

score 1 · Answer 1 · answered Nov 14 '15 at 21:12

You are going to have to do a pre-filter here. Contrary to other indications, a plain search and replace just won't cut it. Consider that your search pattern is &, which matches too much.

Construct the following finite state machine:

NORMAL:
    if next matches "<" then TAG

TAG:
    if next matches "![CDATA[" then CDATA
    TAGSCAN

TAGSCAN:
    if next matches whitespace then TAGSCAN2
    if next matches > or next matches /> then NORMAL

TAGSCAN2:
    if next matches whitespace then TAGSCAN2
    if next matches SRC= or next matches HREF= then URL
    TAGSCAN

URL:
   we found an attribute with a URL in it. Do your search and replace
   on the contents of the URL attribute value, advance past the URL and
   go back to TAGSCAN

CDATA:
   if next doesn't match ]]> then CDATA
   NORMAL
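
As a rough illustration, the state machine above could be sketched in C along these lines. The function name `escape_url_attrs`, the helpers, and the output-buffer interface are all made up for this example. The attribute list holds `src` and `href` as in the pseudocode, plus `link` as an assumption to cover the node from the question. It matches attribute names case-sensitively, expects a quoted value directly after the `=`, and escapes every `&` inside the value, so treat it as a starting point rather than a complete XML pre-filter.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum state { NORMAL, TAGSCAN, TAGSCAN2, CDATA };

/* "src" and "href" come from the pseudocode; "link" is an assumption
 * added so the sample node from the question is covered. */
static const char *const url_attrs[] = { "src=", "href=", "link=" };

/* Append n bytes to out; returns 0 if that would leave no room for '\0'. */
static int put(char *out, size_t cap, size_t *pos, const char *s, size_t n)
{
    if (*pos + n >= cap)
        return 0;
    memcpy(out + *pos, s, n);
    *pos += n;
    return 1;
}

/* Length of the URL attribute name (including '=') at s, or 0 if none. */
static size_t match_url_attr(const char *s)
{
    for (size_t i = 0; i < sizeof url_attrs / sizeof url_attrs[0]; i++) {
        size_t n = strlen(url_attrs[i]);
        if (strncmp(s, url_attrs[i], n) == 0)
            return n;
    }
    return 0;
}

/* Copies in to out, writing "&amp;" for each '&' found inside a URL
 * attribute value. Returns 0 on success, -1 if out is too small. */
int escape_url_attrs(const char *in, char *out, size_t cap)
{
    enum state st = NORMAL;
    size_t pos = 0, n;

    if (cap == 0)
        return -1;
    while (*in) {
        switch (st) {
        case NORMAL:
            if (*in == '<')
                st = strncmp(in + 1, "![CDATA[", 8) == 0 ? CDATA : TAGSCAN;
            break;
        case TAGSCAN:
            if (isspace((unsigned char)*in))
                st = TAGSCAN2;
            else if (*in == '>')
                st = NORMAL;
            break;
        case TAGSCAN2:
            if (isspace((unsigned char)*in))
                break;
            n = match_url_attr(in);
            if (n != 0 && (in[n] == '"' || in[n] == '\'')) {
                char quote = in[n];
                /* URL state: copy name and opening quote, then the value. */
                if (!put(out, cap, &pos, in, n + 1))
                    return -1;
                for (in += n + 1; *in && *in != quote; in++) {
                    int ok = (*in == '&') ? put(out, cap, &pos, "&amp;", 5)
                                          : put(out, cap, &pos, in, 1);
                    if (!ok)
                        return -1;
                }
                st = TAGSCAN;
                continue;   /* the closing quote is copied by the next pass */
            }
            st = (*in == '>') ? NORMAL : TAGSCAN;
            break;
        case CDATA:
            if (strncmp(in, "]]>", 3) == 0) {
                if (!put(out, cap, &pos, in, 3))
                    return -1;
                in += 3;
                st = NORMAL;
                continue;
            }
            break;
        }
        if (!put(out, cap, &pos, in, 1))   /* default: copy one byte */
            return -1;
        in++;
    }
    out[pos] = '\0';
    return 0;
}
```

Run on the sample node from the question, this rewrites only the `&` inside the `link` value and leaves text outside tags and inside CDATA sections alone. Like the pseudocode, it treats any `>` in TAGSCAN as the end of the tag, so a `>` inside some other attribute value would confuse it.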

"![CDATA[" in url??? he want just to reformat the urls, not to parse the file content. — milevyo, Nov 14 '15 at 21:52

score 0 · Accepted Answer · edited May 23 '17 at 11:51

I found a nice solution using the code from Find and Replace, which implements the find-and-replace approach suggested by bolov.

retval = str_replace(message, size, "&", "&amp;");
if (!retval) {
    printf("Not enough room to replace & with `&amp;'\n");
}
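
The `str_replace` function itself comes from the linked code and is not shown here. For readers without it, here is a minimal sketch of a function with a compatible shape. The name `str_replace_sketch`, the exact parameter meanings, and the return convention (the buffer on success, NULL when the result would not fit) are assumptions based only on how the call above is used.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the str_replace() called above.
 * Replaces every occurrence of find in the NUL-terminated string buf
 * (capacity size bytes, terminator included) with repl, in place.
 * Returns buf on success, or NULL if the result would not fit, in
 * which case buf is left unchanged. repl must not point into buf. */
char *str_replace_sketch(char *buf, size_t size,
                         const char *find, const char *repl)
{
    size_t find_len = strlen(find);
    size_t repl_len = strlen(repl);
    size_t len = strlen(buf);
    size_t count = 0;
    size_t new_len;
    char *p;

    if (find_len == 0)
        return buf;                 /* nothing sensible to search for */

    /* First pass: count matches so the final length is known up front. */
    for (p = buf; (p = strstr(p, find)) != NULL; p += find_len)
        count++;

    if (repl_len >= find_len)
        new_len = len + count * (repl_len - find_len);
    else
        new_len = len - count * (find_len - repl_len);
    if (new_len + 1 > size)
        return NULL;

    /* Second pass: shift the tail, drop in the replacement, and resume
     * after it so the '&' of a fresh "&amp;" is never matched again. */
    p = buf;
    while ((p = strstr(p, find)) != NULL) {
        memmove(p + repl_len, p + find_len, strlen(p + find_len) + 1);
        memcpy(p, repl, repl_len);
        p += repl_len;
    }
    return buf;
}
```

With the variables from the snippet above, the call would read `retval = str_replace_sketch(message, size, "&", "&amp;");`. Note that this escapes every `&` in the buffer, including any that already begin an entity such as `&amp;`, which is the "matches too much" concern raised in the other answer.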

answered Nov 18 '15 at 22:25

Austin
