
I am reading data over a socket and parsing it with LibXML.

The problem that I am currently having is that sometimes there is a web link in the data which breaks the parser.

http://example.com/?key=value&key2=value

Is there any way to convert the special characters in that link to HTML entities?

Something like converting the above to

http://example.com/?key=value&amp;key2=value

Example of socket data:

<node link="http://example.com/?key=value&key2=value" />

EDIT: Found a solution that works for my problem here

Austin
  • Am I blind or aren't there any differences between the two examples? – bolov Nov 14 '15 at 20:43
  • The two examples are identical, aren't they!! – Ikbel Nov 14 '15 at 20:43
  • Use backticks ` or else the amp won't show. – drum Nov 14 '15 at 20:43
  • the markup translates the html codes. Made them into code blocks – bolov Nov 14 '15 at 20:44
  • It is relatively easy to cook yourself a function. Basically what you need is a search and replace. – bolov Nov 14 '15 at 21:01
  • HTML character entities start with an '&' and end with ';'. You either have to parse the string, or iterate over the whole HTML character table and do a find and replace. Here is [the html table](http://www.ascii.cl/htmlcodes.htm). Pay attention: some characters may be in Unicode. – milevyo Nov 14 '15 at 21:05
  • @milevyo: You know that. I know that. OP probably knows that. The problem is his source failed to encode URLs correctly. – Joshua Nov 14 '15 at 21:32

2 Answers


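You are going to have to do a pre-filter here. Contrary to other indications, a plain search and replace just won't cut it: the search pattern is a bare &, which matches too much, including the & of entities that are already escaped (such as &amp;) and anything inside CDATA sections.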

Construct the following finite state machine:

NORMAL:
    if next matches "<" then TAG

TAG:
    if next matches "![CDATA[" then CDATA
    TAGSCAN

TAGSCAN:
    if next matches whitespace then TAGSCAN2
    if next matches > or next matches /> then NORMAL

TAGSCAN2:
    if next matches whitespace then TAGSCAN2
    if next matches SRC= or next matches HREF= then URL
    TAGSCAN

URL:
   we found an attribute with a URL in it. Do your search and replace
   on the contents of the URL attribute value, advance past the URL and
   go back to TAGSCAN

CDATA:
   if next doesn't match ]]> then CDATA
   NORMAL
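The state machine above can be sketched in C roughly as follows. This is a simplified variant under stated assumptions, not a definitive implementation: it treats every quoted attribute value inside a tag as a candidate (rather than only SRC= and HREF=, since the example data uses a link attribute), it leaves anything that already looks like an entity alone, and it does not special-case comments or processing instructions. The names escape_attr_ampersands and entity_len are hypothetical helpers introduced only for this illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of an entity reference such as "&amp;" or
   "&#38;" starting at s, or 0 if s does not start one. */
static size_t entity_len(const char *s)
{
    size_t i = 1, start;

    if (s[0] != '&')
        return 0;
    if (s[i] == '#')
        i++;
    start = i;
    while (isalnum((unsigned char)s[i]))
        i++;
    return (i > start && s[i] == ';') ? i + 1 : 0;
}

/* Hypothetical pre-filter: copies in to out (capacity cap bytes), turning
   each bare '&' inside a quoted attribute value into "&amp;".
   Returns 0 on success, -1 if out is too small. */
int escape_attr_ampersands(const char *in, char *out, size_t cap)
{
    enum { NORMAL, TAG, VALUE, CDATA } state = NORMAL;
    char quote = 0;
    size_t o = 0;

    if (cap == 0)
        return -1;

    for (size_t i = 0; in[i] != '\0'; i++) {
        char c = in[i];
        const char *emit = NULL;   /* replacement text, if any */
        size_t n;

        switch (state) {
        case NORMAL:
            if (strncmp(in + i, "<![CDATA[", 9) == 0)
                state = CDATA;     /* copy CDATA contents untouched */
            else if (c == '<')
                state = TAG;
            break;
        case TAG:
            if (c == '"' || c == '\'') {
                quote = c;         /* remember which quote closes the value */
                state = VALUE;
            } else if (c == '>') {
                state = NORMAL;
            }
            break;
        case VALUE:
            if (c == quote)
                state = TAG;
            else if (c == '&' && entity_len(in + i) == 0)
                emit = "&amp;";    /* bare ampersand: escape it */
            break;
        case CDATA:
            if (strncmp(in + i, "]]>", 3) == 0)
                state = NORMAL;
            break;
        }

        n = emit ? strlen(emit) : 1;
        if (o + n >= cap)
            return -1;             /* keep one byte for the terminator */
        memcpy(out + o, emit ? emit : &c, n);
        o += n;
    }
    out[o] = '\0';
    return 0;
}
```

Called on the example node from the question with a large enough output buffer, this should yield the same node with &amp; in the link attribute, while input whose ampersands are already escaped comes back unchanged. The filtered buffer can then be handed to the parser in place of the raw socket data.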
Joshua

I found a nice solution using the code from Find and Replace, following the search-and-replace approach suggested by bolov.

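/* Replace every '&' in message with "&amp;"; size is assumed to be the
   capacity of the message buffer. */
retval = str_replace(message, size, "&", "&amp;");
if (!retval) {
    printf("Not enough room to replace & with '&amp;'\n");
}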
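The str_replace function comes from the linked answer and is not shown here. A minimal stand-in with the same call shape might look like the sketch below. The signature, the meaning of size (total buffer capacity in bytes) and the return convention (nonzero on success, 0 when the result would not fit) are all assumptions inferred from the call above, not the linked code itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the linked str_replace: replaces every
   occurrence of find in the NUL-terminated string buf (capacity size bytes)
   with repl, in place. Returns 1 on success, 0 if the result would not fit. */
int str_replace(char *buf, size_t size, const char *find, const char *repl)
{
    size_t find_len = strlen(find);
    size_t repl_len = strlen(repl);
    size_t len = strlen(buf);
    char *p = buf;

    if (find_len == 0)
        return 1;                  /* nothing to search for */

    while ((p = strstr(p, find)) != NULL) {
        size_t new_len = len - find_len + repl_len;

        if (new_len + 1 > size)
            return 0;              /* no room; earlier replacements remain */

        /* Shift the tail (including the NUL) to make room, then copy repl. */
        memmove(p + repl_len, p + find_len,
                len - (size_t)(p - buf) - find_len + 1);
        memcpy(p, repl, repl_len);
        len = new_len;
        p += repl_len;             /* skip the replacement: do not rescan "&amp;" */
    }
    return 1;
}
```

Note that a blanket replace like this also rewrites ampersands that are already part of an entity, so a message containing &amp; ends up with &amp;amp;. It is only safe when the incoming data is known never to contain escaped entities.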
Austin