
My program loads some news articles from the web, so I end up with an array of HTML documents representing these articles. I need to parse them and show only the relevant content on the screen. That includes converting all HTML escape sequences into readable symbols, so I need a function similar to `unescape` in JavaScript.

I know there are libraries in C to parse HTML. But is there some easy way to convert HTML escape sequences like `&amp;` or `&#33;` to just `&` and `!`?

  • Can you simply use `sed` on the file before opening it in your program? – Fe2O3 Sep 09 '22 at 21:35
  • I need to load html files dynamically from the web, so it must be some C function – Artem Goldenberg Sep 09 '22 at 21:36
  • *"I need to load html files dynamically from the web, so it must be some C function"* - That sentence did not make much sense – klutt Sep 09 '22 at 21:44
  • Edited the question for better understanding of the problem – Artem Goldenberg Sep 09 '22 at 21:49
  • Could you write your own function that maps escape sequences onto the character you want in whatever encoding scheme you are using? – SafelyFast Sep 09 '22 at 21:54
  • Well, you could write a very simple Python program that you call from C – klutt Sep 09 '22 at 21:54
  • @klutt I would prefer if there was some way to do it in pure C – Artem Goldenberg Sep 09 '22 at 21:57
  • There is. You just have to do it yourself. :) – klutt Sep 09 '22 at 21:58
  • Also, whatever this program is doing, it does not seem like C is the optimal language. Far from it. – klutt Sep 09 '22 at 22:00
  • @klutt Suppose I could convert all numeric sequences like `&#33;`, but there is no way I can quickly and easily convert all the sequences with letters like `&Iota;` and all that stuff. Also, it's just a project for learning C; it's not for production – Artem Goldenberg Sep 09 '22 at 22:06
  • @ArtemGoldenberg Yes, I assumed it was a project for learning and not something else. So since it is a learning thing, I'd say that something you really need to learn is when to use C and when not to. A good carpenter uses a hammer for hammering and a saw for sawing. He does not try to saw with a hammer because he is learning the hammer. – klutt Sep 10 '22 at 07:47
  • But if you want to do this right, what you should do is to write a proper parser. And another thing, learning how to call python code from C is a very valuable skill. – klutt Sep 10 '22 at 07:56
  • Well, I guess there is no easy way to do it in C. Thanks everyone for clarifying that. And for stating all side routes available – Artem Goldenberg Sep 10 '22 at 21:30

2 Answers


Just wrote and tested a version that does this (crudely). Didn't take long.

You'll want something like this:

typedef struct  {
    int gotLen; // save myriad calls to strlen()
    char *got;
    char *want;
} trx_t;

trx_t lut[] = {
    { 5, "&amp;", "&" },
    { 5, "&#33;", "!" },
    { 8, "&dagger;", "*" },
};
const int nLut = sizeof lut/sizeof lut[0];

And then a loop with two pointers that copies characters within the same buf, sniffing for the '&' that triggers a search of the replacement table. If found, copy the replacement string to the destination and advance the source pointer to skip past the HTML token. If not found, then the LUT may need additional tokens.

Here's a beginning...

void replace( char *buf ) {
    char *pd = buf, *ps = buf;
    while( *ps )
        if( *ps != '&' )
            *pd++ = *ps++;
        else {
            // EDIT: Credit @Craig Estey
            if( ps[1] == '#' ) {
                if( ps[2] == 'x' || ps[2] == 'X' ) {
                     /* decode hex value and save as char(s) */
                } else {
                     /* decode decimal value and save as char(s) */
                }
                 /* advance pointers and continue */
            }
            for( int i = 0; i < nLut; i++ )
                /* not giving it all away */
                /* handle "found" and "not found" in LUT */
        }
    *pd = '\0';
}
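
For reference, here is one way the elided parts could be filled in. It is a sketch under stated assumptions, not the answerer's actual code: the output is assumed to be UTF-8, `parse_ref` and `put_utf8` are hypothetical helper names introduced here, and the extra table entries are only illustrative (`&dagger;` still maps to `*` as in the table above). Numeric references are decoded instead of being listed in the table.

```c
#include <string.h>

typedef struct {
    int gotLen;          /* strlen(got), cached */
    const char *got;     /* token as it appears in the HTML */
    const char *want;    /* replacement text */
} trx_t;

static const trx_t lut[] = {
    { 5, "&amp;",    "&"  },
    { 4, "&lt;",     "<"  },
    { 4, "&gt;",     ">"  },
    { 6, "&quot;",   "\"" },
    { 8, "&dagger;", "*"  },
};
static const int nLut = sizeof lut / sizeof lut[0];

/* Hypothetical helper: read digits in the given base (10 or 16) up to ';'.
   Returns the position just past ';', or NULL if malformed or out of range. */
static char *parse_ref(char *s, int base, unsigned long *cp)
{
    char *p = s;
    unsigned long v = 0;

    for (;; p++) {
        int d;
        if (*p >= '0' && *p <= '9')                    d = *p - '0';
        else if (base == 16 && *p >= 'a' && *p <= 'f') d = *p - 'a' + 10;
        else if (base == 16 && *p >= 'A' && *p <= 'F') d = *p - 'A' + 10;
        else break;
        v = v * (unsigned long)base + (unsigned long)d;
        if (v > 0x10FFFF)
            return NULL;
    }
    if (p == s || *p != ';')
        return NULL;
    *cp = v;
    return p + 1;
}

/* Hypothetical helper: encode cp as UTF-8 at out.
   Returns the number of bytes written, or 0 (nothing written) if cp is unusable. */
static int put_utf8(unsigned long cp, char *out)
{
    if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

void replace(char *buf)
{
    char *pd = buf, *ps = buf;

    while (*ps) {
        if (*ps == '&' && ps[1] == '#') {
            int hex = (ps[2] == 'x' || ps[2] == 'X');
            unsigned long cp;
            char *end = parse_ref(ps + (hex ? 3 : 2), hex ? 16 : 10, &cp);
            int n = end ? put_utf8(cp, pd) : 0;
            if (n > 0) {            /* decoded: skip past the reference */
                pd += n;
                ps = end;
                continue;
            }
        } else if (*ps == '&') {
            int i;
            for (i = 0; i < nLut; i++)
                if (strncmp(ps, lut[i].got, (size_t)lut[i].gotLen) == 0)
                    break;
            if (i < nLut) {         /* found: copy replacement, skip token */
                size_t n = strlen(lut[i].want);
                memcpy(pd, lut[i].want, n);
                pd += n;
                ps += lut[i].gotLen;
                continue;
            }
        }
        *pd++ = *ps++;              /* ordinary char, or an unrecognised '&' */
    }
    *pd = '\0';
}
```

Decoding in place is safe for numeric references because the UTF-8 form of a code point is never longer than the shortest reference that can spell it (for example, `&#128;` is six characters for two bytes). Anything the function cannot decode is copied through unchanged, so a bare `&` survives.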

This was the test program

int main() {
    char str[] = "The fox &amp; hound&dagger; went for a walk&#33; & chat.";

    puts( str );
    replace( str );
    puts( str );

    return 0;
}

and this was the output

The fox &amp; hound&dagger; went for a walk&#33; & chat.
The fox & hound* went for a walk! & chat.

The "project" is to write the interesting bit of the code. It's not difficult.

Caveat: Only works when the replacement is no longer than the HTML token it replaces. Otherwise you need two buffers.
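
Since `gotLen` is maintained by hand and the caveat depends on the lengths in the table, a small start-up check can catch mistakes early. The following is only a sketch: `check_lut` is a hypothetical helper that is not part of the answer, and it assumes the same `trx_t` layout (with `const` added to the pointers).

```c
#include <string.h>

typedef struct {
    int gotLen;          /* cached length of got */
    const char *got;     /* token as it appears in the HTML */
    const char *want;    /* replacement text */
} trx_t;

/* Returns the index of the first entry that would break in-place
   replacement, or -1 if every entry is safe. An entry is unsafe when
   its cached length is wrong or its replacement is longer than the token. */
int check_lut(const trx_t *lut, int n)
{
    for (int i = 0; i < n; i++) {
        size_t got = strlen(lut[i].got);
        if (lut[i].gotLen < 0 || (size_t)lut[i].gotLen != got)
            return i;
        if (strlen(lut[i].want) > got)
            return i;
    }
    return -1;
}
```

Calling this once before the first `replace()` (for instance inside an `assert`) keeps the single-buffer approach honest as the table grows.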

Fe2O3
  • `&#xXX;` is a hex value of `XX` (e.g. `D5`). So, I'd decode that rather than treating it as a fixed string like `&amp;` – Craig Estey Sep 09 '22 at 22:32
  • @CraigEstey Good point! Thank you... This "sketch" was to point the OP toward a method. Your insight is terrific and would much improve the actual realisation of such a function. Thank you! `:-)` – Fe2O3 Sep 09 '22 at 22:35
  • Yeah, that's indeed what such a program should look like in general. But my main problem is that HTML has so many escape sequences like `&Prime;`, `&le;` and so on. Too many to hardcode into an array, I think. That's why I was hoping that there is an existing function for that. In your program, you hardcoded like three of those, but I need hundreds – Artem Goldenberg Sep 09 '22 at 22:44
  • @CraigEstey revised with your suggestion. Thanks again! – Fe2O3 Sep 09 '22 at 22:47
  • @ArtemGoldenberg Yes, when the lexicon is large, there are many equivalences. No way around that. For performance, you could blend in 'hashing' the tokens to speed things up, but that would be another level of complexity. Or, sort the LUT and use a binary search to find/not-find the token's replacement... Many delightful subtasks to this project. Have fun! `:-)` (PS: at least you aren't facing "case insensitive", too...) – Fe2O3 Sep 09 '22 at 22:50
  • @ArtemGoldenberg You can build up the table as you go. Here's a good start: https://reeddesign.co.uk/test/character-entities.html Should be an easy scripting/edit job to convert the table into C code – Craig Estey Sep 10 '22 at 01:22
  • @CraigEstey Thank you again. Answer revised... again... Noticed both decimal OR hexadecimal could appear in HTML... Doh! `:-)` (anything more? `:) ` – Fe2O3 Sep 10 '22 at 01:38
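
The "sort the LUT and use a binary search" idea from the comments could look roughly like this. It is a sketch, not code from the thread: `entity_t`, `entities` and `lookup_entity` are names made up here, the table holds only a handful of entries, and the replacements are assumed to be UTF-8. The standard `bsearch()` does the searching.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *name;   /* entity name without '&' and ';' */
    const char *text;   /* replacement, assumed UTF-8 */
} entity_t;

/* Must stay sorted by name in strcmp() order (uppercase sorts before
   lowercase in ASCII), because bsearch() relies on it. */
static const entity_t entities[] = {
    { "Iota",   "\xCE\x99"     },   /* U+0399 */
    { "Prime",  "\xE2\x80\xB3" },   /* U+2033 */
    { "amp",    "&"            },
    { "dagger", "\xE2\x80\xA0" },   /* U+2020 */
    { "gt",     ">"            },
    { "le",     "\xE2\x89\xA4" },   /* U+2264 */
    { "lt",     "<"            },
    { "quot",   "\""           },
};

/* bsearch() passes the key first and the array element second. */
static int cmp_entity(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const entity_t *)elem)->name);
}

/* Returns the replacement text, or NULL if the name is not in the table. */
const char *lookup_entity(const char *name)
{
    const entity_t *hit = bsearch(name, entities,
                                  sizeof entities / sizeof entities[0],
                                  sizeof entities[0], cmp_entity);
    return hit ? hit->text : NULL;
}
```

The comparison is case-sensitive on purpose: `Prime` and `prime` are different entities in HTML, so the table must not be folded to one case. A full table with hundreds of rows would be generated by a script, as suggested above, and sorted before being pasted in.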

This is something that you typically wouldn't use C for. I would have used Python. Here are two questions that could be a good start:

What's the easiest way to escape HTML in Python?

How do you call Python code from C code?

But apart from that, the solution is to write a proper parser. There are lots of resources out there on that topic, but basically you could do something like this:

parseFile()
    while not EOF
        ch = readNextCharacter()
        if ch == '\'
            readNextCharacter()
        elseif ch == '&'
            readEscapeSequence()
        else
            output += ch

readEscapeSequence()
    seq = ""
    ch = readNextCharacter();
    while ch != ';'
        seq += ch
        ch = readNextCharacter();
    replace = lookupEscape(seq)
    output += replace

Note that this is only pseudo code to get you started
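
A rough C rendition of the pseudocode, for those who want to stay in C. This is a sketch under assumptions rather than a finished parser: `parseString` is a hypothetical name, it reads from a string instead of a file, it writes to a separate output buffer, `lookupEscape` knows only a few sequences, and the backslash branch is left out on the assumption that HTML text has no backslash escapes. Unknown or unterminated sequences are copied through unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Maps the text between '&' and ';' to its replacement, NULL if unknown.
   The table is deliberately tiny; a real one would be much larger. */
static const char *lookupEscape(const char *seq)
{
    static const char *const table[][2] = {
        { "amp", "&" }, { "lt", "<" }, { "gt", ">" },
        { "quot", "\"" }, { "#33", "!" },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(seq, table[i][0]) == 0)
            return table[i][1];
    return NULL;
}

/* Copies in to out, replacing known escape sequences.
   Returns the number of bytes written, not counting the terminator.
   Output is truncated if out is too small. */
size_t parseString(const char *in, char *out, size_t outSize)
{
    size_t n = 0;

    if (outSize == 0)
        return 0;

    while (*in && n + 1 < outSize) {
        const char *replace = NULL;
        size_t skip = 1;

        if (*in == '&') {
            /* readEscapeSequence: collect up to ';', with a length cap */
            char seq[16];
            size_t len = 0;
            while (len < sizeof seq - 1 && in[1 + len] != '\0' && in[1 + len] != ';') {
                seq[len] = in[1 + len];
                len++;
            }
            seq[len] = '\0';
            if (len > 0 && in[1 + len] == ';') {
                replace = lookupEscape(seq);
                if (replace)
                    skip = len + 2;     /* '&' + name + ';' */
            }
        }

        if (replace) {
            size_t rlen = strlen(replace);
            if (n + rlen >= outSize)
                break;                  /* no room left for the replacement */
            memcpy(out + n, replace, rlen);
            n += rlen;
        } else {
            out[n++] = *in;             /* ordinary character or unknown '&' */
        }
        in += skip;
    }
    out[n] = '\0';
    return n;
}
```

Because the output goes to its own buffer, a replacement may be longer than the sequence it replaces, at the cost of having to size that buffer.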

klutt