
How can I extract all URLs from an HTML document using the C standard library?

I am trying to do this with sscanf(), but valgrind reports an error (and I am not even sure the code would meet my requirement once it is debugged, so if there are other ways, please tell me). I store the HTML content in a string; it contains multiple URLs, both absolute and relative (e.g. http://www.google.com, //www.google.com, /a.html, a.html and so on). I want to extract them one by one and store each of them in a separate string.

I am also thinking about using strstr(), but then I have no idea how to get the second URL.

My code using sscanf() (I have left out the asserts here):

int
main(int argc, char* argv[]) {
    char *remain_html = (char *)malloc(sizeof(char) * 1001);
    char *url = (char *)malloc(sizeof(char) * 101);

    char *html = "<A HREF=\"http://www.google.com\">navigation</a>"
                 "<a href=\"/a.html\">search</a>";
    printf("html: %s\n\n", html);

    sscanf(html, "<a href=\"%s", remain_html);
    printf("after first href tag: %s\n\n", remain_html);
    sscanf(remain_html, "%s\">", url);
    printf("first web: %s\n\n", url);
    sscanf(remain_html, "<a href=\"%s", remain_html);
    printf("after second href tag: %s\n\n", remain_html);

    free(remain_html);
    free(url);
}

valgrind reports: Conditional jump or move depends on uninitialised value(s).

If anybody could help, thank you so much!

Emmm
  • Is your question the first sentence in your post, or how to fix the error at the bottom? – Robert Harvey Apr 10 '20 at 15:40
  • Because the error is very clear, but to understand it, you have to know what the words "conditional," "jump," "move" and "uninitialised" mean. – Robert Harvey Apr 10 '20 at 15:42
  • Why reinvent the wheel? [Parse html using C](https://stackoverflow.com/q/1527883/1115360) dates back to 2009. Also, the answers to [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/q/1732348/1115360) contain information on why regexes and HTML don't go together. – Andrew Morton Apr 10 '20 at 15:49
  • @Robert Harvey The first sentence is my final goal. I tried to solve it with the code above, and it gives an error. What I am asking is whether my code can produce what I want and, if so, how to modify it and fix the bug; if not, what would be a feasible way to solve it. – Emmm Apr 10 '20 at 16:02

1 Answer


valgrind warns you about uninitialized data being used in a test. Since your program only calls sscanf and printf, the problem is very probably in one of your sscanf calls.

If I change your program a little to print the return value of each sscanf, which shows how many elements it read:

int
main(int argc, char* argv[]) {
    char *remain_html = (char *)malloc(sizeof(char) * 1001);
    char *url = (char *)malloc(sizeof(char) * 101);

    char *html = "<A class=\"mw-jump-link\" HREF=\"#mw-head\">Jump to navigation</a>"
                     "<a class=\"mw-jump-link\" href=\"#p-search\">Jump to search</a>";
    printf("html: %s\n\n", html);

    printf("%d\n", sscanf(html, "<a href=\"%s", remain_html));
    printf("after first href tag: %s\n\n", remain_html);
    printf("%d\n", sscanf(remain_html, "%s\">", url));
    printf("first web: %s\n\n", url);
    printf("%d\n", sscanf(remain_html, "<a href=\"%s", remain_html));
    printf("after second href tag: %s\n\n", remain_html);

    free(remain_html);
    free(url);
}

The execution is:

pi@raspberrypi:/tmp $ ./a.out
html: <A class="mw-jump-link" HREF="#mw-head">Jump to navigation</a><a class="mw-jump-link" href="#p-search">Jump to search</a>

0
after first href tag: 

-1
first web: 

-1
after second href tag: 

pi@raspberrypi:/tmp $ 

So the first sscanf read nothing (0 elements), which means it does not set remain_html. That buffer is therefore still uninitialized when the next sscanf reads from it, which is undefined behavior.
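One way to avoid reading an unset buffer is to check the return value of sscanf before using the result. The following is only a sketch: the helper name scan_after_href is hypothetical, and the 1001-byte buffer size is assumed from the malloc in the question.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the original program).
   Tries to read the text that follows a leading <a href=" into out,
   which must have room for at least 1001 bytes.
   Returns 1 if sscanf converted the %s, 0 otherwise. */
int scan_after_href(const char *html, char *out)
{
    out[0] = '\0';  /* out is a valid empty string even when nothing matches */

    /* The width 1000 keeps %s from writing past the 1001-byte buffer. */
    int converted = sscanf(html, "<a href=\"%1000s", out);

    /* sscanf returns the number of conversions, or EOF on empty input. */
    return converted == 1;
}
```

With the HTML used above, scan_after_href(html, remain_html) returns 0 and leaves remain_html as an empty string, so a later printf or sscanf on it no longer reads uninitialized memory.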

Because of the format

"<a href=\"%s"

the first sscanf expects the input to start with

 <a href="

but html starts with

<A class=

which is different (the match is case-sensitive), so it stops at the second character and does not set remain_html.
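Even when the literal prefix does match, the second call in the question has another problem: %s reads up to the next whitespace, not up to the closing quote, so the `">` placed after it in the format never limits what is stored. A scanset such as %[^"] stops at the quote instead. A minimal sketch of the difference, where both helper names are hypothetical and the 101-byte buffer size is assumed from the question:

```c
#include <stdio.h>

/* Hypothetical helpers for illustration; out needs at least 101 bytes. */

/* %s: copies everything up to the next whitespace character. */
int scan_word(const char *s, char *out)
{
    out[0] = '\0';
    return sscanf(s, "%100s", out) == 1;
}

/* %[^"]: copies everything up to, but not including, the next double quote.
   Fails (returns 0) if s starts with a double quote or is empty. */
int scan_until_quote(const char *s, char *out)
{
    out[0] = '\0';
    return sscanf(s, "%100[^\"]", out) == 1;
}
```

Given the input `http://www.google.com">navigation</a>`, scan_word stores the whole input because it contains no whitespace, while scan_until_quote stores only `http://www.google.com`.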


sscanf is not the right tool here. Instead, search for the prefix <a href=" (which may be in uppercase), for instance with strcasestr, a nonstandard extension available in glibc and the BSDs, then extract the URL up to the closing ".

Example:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

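/* Fallback in case you do not have strcasestr (a nonstandard extension):
   case-insensitive search for needle in haystack */
char * strcasestr(char * haystack, char *needle)
{
  while (*haystack) {
    char * ha = haystack;
    char * ne = needle;

    /* tolower needs a value representable as unsigned char, hence the casts */
    while (tolower((unsigned char) *ha) == tolower((unsigned char) *ne)) {
      if (!*++ne)
        return haystack; /* reached the end of needle: match */
      ha += 1;
    }
    haystack += 1;
  }

  return NULL;
}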

int main(int argc, char* argv[]) {
  char *html = "<A HREF=\"http://www.google.com\">navigation</a>"
               "<a href=\"/a.html\">search</a>";
  char * begin = html;
  char * end;

  printf("html: %s\n", html);

  while ((begin = strcasestr(begin, "<a href=\"")) != NULL) {
    begin += 9; /* skip the 9 characters of <a href=" */
    end = strchr(begin, '"');

    if (end != NULL) {
      printf("found '%.*s'\n", (int) (end - begin), begin);
      begin = end + 1;
    }
    else {
      puts("invalid url");
      return -1;
    }
  }
}

Compilation and execution:

pi@raspberrypi:/tmp $ gcc -Wall a.c
pi@raspberrypi:/tmp $ ./a.out
html: <A HREF="http://www.google.com">navigation</a><a href="/a.html">search</a>
found 'http://www.google.com'
found '/a.html'
pi@raspberrypi:/tmp $ 
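The example above only prints each URL, whereas the question asks to store them separately. A possible sketch of that step follows, using only standard C functions. The names starts_with_ci and collect_urls are hypothetical, and the caller is assumed to supply the array and to free each stored string.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if s begins with prefix, ignoring case. */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char) *s) != tolower((unsigned char) *prefix))
            return 0;  /* also covers s ending before prefix does */
        s++;
        prefix++;
    }
    return 1;
}

/* Hypothetical helper: copies up to max URLs found after <a href=" into
   urls[], each as a separately malloc'ed string. Returns how many were
   stored. Stops early on an unterminated attribute or a failed malloc. */
size_t collect_urls(const char *html, char **urls, size_t max)
{
    static const char prefix[] = "<a href=\"";
    const size_t prefix_len = sizeof prefix - 1;
    size_t count = 0;
    const char *p = html;

    while (*p && count < max) {
        if (!starts_with_ci(p, prefix)) {
            p++;
            continue;
        }

        const char *begin = p + prefix_len;
        const char *end = strchr(begin, '"');
        if (end == NULL)
            break;  /* no closing quote */

        size_t len = (size_t) (end - begin);
        char *copy = malloc(len + 1);
        if (copy == NULL)
            break;

        memcpy(copy, begin, len);
        copy[len] = '\0';
        urls[count++] = copy;
        p = end + 1;  /* continue after the closing quote */
    }

    return count;
}
```

Called on the html string from the example with an array of, say, 4 pointers, collect_urls returns 2 and stores copies of http://www.google.com and /a.html in urls[0] and urls[1].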

Note: I know the second argument of strcasestr is in lowercase here, so calling tolower on *ne is unnecessary and *ne alone would be enough, but I have given a general definition of the function, independent of the current context.
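A limitation worth noting: the fixed prefix <a href=" only matches when href is the first attribute and is double-quoted, so a tag like <A class="mw-jump-link" HREF="#mw-head"> is skipped. One possible relaxation is to search for href= on its own and accept either quote character. This is only a sketch with a hypothetical helper name, next_href. It is not an HTML parser: it also matches href= in other tags or in plain text, and it ignores unquoted values.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the next href="..." or href='...' at or
   after p, ignoring the case of "href". On success stores the start and
   length of the value and returns a pointer just past the closing quote.
   Returns NULL when no further quoted href is found. */
const char *next_href(const char *p, const char **value, size_t *len)
{
    static const char key[] = "href=";

    for (; *p; p++) {
        size_t i = 0;

        /* Compare case-insensitively; stops at the end of p as well. */
        while (key[i] && tolower((unsigned char) p[i]) == key[i])
            i++;
        if (key[i] != '\0')
            continue;  /* "href=" does not start here */

        char quote = p[i];
        if (quote != '"' && quote != '\'')
            continue;  /* unquoted values are not handled */

        const char *begin = p + i + 1;
        const char *end = strchr(begin, quote);
        if (end == NULL)
            return NULL;  /* no matching closing quote */

        *value = begin;
        *len = (size_t) (end - begin);
        return end + 1;
    }

    return NULL;
}
```

Calling next_href in a loop, passing the returned pointer back in as p until it returns NULL, visits every quoted href value in turn; each value can then be printed with %.*s or copied as needed.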

bruno