Extract all URLs from HTML in C

Question

How can I extract all URLs from an HTML document using only the C standard library?

I am trying to do it with sscanf(), but valgrind reports an error (and I am not even sure the code will meet my requirement once it is debugged, so if there are other ways, please tell me). I stored the HTML content in a string. It contains multiple URLs, both absolute and relative (e.g. http://www.google.com, //www.google.com, /a.html, a.html and so on). I want to extract them one by one and store each one in a separate string.

I am also thinking about using strstr(), but then I do not know how to get the second URL.

My code using sscanf (asserts omitted here):

int
main(int argc, char* argv[]) {
    char *remain_html = (char *)malloc(sizeof(char) * 1001);
    char *url = (char *)malloc(sizeof(char) * 101);

    char *html = "<A HREF=\"http://www.google.com\">navigation</a>"
                 "<a href=\"/a.html\">search</a>";
    printf("html: %s\n\n", html);

    sscanf(html, "<a href=\"%s", remain_html);
    printf("after first href tag: %s\n\n", remain_html);
    sscanf(remain_html, "%s\">", url);
    printf("first web: %s\n\n", url);
    sscanf(remain_html, "<a href=\"%s", remain_html);
    printf("after second href tag: %s\n\n", remain_html);

    free(remain_html);
    free(url);
}

valgrind reports: Conditional jump or move depends on uninitialised value(s).

If anybody could help, thank you so much!

Is your question the first sentence in your post, or how to fix the error at the bottom? — Robert Harvey, Apr 10 '20 at 15:40
Because the error is very clear, but to understand it, you have to know what the words "conditional," "jump," "move" and "uninitialised" mean. — Robert Harvey, Apr 10 '20 at 15:42
Why reinvent the wheel? [Parse html using C](https://stackoverflow.com/q/1527883/1115360) dates back to 2009. Also, the answers to [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/q/1732348/1115360) contain information on why regexes and HTML don't go together. — Andrew Morton, Apr 10 '20 at 15:49
@Robert Harvey The first sentence is my final goal. I tried to solve it with the following code, and it gives an error. What I am asking is whether my code can produce what I want and, if so, how to modify it and fix the bug; if not, what would be a feasible way to solve it. — Emmm, Apr 10 '20 at 16:02

bruno · Accepted Answer · 2020-04-10T17:53:01.830

valgrind warns you about uninitialized data being used in a test. Since your program only calls sscanf and printf, the problem is very probably in your sscanf calls.

If I change your program a little to print the result of each sscanf, to show how many elements it read:

int
main(int argc, char* argv[]) {
    char *remain_html = (char *)malloc(sizeof(char) * 1001);
    char *url = (char *)malloc(sizeof(char) * 101);

    char *html = "<A class=\"mw-jump-link\" HREF=\"#mw-head\">Jump to navigation</a>"
                     "<a class=\"mw-jump-link\" href=\"#p-search\">Jump to search</a>";
    printf("html: %s\n\n", html);

    printf("%d\n", sscanf(html, "<a href=\"%s", remain_html));
    printf("after first href tag: %s\n\n", remain_html);
    printf("%d\n", sscanf(remain_html, "%s\">", url));
    printf("first web: %s\n\n", url);
    printf("%d\n", sscanf(remain_html, "<a href=\"%s", remain_html));
    printf("after second href tag: %s\n\n", remain_html);

    free(remain_html);
    free(url);
}

The execution is:

pi@raspberrypi:/tmp $ ./a.out
html: <A class="mw-jump-link" HREF="#mw-head">Jump to navigation</a><a class="mw-jump-link" href="#p-search">Jump to search</a>

0
after first href tag: 

-1
first web: 

-1
after second href tag: 

pi@raspberrypi:/tmp $

So the first sscanf read nothing (0 elements). That means it did not set remain_html, which is still uninitialized when the next sscanf reads it, and that is undefined behavior.

Because of the format

"<a href=\"%s"

the first sscanf expects input starting with

 <a href="

but html starts with

<A class=

which is different, so matching stops at the second character and remain_html is never set.

sscanf is not the right tool here. Search for the prefix <a href=", which may be in uppercase, for instance with strcasestr, then extract the URL up to the closing ".

Example:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

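/* in case you do not have that function */
char * strcasestr(const char * haystack, const char * needle)
{
  /* an empty needle matches at the start */
  if (!*needle)
    return (char *) haystack;

  while (*haystack) {
    const char * ha = haystack;
    const char * ne = needle;

    /* cast to unsigned char: tolower is undefined for negative char values */
    while (tolower((unsigned char) *ha) == tolower((unsigned char) *ne)) {
      if (!*++ne)
        return (char *) haystack;
      ha += 1;
    }
    haystack += 1;
  }

  return NULL;
}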

int main(int argc, char* argv[]) {
  char *html = "<A HREF=\"http://www.google.com\">navigation</a>"
               "<a href=\"/a.html\">search</a>";
  char * begin = html;
  char * end;

  printf("html: %s\n", html);

  while ((begin = strcasestr(begin, "<a href=\"")) != NULL) {
    begin += 9; /* bypass the header */
    end = strchr(begin, '"');

    if (end != NULL) {
      printf("found '%.*s'\n", (int) (end - begin), begin);
      begin = end + 1;
    }
    else {
      puts("invalid url");
      return -1;
    }
  }
}

Compilation and execution:

pi@raspberrypi:/tmp $ gcc -Wall a.c
pi@raspberrypi:/tmp $ ./a.out
html: <A HREF="http://www.google.com">navigation</a><a href="/a.html">search</a>
found 'http://www.google.com'
found '/a.html'
pi@raspberrypi:/tmp $

Note: I know the second argument of strcasestr is in lower case here, so applying tolower to *ne is unnecessary and *ne alone would be enough, but I gave a definition of the function that also works outside the current context.

Sorry, I changed to an inappropriate html for some reason, but if the given html contains two matching strings, it also gives that error. — Emmm, Apr 10 '20 at 16:08
There are several URLs in the string, inside tags, and they include both absolute and relative URLs, e.g. http://www.google.com, //www.google.com, /a.html, a.html. — Emmm, Apr 10 '20 at 16:39
@Emmm I cannot work from a description; what exactly is the new value of *html*? — bruno, Apr 10 '20 at 16:43
It's quite long.. The part of it would be:
URL Paths
Absolute vs. Relative Path

Submit Sites
to Search Engines — Emmm, Apr 10 '20 at 16:55
@Emmm I edited my answer to give you a proposal doing the job — bruno, Apr 10 '20 at 17:42
@Emmm you do not seem interested in my answer, so should I delete it? — bruno, Apr 11 '20 at 14:01
Sorry that I didn't check the message, I was dealing with other things. Thanks for your answer. So I think I get the point: I can use sscanf only when I know exactly where the start and end are? So in my situation, I need to use strcasestr to find the start. — Emmm, Apr 13 '20 at 07:39
@Emmm yes, *scanf* is a very limited parser, and when the format contains explicit text (outside of a % conversion) the input must match exactly that text. In that case *scanf* cannot be used — bruno, Apr 13 '20 at 07:51
