1

In my last question I asked about parsing the links out of an HTML page. Since I haven't found a solution yet, I thought I'd try something else in the meantime: search for every <a href= and copy whatever is there until I hit a </a>.

Now, my C is a bit rusty, but I do remember I can use strstr() to get the first instance of that string. How do I get the rest?

Any help is appreciated.

PS: No, this is not homework for school or anything like that. Just so you know.

Community
Mr Aleph

5 Answers

4

You can use a loop:

char   *ptr = haystack;
size_t nlen = strlen (needle);

while (ptr != NULL) {
  ptr = strstr (ptr, needle);
  if (ptr != NULL) {
    // do whatever with ptr
    ptr += nlen;  // hat tip to @larsman
  }
}
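As a minimal sketch of the same loop wrapped into something self-contained, the function below counts non-overlapping matches. The name count_matches and the decision to treat an empty needle as "no matches" are assumptions made for illustration, not part of the original snippet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of needle in haystack.
   count_matches is a hypothetical helper name, not a standard function. */
size_t count_matches(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    size_t nlen = strlen(needle);
    size_t count = 0;

    if (nlen == 0)
        return 0;   /* an empty needle would match forever without advancing */

    const char *ptr = haystack;
    while ((ptr = strstr(ptr, needle)) != NULL) {
        ++count;        /* ptr points at the start of this match */
        ptr += nlen;    /* step past the match before searching again */
    }
    return count;
}
```

Folding the strstr() call into the loop condition checks for NULL right after each search, so the body only ever runs on a real match.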
chrisaycock
  • Loops infinitely if `needle` is found at least once. You have to move past the match in every iteration. Also, you have to check for `NULL` *after* `strstr`. – Fred Foo Mar 02 '11 at 15:25
  • Given the OP's pattern, I'd do `ptr += strlen(needle)` (or better, `size_t nlen = strlen(needle)` before the loop). – Fred Foo Mar 02 '11 at 15:31
  • Thanks. How do I check that I reached the `</a>`? – Mr Aleph Mar 02 '11 at 15:55
  • @Mr Aleph: The `</a>` is one of your needles. You'll have to search for ` – chrisaycock Mar 02 '11 at 16:02
  • strncpy() or something like that to get the string out? – Mr Aleph Mar 02 '11 at 16:10
  • Please be aware that parsing structured documents this way is horribly error-prone, unless you are dealing with known pre-set documents. Otherwise, what will happen to tags that have different order of keywords in them, such as . – Gnudiff Mar 02 '11 at 16:30
  • @Gnudiff Agreed. I advised Xerces on @Andrew White's answer, but the OP seems adamantly against a library. – chrisaycock Mar 02 '11 at 17:03
3

Why not use libxml which has a very good HTML parser built in?

Andrew White
1

Okay, the original answer and my comments seemed to require more information than fits comfortably in the comments section, so I decided to write a new answer.

First off, what you are attempting to do IS a programming task already, which WILL require some programming aptitude, depending on your exact needs.

Secondly, there have been some answers provided that suggest you use loops of char finding and regexps. Both of these are horribly error-prone ways to do things, as discussed, for example, here.

The normal way to parse HTML/XML nowadays is to use an external library designed for it. These libraries are by now more or less standard, and many programming languages ship one built in.

For your particular needs, I am rusty on both C and XPath, but it should work approximately like this:

  • start up an XML/HTML parser
  • load your HTML document into it as a character string
  • tell the parser to find all instances of the <a> tag (using XPath)
  • it will return a "set of nodes" to you
  • process the set of nodes in a loop, doing whatever you need with each tag

I found some other examples, maybe this one is better: http://xmlsoft.org/example.html

As you can see there, the example uses an XML document. HTML is not strictly XML, but that should not matter much here: libxml also includes an HTML parser, so your HTML document should work too.

In Python or a similar language this would be extremely easy; in pseudocode it would look like this:

p=new HTMLParser
p->load(my html document)
resultset=p->XPath_Search("//a") # this will find all A elements in the HTML document
for each result of resultset:
   write(result.href)
end for

This would write out the HREF part of every A element in the document. A decent tutorial on what you can use XPath for is, e.g., here.

I am afraid in C this would be somewhat more convoluted, but the idea is the same and it IS a programming task.

If this is some quick-and-dirty work, you might use the suggested strstr() or regexp searches, with no external libraries. However, please keep in mind that, depending on your exact task, you are very likely to miss a number of outgoing links or misread their contents.
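To make the "missed links" point concrete, here is a small sketch. The function name finds_literal_href and the sample strings are made up for illustration; each sample is markup a browser would treat as a link, yet only the first contains the literal text "<a href=".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the literal needle "<a href=" occurs in html, else 0.
   finds_literal_href is a hypothetical name used only for this sketch. */
int finds_literal_href(const char *html)
{
    assert(html != NULL);
    return strstr(html, "<a href=") != NULL;
}

/* Hypothetical inputs: all four are links, only the first is matched. */
const char *const link_samples[] = {
    "<a href=\"/one\">one</a>",                /* matched */
    "<a class=\"nav\" href=\"/two\">two</a>",  /* attribute order differs */
    "<A HREF=\"/three\">three</A>",            /* upper-case tag and attribute */
    "<a  href=\"/four\">four</a>",             /* two spaces after <a */
};
```

A literal search can be patched for each of these cases one by one, but that is roughly where a real parser starts to pay off.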

Gnudiff
0

C strings are just pointers to the first character; to get the next match, simply call strstr() again and pass it a pointer to the end of the previous match.
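A rough sketch of that idea applied to the question's markers, assuming the text is NUL-terminated and that every link literally starts with "<a href=" and ends with "</a>". The name next_anchor and its out-parameters are hypothetical, chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Finds the next "<a href=" ... "</a>" span at or after `from`.
   On success, stores where the span starts (just after "<a href=") and its
   length (up to, not including, "</a>"), and returns a pointer just past
   "</a>" that can be passed back in to continue the search.
   Returns NULL when no complete span is left. */
const char *next_anchor(const char *from, const char **span, size_t *len)
{
    static const char open_tag[]  = "<a href=";
    static const char close_tag[] = "</a>";

    assert(from != NULL && span != NULL && len != NULL);

    const char *start = strstr(from, open_tag);
    if (start == NULL)
        return NULL;
    start += sizeof open_tag - 1;        /* skip the opening marker */

    const char *end = strstr(start, close_tag);
    if (end == NULL)
        return NULL;                     /* unterminated link */

    *span = start;
    *len  = (size_t)(end - start);
    return end + (sizeof close_tag - 1); /* resume after the closing marker */
}
```

A caller would start with a pointer to the beginning of the page and keep passing the return value back in until it gets NULL, copying `len` bytes from `span` (for instance with memcpy() into a buffer of `len + 1` bytes) on each round.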

Arkku
0

Here is what I would do (not tested, just my idea):

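const char *hRef_start = "<a href=";
const char *hRef_end   = "</a>";

/* Defined below; declared here so it can be called before its definition. */
void copy_link(char *link, const char *first, const char *last);

Assume your text is in

char text[1000];   /* must hold the NUL-terminated HTML before searching */
char *first = strstr(text, hRef_start);
if (first)
{
    char *last = strstr(first, hRef_end);
    if (last)
    {
        /* Copy [first, last): from "<a href=" up to, not including, "</a>". */
        char *link = malloc((size_t)(last - first) + 1);
        if (link)
        {
            copy_link(link, first, last);
            /* ... use link, then free(link) ... */
        }
    }
    else
    {
        /* Error here: no closing </a> was found. */
    }
}

/* Copies the characters in [first, last) into link and NUL-terminates it. */
void copy_link(char *link, const char *first, const char *last)
{
    while (first < last)
    {
        *link = *first;
        ++link;    /* advance the destination as well as the source */
        ++first;
    }
    *link = 0;
}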

You should check whether malloc() succeeded and make sure you free() the result. Also make sure in copy_link() that none of the arguments is NULL.