Finding all instances of a substring in a string

Question

In my last question I asked about parsing the links out of an HTML page. Since I haven't found a solution yet, I thought I'd try something else in the meantime: search for every <a href= and copy whatever is there until I hit a </a>.

Now, my C is a bit rusty, but I do remember I can use strstr() to get the first instance of that string. How do I get the rest?

Any help is appreciated.

PS: No, this is not homework for school or anything like that. Just so you know.

Bad, bad idea, doomed to failure. What happens when you hit an `` tag? Use an XML parser. — user229044, Mar 02 '11 at 15:24
Thanks. I know it's a bad idea, but I haven't found an XML parser that isn't uber complicated and that has a good example of how to do this. If you know of one (plus example code), please do send it my way — Mr Aleph, Mar 02 '11 at 15:35

chrisaycock · Accepted Answer · 2011-03-02T15:37:57.623

4

You can use a loop:

char   *ptr = haystack;
size_t nlen = strlen (needle);

while (ptr != NULL) {
  ptr = strstr (ptr, needle);
  if (ptr != NULL) {
    // do whatever with ptr
    ptr += nlen;  // hat tip to @larsman
  }
}
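Wrapped into a function, the loop might look like the sketch below. The name `count_matches` is made up for illustration, and the empty-needle guard is an added assumption: strstr() returns its first argument when the needle is empty, so without the guard the pointer would never advance.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts non-overlapping occurrences of needle in
   haystack by restarting strstr() just past each match. */
size_t count_matches(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    size_t count = 0;
    const char *ptr = haystack;

    if (nlen == 0)
        return 0; /* guard: an empty needle would never advance ptr */

    while ((ptr = strstr(ptr, needle)) != NULL) {
        ++count;     /* "do whatever with ptr" goes here */
        ptr += nlen; /* move past this match before searching again */
    }
    return count;
}
```

Folding the strstr() call into the loop condition means the NULL check always happens right after the search, which is the point raised in the comments below.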

edited Mar 02 '11 at 15:37

answered Mar 02 '11 at 15:22

chrisaycock


Loops infinitely if `needle` is found at least once. You have to move past the match in every iteration. Also, you have to check for `NULL` *after* `strstr`. – Fred Foo Mar 02 '11 at 15:25
Given the OP's pattern, I'd do `ptr += strlen(needle)` (or better, `size_t nlen = strlen(needle)` before the loop). – Fred Foo Mar 02 '11 at 15:31
Thanks. How do I check that I reached the ? – Mr Aleph Mar 02 '11 at 15:55
@Mr Aleph: The `` is one of your needles. You'll have to search for ` – chrisaycock Mar 02 '11 at 16:02
strncpy() or something like that to get the string out? – Mr Aleph Mar 02 '11 at 16:10
Please be aware that parsing structured documents this way is horribly error-prone, unless you are dealing with known pre-set documents. Otherwise, what will happen to tags that have different order of keywords in them, such as . – Gnudiff Mar 02 '11 at 16:30
@Gnudiff Agreed. I advised Xerces on @Andrew White's answer, but the OP seems adamantly against a library. – chrisaycock Mar 02 '11 at 17:03

score 3 · Answer 2 · answered Mar 02 '11 at 15:22

Why not use libxml which has a very good HTML parser built in?

Andrew White


I'm trying not to use external libs, especially if they are GPL, but I did already check that lib. However, I cannot find a good example of how to do this. If you have a good example of how to parse links out of an HTML page using libxml, I am willing to use it. Thanks – Mr Aleph Mar 02 '11 at 15:34
Here are examples: http://xmlsoft.org/tutorial/index.html What I would do personally is use libxml's XPath, because it is the easiest way to get array of ALL s in document with one query. I am a bit rusty on Xpath, but I think the query was simply: "/a" or something like that, to find all elements in the document. I would consider all the strstr examples as 19th century. This is not how things should be done nowadays anymore. – Gnudiff Mar 02 '11 at 15:37
@Mr Aleph: If you don't want GPL, try [Apache Xerces](http://xerces.apache.org/). – chrisaycock Mar 02 '11 at 15:40
@chrisaycock I tried compiling that on Windows and I couldn't. Most of the errors were on the fact that I am using a C compiler and code in this lib is C++. Thanks tho. – Mr Aleph Mar 02 '11 at 15:49
@Gnudiff thanks. I checked example D: Code for XPath Example and I have no idea what it is doing or how to use it. It's always the same with 3rd party libs: unless you understand how the author thinks, you will have a hard time understanding his / her code. I am not a programmer, I am an engineer in need of writing code, so my reading of someone else's code is not very good. – Mr Aleph Mar 02 '11 at 15:52

Gnudiff · Answer 3 · 2011-03-02T16:34:39.970

Okay, the original answer and my comments seemed to require more information than fits comfortably in the comment section, so I decided to create a new answer.

First off, what you are attempting to do IS a programming task already, which WILL require some programming aptitude, depending on your exact needs.

Secondly, there have been some answers provided that suggest you use loops of char finding and regexps. Both of these are horribly error-prone ways to do things, as discussed, for example, here.

The normal way for parsing HTML/XML stuff nowadays is by using an external library designed for this. In fact these libraries are by now sort of standard and in many programming languages they are already built-in.

For your particular needs, I am rusty on both C and XPath, but it should work approximately like this:

Start up an XML/HTML parser.
Load your HTML document into it as a character string.
Tell the parser to find all instances of the A tag (using XPath).
It will return a "set of nodes" to you.
Process the set of nodes in a loop, doing whatever you need with each tag.

I found some other examples, maybe this one is better: http://xmlsoft.org/example.html

As you can see there, the example uses an XML document. HTML is not strictly a subset of XML, but as mentioned in the other answer libxml has an HTML parser built in, so the same approach should work for your HTML document too.

In Python or a similar language this would be extremely easy. In pseudocode it would look like this:

p=new HTMLParser
p->load(my html document)
resultset=p->XPath_Search("//a") # this will find all A elements in the HTML document
for each result of resultset:
   write(result.href)
end for

This would write out the HREF part of every A element in the document. A decent tutorial on what you can use XPath for is e.g. here.

I am afraid in C this would be somewhat more convoluted, but the idea is the same and it IS a programming task.

If this is some quick-and-dirty work, you might use the suggested strstr() or regexp searches, with no external libraries. However, please keep in mind that, depending on your exact task, you are very likely to miss a number of outgoing links or misread their contents.

score 0 · Answer 4 · answered Mar 02 '11 at 15:21

C strings are just pointers to the first character; to get the next match, simply call strstr() again and pass it the pointer to the end of the previous match you got.
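As a rough sketch of that idea: each new search starts from a pointer derived from the previous result, and how far past the match it starts decides whether overlapping matches are counted. The helper name `count_from` and its `step` parameter are hypothetical, introduced only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts matches of needle in s, resuming each search
   `step` characters after the start of the previous match.
   step == strlen(needle) skips overlapping matches; step == 1 counts them. */
size_t count_from(const char *s, const char *needle, size_t step)
{
    size_t nlen = strlen(needle);
    size_t count = 0;
    const char *p;

    if (nlen == 0 || step == 0)
        return 0;    /* guard: these cases would never make progress */
    if (step > nlen)
        step = nlen; /* never jump beyond the end of the matched text */

    p = strstr(s, needle);
    while (p != NULL) {
        ++count;
        p = strstr(p + step, needle); /* search again from the new start */
    }
    return count;
}
```

For a pattern like `<a href=` overlaps cannot occur, so resuming at the end of the previous match is the natural choice; the difference only shows with needles such as "aa" in "aaaa".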

Arkku


score 0 · Answer 5 · 2011-03-02T16:18:49.070

Here is what I would do (not tested, just my idea):

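#include <stdlib.h>  // malloc, free
#include <string.h>  // strstr

const char *hRef_start = "<a href=";
const char *hRef_end   = "</a>";

Assume your text is in

char text[1000];  // assumed to already hold the NUL-terminated page
char *first = strstr(text, hRef_start);
if (first)
{
    char *last = strstr(first, hRef_end);
    if (last)
    {
        // Copy [first, last): everything from "<a href=" up to "</a>".
        char *link = malloc((last - first + 1) * sizeof(char));
        if (link)
            copy_link(link, first, last);
    }
    else
    {
        // Error here: no closing </a> was found.
    }
}

// Copies the characters in [first, last) into link and NUL-terminates it.
void copy_link(char *link, const char *first, const char *last)
{
    while (first < last)
    {
        *link = *first;
        ++link;   // the destination has to advance too
        ++first;
    }
    *link = 0;
}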

You should check that malloc() succeeded and make sure you free() the link when you are done. Also make sure in copy_link() that none of the arguments is NULL.
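To get every link rather than only the first one, the same search can be repeated from just past each closing tag. The following is only a sketch under the same assumptions as the snippet above (tags spelled exactly `<a href=` and `</a>`, no nesting, NUL-terminated text); `next_link` is a hypothetical helper name, and it uses memcpy() in place of a hand-written copy loop.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: finds the next "<a href=" ... "</a>" span at or after
   *cursor, returns a malloc'd copy of it (without the closing tag), and
   advances *cursor past the closing tag. Returns NULL when there is no
   further complete link or when malloc fails. The caller frees the result. */
char *next_link(const char **cursor)
{
    static const char start_tag[] = "<a href=";
    static const char end_tag[] = "</a>";
    const char *first, *last;
    size_t len;
    char *link;

    first = strstr(*cursor, start_tag);
    if (first == NULL)
        return NULL;
    last = strstr(first, end_tag);
    if (last == NULL)
        return NULL;

    len = (size_t)(last - first);
    link = malloc(len + 1);
    if (link == NULL)
        return NULL;
    memcpy(link, first, len);
    link[len] = '\0';

    *cursor = last + strlen(end_tag); /* resume after this link next time */
    return link;
}
```

A caller would start with `const char *cur = text;` and keep calling `next_link(&cur)` until it returns NULL, freeing each returned string once it has been used.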
