1

Im sure this is easily done with regex, just haven't had much experience.
EG, given

char *mystring= blah blah <i>(this is not needed)</i> (Nor this). This is.

it would return

char *return_str = blah blah  . This is.
tomatosource
  • 256
  • 2
  • 12
  • 1
    in fact, the requirement to support nested brackets (or elements) makes it somewhat harder to achieve with regex than it would be otherwise. This is the reason why you'll always get people here shouting loudly that one should never use regex to handle HTML. It can be done for limited known chunks of code, but gets unstuck when you don't know what or how many elements will be there. – Spudley Jun 28 '11 at 17:55
  • 1
    Yeah, your question implies you want to support parentheses nested to arbitrary depth, but regular expressions are not computationally complex enough to solve this general case. (see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Mu Mind Jun 28 '11 at 17:57
  • 2
    Specifically not `blah_blah__._This_is.`(where the underscores are spaces - comment formatting appears to collapse two consecutive spaces even in backticks), or do you just not care about whitespace? – Steve Jessop Jun 28 '11 at 18:05

1 Answers1

4

Even though your question is tagged regex, a regex is not the correct solution.

What you probably want to do is write a simple pushdown automaton.

Here is a really simple example:

char* strip_parens(char* string) {
    int len = strlen(string);
    char* result = malloc(len + 1);
    int num_parens = 0;
    int i = 0, j = 0;
    for(; i < len; i++) {
        char c = string[i];
        if(c == '(') {
            num_parens++;
        } else if(c == ')' && num_parens > 0) {
            num_parens--;
        } else if(num_parens == 0) {
            result[j] = c;
            j++;
        }
    }
    result[j] = '\0';
    return result;
}

I'm not even sure that this qualifies as a pushdown automaton because it uses a simple counter and not a stack, but the concept is similar.

This one only does the parentheses, but it should be simple enough to demonstrate the technique.

Community
  • 1
  • 1
Daniel Pryden
  • 59,486
  • 16
  • 97
  • 135
  • Ok good to know. But lets say that I know that 's and parentheses will not be nested. To be extra clear, I don't want to use this as html code ever again. I just need to get the first link not in parentheses or in italics out. So I'm unsure why some simple pattern replacing "*" for "" and "(*)" for "" can't do the job some how? – tomatosource Jun 28 '11 at 18:04
  • @tomatosource: If you *know* that the parentheses will never be nested, then yes, a simple regex will do the trick. But that's not what you originally asked for. – Daniel Pryden Jun 28 '11 at 18:07
  • 1
    @tomatosource: Note that even if the parentheses aren't nested, doing a simple loop over the chars is O(n) and has a trivial overhead. It's unlikely that a regex will be faster, and a regex certainly won't be any easier to debug. – Daniel Pryden Jun 28 '11 at 18:12
  • Yes and the nature of my problem Im only dealing with relatively small strings. OK just saw your above code, thank you very much. Is there a way to locate instances of certain sub strings within mystring so I can use a similar method to scrap the italic sections? – tomatosource Jun 28 '11 at 18:15
  • @tomatosource: Yeah, you just need a small stack (a short string will do). When you encounter a `<`, append it and any subsequent characters to the stack until you encounter a `>` or determine that you're not looking at a tag (e.g., you encounter another `<` or whitespace). When you encounter a `>`, check if your stack contains `""` or `""` and respond accordingly. – Daniel Pryden Jun 28 '11 at 18:19
  • @Daniel, that algorithm is insufficient for XHTML with CDATA sections - eg, `<![CDATA[>]]>`. Also HTML tags may contain whitespace. – bdonlan Jun 28 '11 at 18:23
  • Thanks, thats got it. Just in case anyone is wondering I'm making a terribly inefficient program to find the average wikipedia articles depth from "Philosophy" according to http://xkcd.com/903/ tooltip. – tomatosource Jun 28 '11 at 18:23
  • @bdonlan: Good point. I started working on an example and quickly got mired in how complex it is, even if you're only interested in one tag and can ignore the rest. A full HTML parser is of course the correct answer here. – Daniel Pryden Jun 28 '11 at 18:32
  • @tomatosource: Neat idea... I'd be curious to see what your result is. Is there a particular reason you're using C to do this job, though? I would think something like Python (together with a library like [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)) would be a lot easier for this kind of thing. – Daniel Pryden Jun 28 '11 at 18:33
  • I'll make sure to throw it here when I'm done. Main reason for C is it's it's the only language I'm comfortable with. I'm a second year software engineering student, this is just a holiday thing to make sure I don't get too rusty. – tomatosource Jun 28 '11 at 18:40
  • 1
    @tomatosource: when that XKCD cartoon was published, I found on my first attempt a cycle that didn't visit "Philosophy". Can't remember where I started, or I'd give it to you as a test case, but be sure to put cycle-detection into your program :-) – Steve Jessop Jun 28 '11 at 18:40
  • Well here is the result of the first run through-- AVERAGE OF 1000 ARTICLES IS 15.398000. 7 pages haven't made it. – tomatosource Jul 08 '11 at 05:19