Qt - Regex to filter rich-text string and replace substrings

Question

I have a QString of rich text more or less in this format:

<span background-color="red"><a name='item1'></a> property1 </span> + <span background-color="blue"><a name='item2'></a> property2 </span>

It can have more tags, but all will have the same structure. Also, between each tag, operators will show up - this is a string that is supposed to represent a calculation.

I need a regex to traverse the string and extract both the item1, item2, ... substrings and the property1, property2, ... substrings, so that I can then look up a value for each one, which I have stored somewhere else.

Then, after retrieving these values (say property1=value1 and property2=value2), I need to create another string like:

value1 + value2

This string will be evaluated to compute the calculation.

What would be the regex to read the string?

What would be the regex to replace in a copied string?

NOTE I do not intend to parse HTML with these regexps. The string of rich-text I need to filter has at most the tags and structure represented above. It will not have other types of tags, nor will it have other attributes besides the ones in the example string above. It can only have more examples of that same tag structure: a span, containing an anchor tag with a name attribute and some text to display.

NOTE2 @Passerby posted in the comments of this question a link to a solution that comes very close. I forgot one (hopefully small) detail about my objective: I also need to capture whatever sits between the span blocks as a string, instead of simply checking for a single char like @Passerby (very well) suggested. Any ideas?

NOTE3 I actually still argue that this is not the same question as the duplicate marked one. While the strings I am filtering look like HTML, they are actually rich-text. They will always have this rigid structure/format, so RegEx is perfectly viable for what I need to do. After some great comments I got from a few users, namely @Passerby, I decided to go for it and this works perfectly for what I need:

Sample string:

<span background-color="red"><a name='item1'></a> property1 </span> + 300 * <span background-color="blue"><a name='item2'></a> property2 </span> + Math.sqrt(<span background-color="green"><a name='item3'></a> property3 </span>)

Regex:

/<span.*?><a name='(.*?)'><\/a>\s*(.*?)\s*<\/span>(((.*?)?)(?=<)|)/g

Outputs:

MATCH 1 
1. [38-43] `item1` 
2. [50-59] `property1` 
3. [67-76] ` + 300 * ` 
4. [67-76] ` + 300 * ` 
5. [67-76] ` + 300 * ` 
MATCH 2 
1. [115-120] `item2` 
2. [127-136] `property2` 
3. [144-157] ` + Math.sqrt(` 
4. [144-157] ` + Math.sqrt(` 
5. [144-157] ` + Math.sqrt(` 
MATCH 3 
1. [197-202] `item3` 
2. [209-218] `property3` 
3. [226-226] (null, matches any position)
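For reference, the same extraction can be sketched in C++. This is a minimal sketch that uses `std::regex` rather than a Qt class so that it compiles without Qt; the `Term` struct and the `extractTerms` helper are hypothetical names, not part of any library. It also swaps the lookahead for `[^<]*`, so text after the last span (the closing `)` in the sample) is captured too, while text before the first span is not.

```cpp
#include <regex>
#include <string>
#include <vector>

// One span from the rich-text expression plus the plain text that follows it.
struct Term {
    std::string item;      // value of the name attribute, e.g. "item1"
    std::string property;  // visible text inside the span, e.g. "property1"
    std::string trailing;  // text up to the next '<' or the end of the input
};

// Hypothetical helper: collects every span block in order of appearance.
std::vector<Term> extractTerms(const std::string& richText) {
    // Group 1: anchor name, group 2: property text, group 3: text after </span>.
    static const std::regex spanRe(
        R"re(<span[^>]*><a name='([^']*)'></a>\s*(.*?)\s*</span>([^<]*))re");
    std::vector<Term> terms;
    for (std::sregex_iterator it(richText.begin(), richText.end(), spanRe), end;
         it != end; ++it) {
        terms.push_back({(*it)[1].str(), (*it)[2].str(), (*it)[3].str()});
    }
    return terms;
}
```

Calling `extractTerms` on the sample string above should yield three entries, one per span, each carrying the operator text that follows it.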

@Bakuriu How come? That question does not explain how to extract the attribute's value substring. — Joum, Jul 10 '13 at 09:23
@Joum That answer simply says "don't parse [X]HTML with regex". — Passerby, Jul 10 '13 at 09:30
You should use an HTML parser. Regexps are not good at HTML parsing. — Pavel Strakhov, Jul 10 '13 at 09:31
See [Handling HTML in Qt](http://qt-project.org/wiki/Handling_HTML#e5de527c3b26d4ded7c44e89b22f3a9e). You can use QtWebKit or libraries mentioned in "Manual HTML processing" section. — Pavel Strakhov, Jul 10 '13 at 09:40
Funny how this apparently much more complex but similar task got a totally different reaction here in SO: http://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string — Joum, Jul 10 '13 at 09:41
Not sure if this helps: http://qt-project.org/wiki/Handling_HTML — Passerby, Jul 10 '13 at 09:49
Yeah, it helps. I just don't understand why a Regex can't be used to find certain blocks in a string. This _isn't_ HTML; There is not and there will not be _nesting_ of tags; If it can be found, it can also be replaced, I think. Too bad I don't know/understand regex... — Joum, Jul 10 '13 at 09:53
@Joum Still you should not _parse_ HTML with regex (the most upvoted answer in your posted QA was (re-)making a parser), but if your content is guaranteed to be that simple and regular, you may try this: http://regex101.com/r/mJ0wU9 — Passerby, Jul 10 '13 at 10:35
@Passerby this is (almost) exactly what I needed. But actually I forgot (what I think is) a minor detail: instead of simply catching the `char` outside the tags for the operation, is it possible to catch everything as a string? — Joum, Jul 10 '13 at 11:39
@Passerby didn't quite cut it, but I got there with a little testing of my own... Priceless tips though, thank you sooo much! — Joum, Jul 17 '13 at 08:10

Sebastian Lange · Answer 1 · 2013-07-10T11:40:31.697

This would probably be something like:

QRegExp rx("^(?:\\<span background-color=\"red\"\\>\\<a name=')(\\w+)(?:'\\>\\</a\\>)\\s*(\\d+)\\s*(?:\\</span\\>)\\s*(\\+)\\s*(?:\\<span background-color=\"blue\"\\>\\<a name=')(\\w+)(?:'\\>\\</a\\>)\\s*(\\d+)\\s*(?:\\</span\\>)$");

rx.indexIn(myText);
qDebug() << rx.cap(1) << rx.cap(2) << rx.cap(3) << rx.cap(4) << rx.cap(5);
// prints the five captures: item, property, operator, item, property

This assumes each item is a single word and each property is a number. I did something very similar in a calculator for our software.

The trick is to start with small bits:

QRegExp rx("\\<a name='(\\w+)'\\>");

which captures only the item. Then add the next bit, and keep going until the pattern covers the whole line the way you want it. Regular expressions can be very powerful but also very frustrating.

Good luck

Edit: Every capturing group () can be referenced as \1, \2, ... in the replace function. Non-capturing groups (?:) are not captured! So:

QString text = "My Text";
text.replace(QRegExp("^My( Text)$"), "His\\1");
// text is now "His Text"
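A fixed back-reference does not cover the case in the question, where each span has to be swapped for a value looked up elsewhere. One way to handle that is to walk the matches and build the result piece by piece. The following is a rough sketch under a few assumptions: it uses `std::regex` instead of QRegExp so that it compiles without Qt, `substituteValues` is a hypothetical helper name, and the values are assumed to live in a `std::map` keyed by the property text.

```cpp
#include <map>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: replaces each <span>...</span> block with the value
// stored for its property text and copies everything else through unchanged.
std::string substituteValues(const std::string& richText,
                             const std::map<std::string, std::string>& values) {
    // Group 1: anchor name, group 2: property text.
    static const std::regex spanRe(
        R"re(<span[^>]*><a name='([^']*)'></a>\s*(.*?)\s*</span>)re");
    std::string result;
    auto copiedUpTo = richText.cbegin();
    for (std::sregex_iterator it(richText.cbegin(), richText.cend(), spanRe), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        result.append(copiedUpTo, m[0].first);   // text before this span
        const auto found = values.find(m[2].str());
        if (found == values.end())
            throw std::runtime_error("no value for property: " + m[2].str());
        result += found->second;                 // the looked-up value
        copiedUpTo = m[0].second;                // continue after </span>
    }
    result.append(copiedUpTo, richText.cend());  // tail after the last span
    return result;
}
```

With property1 mapped to "2" and property2 mapped to "3", the two-span string from the question should come out as `2 + 3`, ready to be handed to whatever evaluates the calculation.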

Did you see @Passerby's comment to the question section? Can you share some insight? — Joum, Jul 10 '13 at 12:39
Since you are not going to parse random HTML/XHTML pages but a given static setup that always creates the same tags, or at least only minor variations, you are safe with RegExp. But using an HTML parser is still a possibility. — Sebastian Lange, Jul 15 '13 at 05:57

score 0 · Answer 2 · answered Jul 10 '13 at 10:42

I don't understand regexps either. With this kind of parsing problem I would use a quick and (maybe) dirty solution like this:

QString str = "<span background-color='red'><a name='item1'></a> property1 </span> + <span background-color='blue'><a name='item2'></a> property2 </span>";
QStringList slist = str.split("<");

qDebug() << slist;

foreach (QString s, slist)
{
    if (s.startsWith("/a"))
    {
        qDebug() << "property:" << s.split(" ")[1];
    }
    else if (s.startsWith("a name"))
    {
        qDebug() << "item:" << s.split("'")[1];
    }
    else if (s.startsWith("/span>"))
    {
        QString op = s.mid(6).trimmed();
        if (op != "")
            qDebug() << "operator:" << op;
    }
}

And output is:

item: "item1" 
property: "property1" 
operator: "+" 
item: "item2" 
property: "property2"

Of course, this will break down if the format changes. But so will the regexp.

If the format were any more complicated, I would try to change it to valid XML and then use Qt's XML classes to parse the data.

If you end up using this kind of solution, I really recommend adding some additional validity checks.
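As an illustration of such a check, here is a coarse, regex-free sketch in standard C++ (no Qt, so it compiles on its own). Both function names are hypothetical. It only counts tags and does not verify their order, and it assumes the operator text between spans never contains a `<` character.

```cpp
#include <cstddef>
#include <string>

// Counts non-overlapping occurrences of needle in text.
std::size_t countOf(const std::string& text, const std::string& needle) {
    std::size_t n = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size()))
        ++n;
    return n;
}

// Hypothetical check: every span has a closing tag and exactly one named
// anchor, and no other tag appears anywhere in the string.
bool looksWellFormed(const std::string& s) {
    const std::size_t spans = countOf(s, "<span");
    return spans == countOf(s, "</span>") &&
           spans == countOf(s, "<a name='") &&
           spans == countOf(s, "</a>") &&
           countOf(s, "<") == 4 * spans;  // four '<' per span block, no extras
}
```

Running a check like this before splitting lets the parsing loop reject unexpected input early instead of indexing into a list that is shorter than expected.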
