I have a Java string that holds the attributes of an XML tag. It looks like this:

"article-idref="527710" group="no" height="267" href="pc011018.pct" id="pc011018" idref="169419" print-rights="yes" product="wborc" rights="licensed" type="photo" width="322" "

Now I want to remove the article-idref="527710" segment using a regular expression. I came up with the following:

trimedString.replaceAll("\\article-idref=.*?\"","");

but it doesn't seem to work. Could anybody tell me where my regular expression goes wrong? I need the result as a String in my Java class, so HTMLParser probably won't help me much here. Thanks in advance!

Jim Garrison
Kevin
  • It sounds like you've pulled this string out of an HTML file. Why not just use your HTML parser to remove that particular attribute, instead of grabbing it out, regexing it, and stuffing it back in? – Anon. Dec 15 '10 at 21:06
  • @Anon, this is actually an XML tag, and I only need to use it as a string in my Java class, but for presentation purposes I have to get rid of the attribute "article-idref". – Kevin Dec 15 '10 at 21:07
  • @Robert, for XML massaging just use a Transformer and write an XSLT-snippet. – Thorbjørn Ravn Andersen Dec 15 '10 at 21:08
  • I honestly don't think that an attribute string is the best representation to use internally. What do you use it for? And if it has to be that way, wouldn't it be better to remove the attribute in your XML parser before you pull the element out as a string? – Anon. Dec 15 '10 at 21:09
  • @Thorbjorn, it is actually a little more complicated than the problem sounds. I am passing this string to an external API that inserts it into a platform, the OxygenXML IDE. – Kevin Dec 15 '10 at 21:16
  • @Anon, please see my comment above. This string is obtained by converting a StringWriter object, which has already gone through a TransformerFactory step. – Kevin Dec 15 '10 at 21:19
  • @Robert: be careful not to use things like `\w` or `\s` in Java regexes. They only work on 7-bit data, not even on 8-bit data let alone its 21-bit native character set of Unicode. This is a really evil gotcha. – tchrist Dec 16 '10 at 00:17

3 Answers

Try this:

trimedString = trimedString.replaceAll("article-idref=\"[^\"]*\" *","");
thejh

I corrected the regular expression by adding quotes and a word boundary (to prevent false matches). Also, in case you didn't, remember to reassign to your string after the replacement:

trimmedString = trimmedString.replaceAll("\\barticle-idref=\".*?\"", "");

See it working at ideone.
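
For reference, here is a minimal self-contained sketch of the same idea. It is not the code behind the ideone link. The class and method names are made up for illustration, and the pattern combines the word boundary used above with a negated character class (`[^"]*`) and an optional run of trailing whitespace, so that no double space is left behind.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
final class AttributeStripper {
    // \b stops the match from starting in the middle of a longer name
    // such as "subarticle-idref"; [^"]* consumes the value up to the
    // closing quote; \s* also drops whitespace after the attribute.
    private static final Pattern ARTICLE_IDREF =
            Pattern.compile("\\barticle-idref=\"[^\"]*\"\\s*");

    static String stripArticleIdref(String attributes) {
        // Strings are immutable, so the result has to be returned (or reassigned).
        return ARTICLE_IDREF.matcher(attributes).replaceAll("");
    }

    public static void main(String[] args) {
        String attributes = "article-idref=\"527710\" group=\"no\" height=\"267\"";
        attributes = stripArticleIdref(attributes);
        System.out.println(attributes); // prints: group="no" height="267"
    }
}
```

Note that `\b` only guards against a preceding word character. A name such as `my-article-idref` would still match, because `-` is not a word character.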

Also, since this is from an XML document, it might be better to use an XML parser to extract the correct attributes instead of a regular expression, because XML is quite a complex data format to parse correctly. The example in your question is simple enough, but a regular expression could break on a more complex case, such as a document that includes XML comments. This could be an issue if you are reading data from an untrusted source.
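
A rough sketch of the parser route, assuming the whole element is available as a string. The element name `image` and the class and method names are hypothetical, since the question only shows the attributes. It uses the JDK's built-in DOM parser and a `Transformer` to turn the element back into a string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for illustration only.
final class XmlAttributeRemover {
    static String removeAttribute(String elementXml, String attributeName) {
        try {
            // Parse the element into a DOM tree.
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(elementXml)));

            // removeAttribute does nothing if the attribute is absent.
            document.getDocumentElement().removeAttribute(attributeName);

            // Serialize back to a string without the <?xml ...?> declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML: " + elementXml, e);
        }
    }

    public static void main(String[] args) {
        // "image" is an assumed element name; the question does not give one.
        String xml = "<image article-idref=\"527710\" group=\"no\" id=\"pc011018\"/>";
        System.out.println(removeAttribute(xml, "article-idref"));
    }
}
```

The serializer may not keep the remaining attributes in their original order, so this fits best when attribute order does not matter to the consumer.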

Mark Byers
  • Great! An off-topic question: do you think regular expressions are worth the time and effort to study? I am an entry-level programmer specializing in XML and Java. – Kevin Dec 15 '10 at 21:15
  • @Robert: I think it's useful for a professional to understand how all their tools work. If nothing else, it will help you to choose the correct one for the task. – Mark Byers Dec 15 '10 at 21:17

If you are sure the article-idref attribute is always at the beginning, try this:

// removes everything from the beginning up to and including the first run of whitespace
trimedString = trimedString.replaceFirst("^\\S*\\s+","");

Be sure to assign the result to trimedString again, since replaceFirst does not modify the string itself but returns a new string.

Jürgen Steinblock
  • If this is Unicode, which given that it’s XML is almost certainly true, then that won’t work right in Java. Heck, it won’t even work right for the 8-bit repertoires, because the Java people unwisely demoted commonly occurring whitespace characters like NO-BREAK SPACE from its perversely broken notion of JavaWhitespace. You have to use [these workarounds](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261) when you use regexes in Java, even on its native character set of Unicode! Lame and sad, but true. – tchrist Dec 16 '10 at 00:14
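
A small sketch of the whitespace point raised in the comment above, assuming Java 7 or later, where the `Pattern.UNICODE_CHARACTER_CLASS` flag is available (it did not exist when this thread was written). By default `\s` covers only a handful of ASCII whitespace characters; with the flag it follows the Unicode White_Space property. The class name is made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, named for illustration only.
final class UnicodeWhitespaceDemo {
    // Default behaviour: \s is [ \t\n\x0B\f\r] only.
    static final Pattern ASCII_WS = Pattern.compile("\\s");

    // Java 7+: \s follows the Unicode White_Space property.
    static final Pattern UNICODE_WS =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        String noBreakSpace = "\u00A0"; // NO-BREAK SPACE
        System.out.println(ASCII_WS.matcher(noBreakSpace).matches());   // false
        System.out.println(UNICODE_WS.matcher(noBreakSpace).matches()); // true
    }
}
```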