I'm trying to use the following regex in Java, that's supposed to match any lang="2-char-lang-name":

String lang = "lang=\"" + L.detectLang(inputText) +"\"";
shovel.replaceFirst("lang=\"[..]\"", lang);

I know that a single backslash would be interpreted by the regex engine as a literal backslash and not as an escape character (so my code doesn't work), but if I escape the backslash, the " is no longer escaped and I get a syntax error.

In other words, how can I include a " in the regex? "lang=\\"[..]\\"" won't work. I've also tried three backslashes, and that didn't produce any matches either.

I am also aware of the general rule that you don't use regex to parse XML/HTML (and shovel is an XML document). However, all I'm doing is looking for a lang attribute within the first 30 characters of the XML, which I want to replace. Is it really a bad idea to use regex in this case? I don't think using DOM would be any better or more efficient.

Alan Moore
Spectraljump
  • Does this answer your question? [Escaping special characters in Java Regular Expressions](https://stackoverflow.com/questions/10664434/escaping-special-characters-in-java-regular-expressions) – ccpizza Sep 06 '20 at 14:55

2 Answers

Three backslashes would be correct (\\ + \" becomes \ + " = \"). (Update: Actually, it turns out that isn't even necessary. A single backslash also works.) The problem is your use of [..]; the [] symbols define a character class, meaning "any one of the characters in here", and inside a class the . is literal. So [..] matches exactly one literal . character, not any two characters.

Drop the [] and you should be getting what you want:

String ab = "foo=\"bar\" lang=\"AB\"";
String regex = "lang=\\\"..\\\"";
String cd = ab.replaceFirst(regex, "lang=\"CD\"");
System.out.println(cd);

Output:

foo="bar" lang="CD"
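To see the difference between the character class and the bare dots, here is a minimal sketch. The class name CharClassDemo and the helper matches are made up for illustration, and the [a-z]{2} variant is just one possible way to tighten the pattern, assuming the language codes are two lowercase letters.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: true if the WHOLE input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // [..] is a character class that contains only a literal dot,
        // so it matches exactly one '.' character and nothing else.
        System.out.println(matches("[..]", "."));   // true
        System.out.println(matches("[..]", "A"));   // false
        System.out.println(matches("[..]", "AB"));  // false

        // Outside a class, each . is a wildcard, so .. matches any two characters.
        System.out.println(matches("..", "AB"));    // true

        // Assumption: language codes are two lowercase letters.
        System.out.println(matches("lang=\"[a-z]{2}\"", "lang=\"en\"")); // true
    }
}
```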
Dan Tao
  • ah yes, I hadn't really parsed what he was doing there with the `[..]`. I think that inside a `[]`, the `.` is interpreted literally, so `[..]` means "any single character which is either a `.` or a `.`". – OpenSauce Jun 18 '11 at 19:59
  • Dang... You're right, 'guess my regex has gotten way too rusty. Thank you. – Spectraljump Jun 18 '11 at 20:21

Have you tried it with a single backslash? The output of

public static void main(String[] args) {
  String inputString = "<xml lang=\"the Queen's English\">";
  System.out.println(inputString.replaceFirst("lang=\"[^\"]*\"", "lang=\"American\"" ));
}

is

<xml lang="American">

which, if I'm reading you correctly, is what you want.

EDIT to add: the reason a single backslash works is that it's not actually part of the string; it's just part of the Java syntax for writing the string literal. The length of the string "\"" is 1, not 2, so the method replaceFirst just sees a string containing a " (with no backslash). The same layering is why \s (the whitespace character class in a regex) has to be written \\s in a Java string literal: there the regex engine does need to see a backslash, so the literal must contain an escaped one.
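The two escaping layers can be checked directly. This is only a sketch; the class name EscapeLevelsDemo and the helper bothFormsMatch are invented for the example.

```java
public class EscapeLevelsDemo {
    // Hypothetical helper: true if the input matches both the
    // single-backslash and the triple-backslash spelling of the regex.
    static boolean bothFormsMatch(String input) {
        boolean single = input.matches("lang=\"..\"");     // regex text: lang=".."
        boolean triple = input.matches("lang=\\\"..\\\""); // regex text: lang=\"..\"
        return single && triple;
    }

    public static void main(String[] args) {
        // The backslash in "\"" belongs to the Java literal, not to the string.
        System.out.println("\"".length());    // 1 (just the quote)
        System.out.println("\\\"".length());  // 2 (a backslash and a quote)
        System.out.println("\\s".length());   // 2 (a backslash and an s)

        // The regex engine receives \s and treats it as "whitespace".
        System.out.println("a b".replaceFirst("\\s", "_")); // a_b

        // Both spellings match, because the regex \" also means a literal quote.
        System.out.println(bothFormsMatch("lang=\"en\"")); // true
    }
}
```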

On the wisdom of using regex: this should be fine, if you're sure about the format of the files you're processing. If the files might include a commented-out header complete with lang spec above the real header, you could be in trouble!
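If you want to guard against a stray match further down the file, one option is to accept the match only when it starts near the beginning, as the question describes. The following is a rough sketch, not a robust XML solution: the class LangAttributeSwap, the method replaceLang, and the maxOffset parameter are all hypothetical names. It splices the new value in with substring rather than replaceFirst, because replaceFirst gives $ and \ a special meaning in the replacement string (Matcher.quoteReplacement is the usual way around that if you stay with replaceFirst).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LangAttributeSwap {
    // Matches lang="...", with any run of non-quote characters as the value.
    private static final Pattern LANG = Pattern.compile("lang=\"[^\"]*\"");

    // Hypothetical helper: replaces the first lang attribute only if it
    // starts before maxOffset; otherwise returns the input unchanged.
    static String replaceLang(String xml, String newLang, int maxOffset) {
        Matcher m = LANG.matcher(xml);
        if (!m.find() || m.start() >= maxOffset) {
            return xml;
        }
        // Plain concatenation, so newLang is never interpreted as a
        // regex replacement string.
        return xml.substring(0, m.start())
                + "lang=\"" + newLang + "\""
                + xml.substring(m.end());
    }

    public static void main(String[] args) {
        System.out.println(replaceLang("<xml lang=\"en\">", "fr", 30));
    }
}
```

This still would not help with a commented-out header that appears before the real one, since that match would come first.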

OpenSauce
  • No, it's well formed xml with standards and whatnot. Thanks for pointing out that a single slash would work. I thought it wouldn't since it's also a regex special character. – Spectraljump Jun 18 '11 at 20:22
  • The regex does not see the single slash. The single slash is part of how strings are written in java, so "\"" is a string of length 1. So is "\\", which is a string containing a slash, which as you say is a regex metacharacter. "\\\\" is a string of length 2. If you pass this as a regex, the first slash escapes the second in this case. – chrishmorris May 17 '20 at 13:12