I'm doing some text editing in Java 8, and the files I want to edit automatically often contain leftover formatting info, which always looks like this:

  1. \
  2. A set text - I know which texts are used and I'll provide that info to regex
  3. A number (2 to 4 digits)
  4. Maybe a single blank (which should be replaced too) or nothing

I want to replace all of them with nothing (so: ""). I could probably read the text char by char to look for them, but I want to try the "cleaner"-looking regex first. But: I've never really worked with regex, apart from copying the occasional snippet from Stack Exchange.

Examples:

  • \fs14 (font size 14)
  • \ri240 (right indent)
  • \lang1033 (applies a language to a character)

There are also e.g. \par (new paragraph) or \i (italic start) and \i0 (italic end), but I can easily replace these with e.g. originalString.replace("\\par",""). This obviously won't work if I don't know how many and which digits are used, like in the examples above.

I know that the Java code for replacing text using a pattern is:

String newString = originalString.replaceAll(pattern,"");

The needed pattern to address the backslash and the text for the examples above probably looks like this:

(\\\\fs|\\\\ri|\\\\lang)

... but how do I incorporate the number and the blank (if there's one)?
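For reference, here is a minimal Java 8 sketch of how the pieces could fit together, using the pattern that was suggested in the comments. The class name `StripFormatting`, the method `strip` and the sample string are made up for illustration; the keyword list (fs, ri, lang) is taken from the examples above and would need to be extended for other codes.

```java
import java.util.regex.Pattern;

class StripFormatting {
    // In a Java string literal, "\\\\" is the regex \\ , which matches ONE literal backslash.
    // (?:fs|ri|lang)  one of the known keywords (non-capturing group)
    // [0-9]{2,4}      two to four digits
    // " ?"            an optional single space
    static final Pattern FORMAT_CODE =
            Pattern.compile("\\\\(?:fs|ri|lang)[0-9]{2,4} ?");

    // Removes every formatting code that matches the pattern above.
    static String strip(String text) {
        return FORMAT_CODE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input line: \fs14 Hello \ri240world\lang1033 !
        String sample = "\\fs14 Hello \\ri240world\\lang1033 !";
        System.out.println(strip(sample)); // prints: Hello world!
    }
}
```

Compiling the pattern once and reusing it should behave the same as calling `originalString.replaceAll(pattern, "")` on every line; it only avoids recompiling the regex each time.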

Neph
  • Your question isn't totally clear to me, but if it is to others, that's fine. What I really wanted to say in terms of advice is that the answer to old problems shouldn't be new problems. Using regex replaceAll on a source file is scary business and likely to come with its own issues. – ControlAltDel Jun 29 '20 at 13:58
  • *"but how do I incorporate the number and the blank (if there's one)?"* --- A number (2 to 4 digits): `"[0-9]{2,4}"` --- Maybe a single blank: `" ?"` --- Combined, that means: `"\\\\(?:fs|ri|lang)[0-9]{2,4} ?"` --- Since something as simple as an optional space is very, VERY basic regex, it means you haven't learned regex. Please do so now, before you ever again ask any question about regex here. StackOverflow is not a teaching site, and is not a substitute for learning the programming languages yourself, and regex is a programming language. – Andreas Jun 29 '20 at 14:00
  • @ControlAltDel What isn't clear? I can edit that in or answer it in the comments. ;) `the answer to old problems shouldn't be new problems` - sorry, what do you mean? I'm not going to use `replaceAll` on the file directly, I'm first reading the text line by line, parse it to objects, then do some extra replacing (e.g. for special characters) and only then do the replacing of the formatting infos, basically right before I write the text to file again. – Neph Jun 29 '20 at 14:03
  • Why the downvotes? If something isn't clear, please write that in the comments and I'll fix it. – Neph Jun 29 '20 at 14:08
  • Have you tried `String pattern = "\\\\(?:(?:fs|ri|lang)\\d{2,4}|par|i0?)\\b";` ? – anubhava Jun 29 '20 at 14:12
  • @Andreas Thanks, I'm going to try it. What is the `?:` in the beginning for? And yes, you're right, I haven't learned regex, I'm more or less completely new to it. I looked at the [wiki](https://stackoverflow.com/tags/regex/info) and that's how I found the info for `|` but it doesn't mention anything about optional characters and the other "teaching" websites I've found usually start off with the already "heavy" stuff. – Neph Jun 29 '20 at 14:14
  • @Neph `?:` is a non capturing group in regex. https://www.regular-expressions.info/brackets.html – user3783243 Jun 29 '20 at 14:31
  • This seems like a good, slow-starting regex learning guide: [**Regular Expressions | A Complete Beginners Tutorial**](https://blog.usejournal.com/regular-expressions-a-complete-beginners-tutorial-c7327b9fd8eb) – Andreas Jun 29 '20 at 22:22
  • @anubhava I down-voted for lack of research, since OP obviously didn't even know about quantifiers, one of the very first things you'd learn in any regex tutorial. It's like asking us to write an `if` statement for you, because you haven't yet learned about `if` statements, which are covered very early on in any Java tutorial. StackOverflow is not a teaching site, you have to do your own learning elsewhere, before asking questions here. Not knowing about regex quantifiers (equivalent to `if` / `while` / `for` statements) shows a lack of trying to learn the language. Hence down-vote. – Andreas Jun 29 '20 at 22:29
  • @anubhava Thanks, I tested it on the website totok posted but it doesn't match any of the examples. – Neph Jun 30 '20 at 08:56
  • @Neph: You can see it here it works perfectly fine: https://regex101.com/r/iChPjC/5 – anubhava Jun 30 '20 at 09:26
  • @anubhava Sorry, my mistake, I didn't un-escape the other backslashes too. I just tested it with my code and the initial replacing works but it doesn't get rid of the extra space at the end. There are a lot more strings without numbers (apart from `\par` and `\i`) that I have to replace but I already use a normal `replace` with those, so I don't need them to be part of the regex. – Neph Jun 30 '20 at 10:44
  • @anubhava Andreas' suggestion works and I'm currently testing totok's but I'd love to test all 3 (including yours) to see which one is the fastest. – Neph Jun 30 '20 at 10:52
  • @Andreas Want to post your regex code as an answer? – Neph Jul 01 '20 at 10:14

1 Answer

I'm not sure I fully understood your problem, so here is a first attempt at a solution:

\\[a-zA-Z0-9]*\s?

Test it here.

Starting from this, what do we have to change to match your expectations?

EDIT after your comment:

This one can match the words you like, followed only by 2 to 4 digits (or none), and if there is only a backslash, it also matches the blank character after.

(\\([\bfs\b|\blang\b|\bri\b]*\d{2,4}|\s))

Test it here.
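As a hedged aside on the square brackets: in most regex flavours, including Java's, `[...]` is a character class that matches single characters, so it cannot be used to list whole words; a group with alternation `(?:...|...)` is the usual way to do that. The sketch below compares the two. The class name `ClassVersusGroup`, the simplified character class `[fslangri]` and the test string `\sl240` (standing in for a code that is not on the keyword list) are assumptions for illustration, not the exact pattern above.

```java
import java.util.regex.Pattern;

class ClassVersusGroup {
    // Character class: any run of the letters f, s, l, a, n, g, r, i, then 2 to 4 digits.
    static final Pattern CHAR_CLASS = Pattern.compile("\\\\[fslangri]*\\d{2,4}");

    // Group with alternation: exactly one of the words fs, lang or ri, then 2 to 4 digits.
    static final Pattern GROUP = Pattern.compile("\\\\(?:fs|lang|ri)\\d{2,4}");

    static boolean classMatches(String text) {
        return CHAR_CLASS.matcher(text).find();
    }

    static boolean groupMatches(String text) {
        return GROUP.matcher(text).find();
    }

    public static void main(String[] args) {
        String unlisted = "\\sl240"; // the text \sl240, not one of the wanted keywords
        System.out.println(classMatches(unlisted)); // prints: true  ("sl" is made of listed letters)
        System.out.println(groupMatches(unlisted)); // prints: false ("sl" is not fs, lang or ri)
        System.out.println(groupMatches("\\lang1033")); // prints: true
    }
}
```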

totok
  • What isn't clear? I don't think this'll work though because from what I've read (yes, before asking the question), a `[]` matches any of the characters inside but I only want to replace certain words, e.g. `lang`, or `fs`. – Neph Jun 29 '20 at 14:30
  • Thanks, I'll test it. There's always only a single backslash. I thought you have to escape it twice for regex, so `\\\\`? There are also always 2 to 4 digits, if there are none, it should NOT match. – Neph Jun 30 '20 at 08:55
  • In pure regex, you only have to escape it once, but when you are using the regex as a string in a programming language, you may have to escape each regex \ one more time (not only the first \). This regex already requires 2 to 4 digits and doesn't match 0 or 1 digits. But if you have more than 4, it will match the first 4... You can use a regex lookahead to counter this problem, if it is relevant (refer to [this](https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups)). – totok Jun 30 '20 at 09:06
  • If there are more than 4 digits, it shouldn't touch the extra ones, yes. I tested it in my code but with 4 backslashes for each one in your suggestion (`pattern = "(\\\\([\\\\bfs\\\\b|\\\\blang\\\\b|\\\\bri\\\\b]*\\\\d{2,4}|\\\\s))";`) it doesn't replace anything and with just 2 for everything but the first (`pattern = "(\\\\([\\bfs\\b|\\blang\\b|\\bri\\b]*\\d{2,4}|\\s))"; `) I of course get an `Illegal/unsupported escape sequence near index 6` exception for the first `\b`. – Neph Jun 30 '20 at 11:00
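A rough sketch of the two behaviours discussed in these comments for codes with more than 4 digits, assuming the group-based pattern from the question's comments rather than the character-class version. The names `DigitBoundary`, `stripLenient` and `stripStrict` are hypothetical. The lenient pattern removes the keyword plus the first 4 digits and leaves the extra digits in place; the strict pattern adds the negative lookahead `(?!\d)` so that a code followed by a further digit is not matched at all. Which of the two is wanted depends on the input files.

```java
import java.util.regex.Pattern;

class DigitBoundary {
    // Takes up to 4 digits and ignores whatever follows them.
    static final Pattern LENIENT =
            Pattern.compile("\\\\(?:fs|ri|lang)\\d{2,4} ?");

    // (?!\\d) is a negative lookahead: the digits must NOT be followed by another digit.
    static final Pattern STRICT =
            Pattern.compile("\\\\(?:fs|ri|lang)\\d{2,4}(?!\\d) ?");

    static String stripLenient(String text) {
        return LENIENT.matcher(text).replaceAll("");
    }

    static String stripStrict(String text) {
        return STRICT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String fiveDigits = "\\lang10334x"; // the text \lang10334x
        System.out.println(stripLenient(fiveDigits)); // prints: 4x
        System.out.println(stripStrict(fiveDigits));  // prints: \lang10334x
        System.out.println(stripStrict("\\fs14 text")); // prints: text
    }
}
```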