I'm trying to create a regex that matches a string made of a prefix, a double underscore __, one or more parts separated by single underscores _, and then another double underscore __ followed by a trailing string. I want to extract the part before the final __<string>.

The result should be the part of the string before the second __.

Example 1: UK__SATHISH_KUMAR__LONDON should result in UK__SATHISH_KUMAR

Example 2: UK__SATHISH_KUMAR_MALE__LONDON should result in UK__SATHISH_KUMAR_MALE

public static final String RULE_FILE_NAME_PATTERN =
    "(([a-zA-Z]+)__(([a-zA-Z]+_[a-zA-Z]+_[a-zA-Z]+_[a-zA-Z]+)|([a-zA-Z]+_[a-zA-Z]+_[a-zA-Z]+)|([a-zA-Z]+_[a-zA-Z]+)|([a-zA-Z]+)))(__[\\w]+)*";

This pattern works, but SonarQube flags it because it's too long. Can someone help me find a shorter regex?

Stephen P
  • [Turn this warning off](https://stackoverflow.com/questions/10971968/turning-sonar-off-for-certain-code), sonarqube is doing a bad job here. – Wiktor Stribiżew Jun 03 '21 at 00:01
  • Hmmm, it appears the requirement has changed -- OP originally specified extracting the substring `before the second __`, thus presuming there could be additional `__`, but the latest edit revised it to extracting `before the final __`, which would be a much simpler match (e.g. using a greedy `.*` match). – Leo C Jun 03 '21 at 01:43
  • @Stephen P how do I approve the suggested edits? – Java Evangelist Jun 03 '21 at 09:23

3 Answers

You can fold the repeated [A-Za-z]+_[A-Za-z]+ alternatives into a single repeated non-capturing group, (?:[A-Za-z]+(?:_[A-Za-z]+)?)*, and make the trailing __... part optional to suit your specific requirement, as shown below (in Scala):

val p = """([A-Za-z]+__(?:[A-Za-z]+(?:_[A-Za-z]+)?)*)(?:__.*)?""".r

val strings = List(
  "uk__john_doe__london__edmonton",
  "us__zoe_smith_female__new_york__manhattan",
  "au__dave_clark_male__sidney",
  "fr__alex__paris",
  "jp__yumiko",
  "no_double_underscore"
)

strings.collect{ case p(x) => x }
// res1: List[String] = List(
//   "uk__john_doe",
//   "us__zoe_smith_female",
//   "au__dave_clark_male",
//   "fr__alex",
//   "jp__yumiko"
// )

Note that the regex can be simplified using lazy matches, as below, if the string doesn't have to strictly follow the letters_letters sub-pattern:

val p = """(.*?__.*?)(?:__.*)?""".r
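Since the question is in Java, here is a minimal Java sketch of the same idea. It is not a literal port: it assumes the middle part is one or more runs of ASCII letters separated by single underscores, and writes that as [A-Za-z]+(?:_[A-Za-z]+)*. The class and method names (RulePrefixExtractor, extractPrefix) are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the question.
class RulePrefixExtractor {

    // Group 1: letters, "__", then letter runs joined by single "_".
    // The non-capturing tail "(?:__.*)?" makes the "__<string>" suffix optional.
    private static final Pattern RULE_PREFIX = Pattern.compile(
        "([A-Za-z]+__[A-Za-z]+(?:_[A-Za-z]+)*)(?:__.*)?");

    static Optional<String> extractPrefix(String input) {
        Matcher m = RULE_PREFIX.matcher(input);
        // matches() requires the whole input to fit the pattern.
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "UK__SATHISH_KUMAR__LONDON",
            "UK__SATHISH_KUMAR_MALE__LONDON",
            "jp__yumiko",
            "no_double_underscore"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + extractPrefix(s).orElse("(no match)"));
        }
    }
}
```

Because the character class is [A-Za-z], inputs containing digits (such as STRING1__STRING2) would not match; widening the class to [A-Za-z0-9] is one possible adjustment if that is needed.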
Leo C
  • Thanks @Leo C. The pattern works for all the cases except the last one: 1) it works for fr__alex__paris to fr__alex, but 2) it doesn't work for fr__alex to fr__alex – Java Evangelist Jun 03 '21 at 07:20
  • To be precise, these are the outputs I expect:
      1) "STRING1__STRING2" = "STRING1__STRING2"
      2) "STRING1__STRING2__STRING3_STRING4_STRING5" = "STRING1__STRING2"
      3) "STRING1__STRING2_STRING3__STRING4_STRING5" = "STRING1__STRING2_STRING3"
      4) "STRING1__STRING2_STRING3_STRING4" = "STRING1__STRING2_STRING3_STRING4"
      5) "STRING1__STRING2_STRING3_STRING4__STRING5" = "STRING1__STRING2_STRING3_STRING4"
    – Java Evangelist Jun 03 '21 at 09:40
  • Based on your example, you can simply make the last substring with leading `__` optional. Please see my revised answer. – Leo C Jun 03 '21 at 16:33
0
  1. Sorry, the UK was just an example; it can be any characters, e.g. ENGLAND__SATHISH_KUMAR__LONDON

Still not clear

Assuming the prefix can be any word characters (at least one), you can use

(\w+__.*)__

Your question needs to be more precise, because your expected result includes the characters UK, which come before what you describe as the first delimiter __.

Assuming you always want exactly 2 characters before it, you can use

(\w{2}__.*)__
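As a rough illustration of how the first of these patterns behaves in Java, here is a small sketch. The class and method names are hypothetical, and it uses Matcher.find() so the part after the last __ does not have to be matched. Two things are worth noting: \w also matches underscores, and .* is greedy, so group 1 runs up to the last __ in the input, and the input must contain at least two __ to match at all.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class WordPrefixExtractor {

    // In a Java string literal the backslash in \w has to be doubled.
    private static final Pattern PREFIX = Pattern.compile("(\\w+__.*)__");

    static Optional<String> extractPrefix(String input) {
        Matcher m = PREFIX.matcher(input);
        // find() looks for the pattern anywhere in the input, so the text
        // after the final "__" is simply left unmatched.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "ENGLAND__SATHISH_KUMAR__LONDON",
            "uk__john_doe__london__edmonton",
            "UK__SATHISH_KUMAR"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + extractPrefix(s).orElse("(no match)"));
        }
    }
}
```

With this pattern an input that has no trailing __<string> part, such as UK__SATHISH_KUMAR, produces no match, so that case would need separate handling.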
Dri372
0

If you just need all the text before the final __ you can just use:

(.*)__

It's a greedy match, so group 1 captures everything up to the last __.
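A minimal Java sketch of this approach follows; the class and method names are made up for illustration. It assumes the input always ends with a __<string> part: when the input contains only one __, the same pattern captures just the text before that one.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class GreedyPrefixExtractor {

    // Greedy ".*" backtracks from the end of the input to the last "__".
    private static final Pattern BEFORE_LAST_DELIMITER = Pattern.compile("(.*)__");

    static Optional<String> beforeLastDelimiter(String input) {
        Matcher m = BEFORE_LAST_DELIMITER.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "UK__SATHISH_KUMAR__LONDON",
            "UK__SATHISH_KUMAR_MALE__LONDON",
            "UK__SATHISH_KUMAR",      // only one "__": yields just "UK"
            "no_double_underscore"    // no "__" at all: no match
        };
        for (String s : samples) {
            System.out.println(s + " -> " + beforeLastDelimiter(s).orElse("(no match)"));
        }
    }
}
```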

Aljodomo