
Let's say I have something like:

Sample 1: Your number is <foo>12345</foo> and your code is <foo>29939</foo>. 
Sample 2: Your number is <foo attr="x">12345</foo> and your code is <foo>29939</foo>. 

I would like to break this string into an array of strings.

Something like the following for Sample 1:

array[0] = Your number is
array[1] = 12345
array[2] = and your code is
array[3] = 29939

Sample 2:

array[0] = Your number is
array[1] = x|12345 (adding attr value to it)
array[2] = and your code is
array[3] = 29939

I am looking for <foo>, with or without an attribute, in the string and need to break the string accordingly.

I found an easy way to replace each <foo> element with some value.

Example: matcher.replaceAll("bar") which resulted in something like:

Your number is bar and your code is bar

What I would like instead is to break the string into an array or list wherever the tag <foo> appears in the string value.

Luiggi Mendoza
serverfaces
  • String.split is what you need – Juned Ahsan Nov 02 '15 at 22:35
  • Are nested tags possible? Like `abc def123xyz`? If so how should they be handled? – Pshemo Nov 02 '15 at 22:36
  • @JunedAhsan I don't think so. OP needs something more like parsing the contents of this half-XML string. – Luiggi Mendoza Nov 02 '15 at 22:36
  • Best to use an XML parser. If you did not need to get attribute values from within the tags, then you could have easily done string.split("|") – AbtPst Nov 02 '15 at 22:38
  • No nested tags... I am only looking for something that is <foo> and may have a single attribute, let's call it "attr", and that is also case insensitive, as I'm getting XML in the middle of a regular string... I found ways to make it case insensitive etc. – serverfaces Nov 02 '15 at 22:40
  • You may try wrapping the current string with `` and use an XML parser. – Luiggi Mendoza Nov 02 '15 at 22:43
  • In general, you will not be successful trying to parse arbitrary xml or html using a regex. You can succeed if you have a limited, well-defined set of cases to recognize. See [the canonical answer](http://stackoverflow.com/a/1732454/17300) – Stephen P Nov 02 '15 at 22:49
  • The problem is that the string is not pure XML. If it were, I wouldn't use regex for it; I would rather use XML parsing technologies to parse it. What I'm trying to do is that when I see <foo>, I need to process those string values differently. The tag may also have the attribute, though it is optional... – serverfaces Nov 02 '15 at 22:52
  • I had something like: String FOO_TAG_PATTERN = "(?i)]*)>(.+?)"; String FOO_TAG_ATTR = "\\s*(?i) attr\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))"; – serverfaces Nov 02 '15 at 22:53
  • You can make it "pure" XML by wrapping it in an arbitrary root level tag, if it is an otherwise well-formed fragment... I've had success before by taking the "sample" text and wrapping a tag around it — `sample text` becomes `sample text` then parse it with an XML parser. – Stephen P Nov 02 '15 at 22:58
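The wrap-and-parse idea from the comments above could look roughly like the sketch below. It is only an illustration under several assumptions: the root tag name `root`, the class name `FooXmlSplitter`, and the method name `split` are all made up here, and the input must be a well-formed fragment once wrapped (a stray `&` or `<` in the text would make the parser fail). XML is also case sensitive, so this does not cover `<FOO>` versus `<foo>`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FooXmlSplitter {

    // Hypothetical helper: wraps the text in an arbitrary root element,
    // parses it, and collects the top-level text and element contents.
    static List<String> split(String text) {
        String wrapped = "<root>" + text + "</root>"; // "root" is an arbitrary name
        NodeList children;
        try {
            children = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)))
                    .getDocumentElement()
                    .getChildNodes();
        } catch (Exception e) {
            // Thrown when the wrapped text is not well-formed XML.
            throw new IllegalArgumentException("Not a well-formed fragment", e);
        }

        List<String> parts = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            String value = node.getTextContent().trim();
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                Element element = (Element) node;
                // Prefix the attribute value, mirroring the "x|12345" format in the question.
                if (element.hasAttribute("attr")) {
                    value = element.getAttribute("attr") + "|" + value;
                }
            }
            if (!value.isEmpty()) {
                parts.add(value); // skip whitespace-only text between tags
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String sample2 = "Your number is <foo attr=\"x\">12345</foo> and your code is <foo>29939</foo>. ";
        System.out.println(split(sample2));
    }
}
```

Note that the trailing period after the last tag is ordinary text to the parser, so it comes back as a separate final element that would need to be dropped if it is not wanted.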

1 Answer


Assuming the format of your text doesn't include any nested tags, you should be fine with something like:

String[] arr = sentence
        .trim()
        .replaceAll("<foo\\s+attr=\"([^\"]+)\">", "<foo>$1|")
        .replaceAll("^<foo>|</foo>\\.?$","")
        .split("\\s?</?foo>\\s?");

which will:

  1. trim() removes whitespace at the start and end of your text
  2. replaceAll("<foo\\s+attr=\"([^\"]+)\">", "<foo>$1|") replaces each <foo attr="data"> with <foo>data|, which means it changes

    Your number is <foo attr="x">12345</foo> and your code is <foo>29939</foo>.
    

    into

    Your number is <foo>x|12345</foo> and your code is <foo>29939</foo>.
    //                  ^^^^^^^ 
    

    so now only plain <foo> and </foo> tags remain, and we can simply split the string on them

  3. replaceAll("^<foo>|</foo>\\.?$","") prepares for the split on <foo> or </foo>: it removes a tag at the very start or end of the string (together with an optional trailing period) to avoid empty or stray elements in the resulting array

  4. split("\\s?</?foo>\\s?") splits on <foo> or </foo>, including the optional whitespace surrounding them.
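To try the steps above end to end, here is a small self-contained sketch that wraps the same chain in a method and runs it on both samples from the question. The class name `FooSplitter` and the method name `splitOnFoo` are made up for illustration; the regular expressions are unchanged.

```java
import java.util.Arrays;

public class FooSplitter {

    // Hypothetical helper name; the body is the chain explained above.
    static String[] splitOnFoo(String sentence) {
        return sentence
                .trim()                                               // step 1
                .replaceAll("<foo\\s+attr=\"([^\"]+)\">", "<foo>$1|") // step 2: move attr value into the text
                .replaceAll("^<foo>|</foo>\\.?$", "")                 // step 3: drop a tag at the very start or end
                .split("\\s?</?foo>\\s?");                            // step 4: split on the remaining tags
    }

    public static void main(String[] args) {
        String sample1 = "Your number is <foo>12345</foo> and your code is <foo>29939</foo>. ";
        String sample2 = "Your number is <foo attr=\"x\">12345</foo> and your code is <foo>29939</foo>. ";

        // prints [Your number is, 12345, and your code is, 29939]
        System.out.println(Arrays.toString(splitOnFoo(sample1)));
        // prints [Your number is, x|12345, and your code is, 29939]
        System.out.println(Arrays.toString(splitOnFoo(sample2)));
    }
}
```

As written, the patterns are case sensitive and only recognize a double-quoted attr attribute. If case-insensitive matching is needed, prefixing each of the three patterns with the (?i) flag should be enough.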
Pshemo