I found a nice solution here, but the regex splits the string into an empty string ("") plus the two pieces I actually need.

String Result = "<ahref=https://blabla.com/Securities_regulation_in_the_United_States>Securities regulation in the United States</a> - Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.";

String[] Arr = Result.split("<[^>]*>");
for (String elem : Arr) {
    // println instead of printf: elem is data, not a format string
    System.out.println(elem);
}

The result is:

Arr[0]= ""
Arr[1]= Securities regulation in the United States
Arr[2]= Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.

The Arr[1] and Arr[2] splits are fine; I just can't get rid of Arr[0].

Pshemo
gb051

2 Answers


Instead of splitting on the tags, you can invert the approach and capture the text between them with a regex like this:

(?s)(?:^|>)(.*?)(?:<|$)

Code:

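import java.util.regex.Matcher;
import java.util.regex.Pattern;

String line = "<ahref=https://blabla.com/Securities_regulation_in_the_United_States>Securities regulation in the United States</a> - Securities regulation in the United States is the field of U.S. law that covers transactions and other dealings with securities.";

// Capture the text between '>' (or the start of input) and '<' (or the end of input).
Pattern pattern = Pattern.compile("(?s)(?:^|>)(.*?)(?:<|$)");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    // The first capture is empty when the input starts with a tag, so skip empty captures.
    if (!matcher.group(1).isEmpty()) {
        System.out.println("group 1: " + matcher.group(1));
    }
}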
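If you need the pieces as a collection rather than printed output, the same pattern can be wrapped in a small helper. This is only a sketch: the class name `TagTextExtractor`, the method `extract`, and the shortened sample input are made up for illustration. Empty captures are skipped on the assumption that you never want them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {
    // Same pattern as above: text between '>' (or start of input) and '<' (or end of input).
    private static final Pattern BETWEEN_TAGS = Pattern.compile("(?s)(?:^|>)(.*?)(?:<|$)");

    // Hypothetical helper: returns every non-empty piece of text found between tags.
    static List<String> extract(String input) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = BETWEEN_TAGS.matcher(input);
        while (matcher.find()) {
            String text = matcher.group(1);
            // A tag at the very start of the input produces an empty capture; drop it.
            if (!text.isEmpty()) {
                parts.add(text);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the string in the question.
        String line = "<b>first</b> - second";
        for (String part : extract(line)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Note that the leading " - " of the second piece is still part of the capture; trimming it is a separate step.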
Federico Piazza

You can't avoid that empty string with split alone: your input starts with a delimiter match, and because that match is not zero-length, split emits the empty substring that precedes it.

You could remove the match at the start of your input first, and then split on the remaining matches, like this:

String[] Arr = Result.replaceFirst("^<[^>]+>", "").split("<[^>]+>");
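To see the difference side by side, here is a small runnable sketch. The class name `LeadingTagSplit`, the method `splitOnTags`, and the shortened sample input are invented for this example.

```java
import java.util.Arrays;

class LeadingTagSplit {
    // Hypothetical helper: strip one tag at the very start, then split on the remaining tags.
    static String[] splitOnTags(String input) {
        return input.replaceFirst("^<[^>]+>", "").split("<[^>]+>");
    }

    public static void main(String[] args) {
        // Shortened stand-in for the string in the question.
        String input = "<b>first</b> - second";

        // Plain split: the tag at index 0 is a delimiter, so the first element is "".
        String[] plain = input.split("<[^>]+>");
        System.out.println(plain.length + " elements: " + Arrays.toString(plain));

        // With the leading tag removed first, only the two wanted pieces remain.
        String[] cleaned = splitOnTags(input);
        System.out.println(cleaned.length + " elements: " + Arrays.toString(cleaned));
    }
}
```

This only handles a single tag at the start. If the input can begin with several tags in a row, the split still produces a leading empty string and you would need to filter it out afterwards.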

But generally you should avoid using regex on HTML/XML. Try a parser such as Jsoup instead.

Pshemo
  • How can I remove the "-" in the second sentence? – gb051 Aug 14 '15 at 18:05
  • You could post-process the results and strip the `-` at the start of each one. You could also add the `-` to the delimiter you split on, like `split("<[^>]+>(\\s*-\\s*)?")`. – Pshemo Aug 14 '15 at 18:07
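
Putting the suggestion from the last comment together with the leading-tag removal, a minimal sketch could look like the following. The class name `DashDelimiterSplit`, the method `splitOnTagsAndDash`, and the shortened sample input are made up for illustration.

```java
import java.util.Arrays;

class DashDelimiterSplit {
    // Hypothetical helper: a tag, optionally followed by a dash with surrounding
    // whitespace, is treated as one delimiter.
    static String[] splitOnTagsAndDash(String input) {
        return input
                .replaceFirst("^<[^>]+>", "")       // drop the tag at the very start
                .split("<[^>]+>(\\s*-\\s*)?");      // split on tags, swallowing a following " - "
    }

    public static void main(String[] args) {
        // Shortened stand-in for the string in the question.
        String input = "<b>first</b> - second";
        System.out.println(Arrays.toString(splitOnTagsAndDash(input)));
    }
}
```

Be aware that this also swallows a dash that legitimately starts the text after a tag (a negative number, for instance), so it is only safe when the dash is known to be a separator.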