-2

I am trying to extract the sentence from this string:

str="<a href=\"https://blabla.com/truck\">truck</a> A wheeled, powered motor vehicle used for transportation."

Desired result:

truck A wheeled, powered motor vehicle used for transportation.

I can't find a way to extract the sentence properly using regex; every time, something is missing.

Edit: the desired result is the word that comes before the "</a>" tag, followed by the rest of the sentence right after it (both the word and the sentence are arbitrary).

Pshemo
gb051
  • Can you show what you've tried so far? Can you extract `truck` and whatever comes after the closing tag, and concatenate those together? – Deja Vu Aug 04 '15 at 00:17

1 Answer

2

In this case, simply removing the text between < and > should do the trick:

String str="<a href=\"https://blabla.com/truck\">truck</a> A wheeled, powered motor vehicle used for transportation.";
System.out.println(str.replaceAll("<[^>]*>", ""));
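If the input always has the shape described in the question (one link followed by a sentence), the two parts can also be captured explicitly and concatenated. This is only a minimal sketch under that assumption; the class name `AnchorSentenceExtractor` and the method `extract` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorSentenceExtractor {
    // Group 1: the text inside the <a> element; group 2: everything after </a>.
    private static final Pattern LINK_THEN_TEXT =
            Pattern.compile("<a\\b[^>]*>([^<]*)</a>\\s*(.*)");

    // Returns "word sentence", or null when the input does not have the expected shape.
    static String extract(String html) {
        Matcher m = LINK_THEN_TEXT.matcher(html);
        if (!m.find()) {
            return null;
        }
        return m.group(1).trim() + " " + m.group(2).trim();
    }

    public static void main(String[] args) {
        String str = "<a href=\"https://blabla.com/truck\">truck</a> A wheeled, powered motor vehicle used for transportation.";
        System.out.println(extract(str)); // truck A wheeled, powered motor vehicle used for transportation.
    }
}
```

Unlike the blanket removal, this returns null for input that does not start with a link, which makes unexpected input easier to notice.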

But in general, avoid using regex to parse HTML. There are many potential problems with it.
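As one illustration of such a problem, here is a small sketch using the same `<[^>]*>` pattern: a `>` inside an attribute value ends the match too early and leaves debris behind. The input string and the class name `RegexHtmlPitfall` are invented for this example.

```java
class RegexHtmlPitfall {
    // Same pattern as in the answer: drop everything from '<' to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The '>' inside the title attribute is treated as the end of the tag.
        String tricky = "<a title=\"a > b\">link</a> text";
        System.out.println(stripTags(tricky)); // prints:  b">link text
    }
}
```

The intended output would be "link text"; the regex has no notion of quoted attribute values, so part of the attribute leaks into the result.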

Use a proper parser like Jsoup, which can do all the hard work for you.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String str="<a href=\"https://blabla.com/truck\">truck</a> A wheeled, powered motor vehicle used for transportation.";
Document doc = Jsoup.parse(str);
String text = doc.text(); // the text a browser would render for this HTML
System.out.println(text);

Result: truck A wheeled, powered motor vehicle used for transportation.

Community
Pshemo
  • I can't use the Jsoup library in this project. Is there any way to do it with regex? – gb051 Aug 03 '15 at 22:13
  • Generally it is a very bad choice to use regex to parse complex XML/HTML, so the answer depends on the complexity of your text. You would need to update your question with as many details as possible about the structure you want to parse. You could simply try removing everything between `<` and `>`, but if the text can contain sections such as `<![CDATA[ ...]]`, this may fail. – Pshemo Aug 03 '15 at 22:16
  • @gb051 I updated my answer to provide a simple solution for this case, but I can't guarantee that it will work for the rest of your cases. – Pshemo Aug 03 '15 at 22:27
  • In a regular expression, put '?' after '*' to make the quantifier lazy (less greedy). – suztomo Aug 03 '15 at 22:30
  • @gonbe Why? I used `[^>]*` precisely to avoid `.*?` and its backtracking. – Pshemo Aug 03 '15 at 22:33
  • Then it should work. – suztomo Aug 03 '15 at 22:37
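
The greedy versus lazy point from the comments above can be checked with a short sketch on the question's string. The class name `QuantifierComparison` and the helper `strip` are hypothetical; the three patterns are the greedy `<.*>`, the lazy `<.*?>`, and the negated class `<[^>]*>` used in the answer.

```java
class QuantifierComparison {
    // Removes every match of tagRegex from html.
    static String strip(String html, String tagRegex) {
        return html.replaceAll(tagRegex, "");
    }

    public static void main(String[] args) {
        String str = "<a href=\"https://blabla.com/truck\">truck</a> A wheeled, powered motor vehicle used for transportation.";

        // Greedy: one match from the first '<' to the last '>', so "truck" is lost.
        System.out.println("[" + strip(str, "<.*>") + "]");
        // Lazy: each tag is matched separately, so "truck" survives.
        System.out.println("[" + strip(str, "<.*?>") + "]");
        // Negated class: same result as the lazy form on this input.
        System.out.println("[" + strip(str, "<[^>]*>") + "]");
    }
}
```

On this input the lazy and negated-class forms give the same text, while the greedy form swallows the link text along with both tags.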