
How do I remove the HTML completely and get the remaining text?

'Abdulsalami</title><style>.ag8o{position:absolute;clip:rect(434px,auto,auto,434px);}</style><div class=ag8o>Spending time doing you <a href=http://arr'

I want to get 'Abdulsalami'.

What would be the regex to do that?

skcrpk
  • 2
    Is there a good reason you want to try a regex? (There are tons of good reasons not to.) – Jongware Feb 21 '15 at 16:45
  • 2
    possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Mureinik Feb 21 '15 at 16:45
  • The input string is not well formed, the tag at the end is "corrupt". I do not understand why you want to ignore
    tag text? Style tag should be ignored, but ... you just should explain what your requirements are. Do you just want to extract text before the first tag? Or the text before tag? Please provide also the language you code in.
    – Wiktor Stribiżew Feb 21 '15 at 17:23
  • There was an SQL injection on my database; all rows and columns have data like this. Each column has a different ending, but it always starts with so I need to find and replace all of the SQL-injected text. I will use the regex on an SQL file which has INSERT statements. – skcrpk Feb 21 '15 at 17:59

1 Answer

A single regex can't reliably match all the variations of HTML.

Try an HTML parser such as Jsoup:

import org.jsoup.Jsoup;

// Parse the HTML and return only its text content.
public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Or use the Jericho HTML Parser (available from http://jericho.htmlparser.net/docs/index.html):

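import net.htmlparser.jericho.Renderer;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;

Source htmlSource = new Source(htmlText);
Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
Renderer htmlRend = new Renderer(htmlSeg);
System.out.println(htmlRend.toString());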
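
For the exact sample in the question, where the wanted value is everything before the first tag, plain string handling may be enough and no regex or parser is needed. The following is only a minimal sketch under one assumption: the injected payload always begins at the first `<` and the legitimate value never contains that character. The class name `TextBeforeTag` and the helper `textBeforeFirstTag` are hypothetical names chosen for illustration.

```java
public class TextBeforeTag {

    // Hypothetical helper: returns the text that precedes the first '<'.
    // Assumption: legitimate values never contain '<' themselves.
    static String textBeforeFirstTag(String value) {
        int firstTag = value.indexOf('<');
        // No '<' found: treat the value as clean and keep all of it.
        String kept = (firstTag < 0) ? value : value.substring(0, firstTag);
        return kept.trim();
    }

    public static void main(String[] args) {
        // The sample string from the question, including its truncated last tag.
        String injected = "Abdulsalami</title><style>.ag8o{position:absolute;"
                + "clip:rect(434px,auto,auto,434px);}</style>"
                + "<div class=ag8o>Spending time doing you <a href=http://arr";
        System.out.println(textBeforeFirstTag(injected)); // prints Abdulsalami
    }
}
```

Because this cuts at the first `<`, it also copes with the broken tag at the end of the sample. If real values can contain `<`, this shortcut is not safe and a parser-based approach is the better fit.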