
How can I remove these:

<td>&nbsp;</td>

or

<td width="7%">&nbsp;</td>

from my JSoup 'Document'? I've tried many methods, but these non-breaking space characters do not match anything with normal JSoup expressions or Selectors.

animuson
Nick Betcher
  • Is it not possible to open the document in an IDE or text editor like Notepad++ and then do a find and replace? Or do you mean you need to do it programmatically? – tw16 Aug 12 '11 at 01:39
  • JSoup is a library that parses FETCHED HTML data for an application. So no, what you're suggesting is not only not possible, but not applicable. :) – Nick Betcher Aug 12 '11 at 01:40
  • Does this apply to the entire document or only within `<td>` elements? By the way, are you aware that the MSIE browser has rendering problems with completely empty `<td>` elements? A `&nbsp;` is namely a classic workaround for this MSIE misbehaviour. – BalusC Aug 12 '11 at 01:44
  • Have you tried something like `response.replaceAll("&nbsp", "")` before it goes through Jsoup? – tw16 Aug 12 '11 at 01:45
  • @tw16 I want to remove the entire line, not just the `&nbsp;`. Plus, I am using JSoup.connect("http://www.blah.com").get() which doesn't allow you to modify the document before parsing. – Nick Betcher Aug 12 '11 at 01:48
  • "The entire line" is too ambiguous. HTML does not have a notion of "lines". You should then really feed `URL#openStream()` through a `BufferedReader` and then ignore the `readLine()` whenever it `contains("&nbsp;")`. – BalusC Aug 12 '11 at 01:51
  • @BalusC I'm certain there has to be a way to delete every element that has `&nbsp;` in it, so I would rather not have to go to all of that work as a workaround. The JSoup website suggests getting help on Stackoverflow with the #jsoup tag, but so far this very simple issue remains unresolved. :( – Nick Betcher Aug 12 '11 at 07:51
  • It is possible. But you said to remove the entire line. This is not possible with Jsoup. You can only select and remove elements, not lines. – BalusC Aug 12 '11 at 12:08

1 Answer


The HTML entity &nbsp; (Unicode character NO-BREAK SPACE, U+00A0) can be represented in Java by the character \u00a0. Assuming that you want to remove every element which contains that character in its own text (and thus not every line, as you said in a comment), the following ought to work:

document.select(":containsOwn(\u00a0)").remove();
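As a side note on why ordinary whitespace handling tends to miss this character, here is a minimal sketch using only the Java standard library. The class name `NbspCheck` and the helper `containsNbsp` are made up for illustration and are not part of jsoup.

```java
final class NbspCheck {
    /** True if the text contains the NO-BREAK SPACE character U+00A0. */
    static boolean containsNbsp(String text) {
        return text.indexOf('\u00a0') >= 0;
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0";
        // U+00A0 is a different character from the regular space U+0020.
        System.out.println(nbsp.equals(" "));                 // false
        // Java does not classify it as whitespace, and trim() leaves it alone.
        System.out.println(Character.isWhitespace('\u00a0')); // false
        System.out.println(nbsp.trim().isEmpty());            // false
        System.out.println(containsNbsp("a\u00a0b"));         // true
    }
}
```

This only illustrates how the Java standard library treats the character; how a particular jsoup version normalises it inside a selector may differ.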

If you really mean to remove the entire line, then your best bet is to scan the HTML yourself, line by line.
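A rough sketch of that line-by-line scan, using only the Java standard library, might look like the following. The class name `NbspLineFilter` and the method `dropNbspLines` are hypothetical, and the sample HTML in `main` is invented for the demonstration; in real use the `Reader` would wrap the stream obtained from `URL#openStream()`, as suggested in the comments.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class NbspLineFilter {
    /** Copies the input line by line, skipping every line that mentions a non-breaking space. */
    static String dropNbspLines(Reader html) {
        StringBuilder kept = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(html)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Skip both the raw entity and an already-decoded U+00A0 character.
                if (line.contains("&nbsp;") || line.indexOf('\u00a0') >= 0) {
                    continue;
                }
                kept.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Invented sample input, one tag per line.
        String html = "<tr>\n<td>&nbsp;</td>\n<td width=\"7%\">&nbsp;</td>\n<td>data</td>\n</tr>\n";
        // Prints the <tr>, <td>data</td> and </tr> lines only.
        System.out.print(dropNbspLines(new StringReader(html)));
    }
}
```

The filtered string can then be handed to `Jsoup.parse(String)`. Note that this drops whole source lines, so it only behaves sensibly when each unwanted cell sits on its own line.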

BalusC
  • `document.select(":containsOwn(\u00a0)").remove();` is removing all elements that contain a space, but I want to remove only tags that contain `&nbsp;` (see https://jsoup.org/apidocs/org/jsoup/select/Selector.html#:~:text=p%3AcontainsOwn(jsoup)%20finds,The%20search%20is%20case%20insensitive.). I then tried `:containsWholeOwnText(\u00a0)`, but it throws an "unexpected token" error at `containsWholeOwnText(\u00a0)`. I am not sure what the issue is. – asifaftab87 Nov 24 '22 at 02:57
  • *"it is removing all elements having space"*. False. It is removing all elements having a non-breaking space, exactly the `\u00a0` character represented by the HTML entity `&nbsp;`. This is already explained in the answer. If it is not working for you, then I guess your actual problem is different. Are you perhaps trying to say that you're seeing **literally** `&nbsp;` among the text presented by the web browser itself? So in other words, the HTML source code *actually* contains `&amp;nbsp;` instead of `&nbsp;`? – BalusC Nov 24 '22 at 10:20