When parsing HTML with Jsoup, if there is a new line character in a string of text, it is treated as if it were not there. Consider:
This string of text will wrap
here because of a new line character
But when Jsoup parses this string it returns:
This string of text will wraphere because of a new line character
Note that the newline character does not even become a space; I just want it to be returned with a space. This is the text within a node. I have seen other solutions on Stack Overflow where people want, or don't want, a line break after a tag. That is not what I want. I simply want to know whether I can modify the parse function so that it does not ignore new line characters.
2 Answers
Can you try getWholeText, based on the answers here: Prevent Jsoup from discarding extra whitespace
/**
 * @param cell element that contains whitespace formatting
 * @return the whole (unnormalised) text of the first child text node, falling back to the element's normalised text
 */
public static String getText(Element cell) {
    String text = null;
    List<Node> childNodes = cell.childNodes();
    if (childNodes.size() > 0) {
        Node childNode = childNodes.get(0);
        if (childNode instanceof TextNode) {
            text = ((TextNode) childNode).getWholeText();
        }
    }
    if (text == null) {
        text = cell.text();
    }
    return text;
}
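For comparison, a quick usage sketch (the HTML snippet below is made up for illustration): text() normalises whitespace, while getWholeText() on the underlying TextNode returns it exactly as it appeared in the source, newline included.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class WholeTextDemo {
    public static void main(String[] args) {
        // Made-up input whose text node contains a raw newline
        String html = "<p>This string of text will wrap\nhere because of a new line character</p>";
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();

        System.out.println(p.text());                                   // whitespace normalised
        System.out.println(((TextNode) p.childNode(0)).getWholeText()); // raw newline preserved
    }
}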

Yeshodhan Kulkarni
This works, but only for the text in each element. It would require me to know what elements there are and call this method on each one. What I really want is for Jsoup.parse(html) to return cleaned html with all ASCII characters remaining. I am parsing very poorly formatted html and am having to resort to turning the contents of elements into strings, then formatting the strings and taking substrings based on predetermined text matching. I just need to be able to know where the line breaks are so that I can split the string on them. – kotval May 14 '17 at 22:24
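A sketch of another way to keep the raw whitespace (assuming the newlines actually reach Jsoup.parse in the first place): disable Jsoup's pretty-printing before re-serialising, so html() does not reindent or collapse whitespace in text nodes.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RawWhitespaceDemo {
    public static void main(String[] args) {
        // Made-up input with a raw newline in the text
        String html = "<p>line one\nline two</p>";
        Document doc = Jsoup.parse(html);
        // With pretty-printing off, Jsoup serialises text nodes as-is,
        // so the original newline survives in the output HTML.
        doc.outputSettings().prettyPrint(false);
        System.out.println(doc.body().html());
    }
}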
I figured it out. I made a mistake in getting the html from the url. I was using this method:
public static String getUrl(String url) {
    URL urlObj = null;
    try {
        urlObj = new URL(url);
    }
    catch (MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try {
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while ((line = in.readLine()) != null) {
            // readLine() strips the line terminator, so no newline ever reaches outputText
            outputText += line;
        }
        in.close();
    }
    catch (IOException e) {
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
When I should have been using the following:
public static String getUrl(String url) {
    URL urlObj = null;
    try {
        urlObj = new URL(url);
    }
    catch (MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try {
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while ((line = in.readLine()) != null) {
            // re-append the newline that readLine() stripped
            outputText += line + "\n";
        }
        in.close();
    }
    catch (IOException e) {
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
The problem had nothing to do with Jsoup. I thought I would note it here since I copied this code from Instant Web Scraping with Java by Ryan Mitchell, and anyone else following that tutorial might run into the same issue.
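For what it's worth, letting Jsoup fetch the page itself sidesteps the manual stream reading entirely (a sketch; the URL is a placeholder and default connection settings are assumed):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchDemo {
    public static void main(String[] args) throws IOException {
        // Jsoup downloads and parses in one step, so there is no
        // readLine() loop that could silently drop the line terminators.
        Document doc = Jsoup.connect("https://example.com/").get();
        System.out.println(doc.title());
    }
}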

kotval
The line breaks I am asking about are not the ones produced by the tags `<br>` or `<p>` in the html. What I am referring to is a new line that occurs because of the ASCII characters `CR`, `LF`, or `CR+LF`. Jsoup can identify where the tags `<br>` or `<p>` are, and the solution you referenced is based on them. Jsoup does not seem to have a way to recognize these ASCII characters and treat them separately. – kotval May 14 '17 at 20:44