Change difficult string with unknown substring

Question

Update: I'm using Jsoup to parse the text.
While parsing one site, I ran into a problem: in the HTML I get back, some of the links are corrupted by a space in a random place. For example:

What a pretty flower! <a href="www.goo gle.com/...">here</a> and <a href="w ww.google.com...">here</a>

As you may notice, the position of the space is completely random, but one thing is certain: it's always inside an href attribute. Of course, I could use the replace(" ", "") method on a single link, but there may be two or more links in the text. How can I solve this problem?

What's wrong with using `replace(" ", "")` on all href values? Also, why try to fix data from a site that returns garbage? — Ted Hopp, Feb 21 '14 at 18:57
There's also regex which you can use to identify your links if you want to only use `replace` on them. Or [JSoup](http://jsoup.org/) (see [this question](http://stackoverflow.com/questions/9071568/parse-web-site-html-with-java)) — eebbesen, Feb 21 '14 at 19:00
Yes, I'm using Jsoup to parse, but changing substring won't change the initial string, right? — Groosha, Feb 21 '14 at 19:02

score 1 · Accepted Answer · answered Feb 21 '14 at 19:04

This is sort of an old solution, but I'd try the old, retired Apache ECS to parse your HTML, remove the spaces only in the href links, and then re-create everything :-) If I remember correctly, there was a way to build an ECS "DOM" from HTML.

http://svn.apache.org/repos/asf/jakarta/ecs/branches/ecs/src/java/org/apache/ecs/html2ecs/Html2Ecs.java

Another option is to selectively get your hrefs using something like XPath, but you'd have to deal with malformed HTML (you could give Tidy a try: http://infohound.net/tidy/).

score 0 · Answer 2 · answered Feb 21 '14 at 19:32

You could use regular expressions to find and "refine" the URLs:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLRegex {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        final String INPUT = "Hello World <a href=\"http://ww w.google.com\">Google</a> Second " + 
                             "Hello World <a href=\"http://www.wiki pedia.org\">Wikipedia</a> Test" + 
                             "<a href=\"https://www.example.o rg\">Example</a> Test Test";
        System.out.println(INPUT);

        // This pattern matches a sequence of one or more spaces.
        // Precompile it here, so we don't have to do it in every iteration of the loop below.
        Pattern SPACES_PATTERN = Pattern.compile("\\u0020+");

        // The regular expression below is very primitive and does not check whether the URL is valid.
        // Only very simple URLs are matched: the value must start with http:// or https://, so URLs with
        // other protocols, account credentials, or no scheme at all (like "www.goo gle.com") are not matched.
        // For more sophisticated regular expressions have a look at: http://stackoverflow.com/questions/161738/
        Pattern PATTERN_A_HREF = Pattern.compile("https?://[A-Za-z0-9\\.\\-\\u0020\\?&\\=#/]+");
        Matcher m = PATTERN_A_HREF.matcher(INPUT);

        // Iterate through all matching strings:
        while (m.find()) {
            String urlThatMightContainSpaces = m.group();   // Get the current match
            Matcher spaceMatcher = SPACES_PATTERN.matcher(urlThatMightContainSpaces);
            System.out.println(spaceMatcher.replaceAll(""));  // Replaces all spaces by nothing.
        }

    }
}
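
The code above only prints the cleaned URLs. If you also need the whole string back with the links fixed in place, one possible approach is sketched below. It is only a sketch under some assumptions: the class name `HrefSpaceFixer` and the method `fixHrefs` are made up for illustration, the sample input is invented, and the pattern assumes href values are wrapped in double quotes with no escaped quotes inside. It uses `Matcher.appendReplacement` and `appendTail` to copy everything outside the href values unchanged, so spaces in normal text and link labels are left alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or any other library.
final class HrefSpaceFixer {

    // Group 1: href=" (with optional whitespace around '='), group 2: the value, group 3: closing quote.
    // Assumption: values are double-quoted and contain no escaped quotes.
    private static final Pattern HREF = Pattern.compile("(href\\s*=\\s*\")([^\"]*)(\")");

    // Returns a new string in which spaces are removed only inside href="..." values.
    static String fixHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group(2).replace(" ", "");
            String replacement = m.group(1) + cleaned + m.group(3);
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample input in the spirit of the question.
        String input = "What a pretty flower! <a href=\"www.exam ple.com/page\">here</a>"
                     + " and <a href=\"w ww.example.org\">here</a>";
        System.out.println(fixHrefs(input));
    }
}
```

Strings in Java are immutable, so `fixHrefs` cannot change the original; it returns a new string that you use in its place. Since you are already parsing with Jsoup, rewriting the attribute on the parsed elements is probably more robust than a regex over raw HTML, but the idea is the same: touch only the href values and leave the rest as it is.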
