
I'm migrating my code to Java 20.

In this release, the java.net.URL#URL(java.lang.String) constructor was deprecated. Unfortunately, I have a class for which I found no replacement for the old URL constructor.

package com.github.bottomlessarchive.loa.url.service.encoder;

import io.mola.galimatias.GalimatiasParseException;
import org.springframework.stereotype.Service;

import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Optional;

/**
 * This service is responsible for encoding existing {@link URL} instances to valid
 * <a href="https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier">resource identifiers</a>.
 */
@Service
public class UrlEncoder {

    /**
     * Encodes the provided URL to a valid
     * <a href="https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier">resource identifier</a> and returns
     * the new identifier as a URL.
     *
     * @param link the url to encode
     * @return the encoded url
     */
    public Optional<URL> encode(final String link) {
        try {
            final URL url = new URL(link);

            // We need to further validate the URL because the java.net.URL's validation is inadequate.
            validateUrl(url);

            return Optional.of(encodeUrl(url));
        } catch (GalimatiasParseException | MalformedURLException | URISyntaxException e) {
            return Optional.empty();
        }
    }

    private void validateUrl(final URL url) throws URISyntaxException {
        // This throws a URISyntaxException for invalid URLs. It is needed because the constructor of java.net.URL doesn't always
        // validate the passed url correctly.
        new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
    }

    private URL encodeUrl(final URL url) throws GalimatiasParseException, MalformedURLException {
        return io.mola.galimatias.URL.parse(url.toString()).toJavaURL();
    }
}

Luckily, I have tests for the class as well:

package com.github.bottomlessarchive.loa.url.service.encoder;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Optional;

import static org.assertj.core.api.Assertions.assertThat;

class UrlEncoderTest {

    private final UrlEncoder underTest = new UrlEncoder();

    @ParameterizedTest
    @CsvSource(
            value = {
                    "http://www.example.com/?test=Hello world,http://www.example.com/?test=Hello%20world",
                    "http://www.example.com/?test=ŐÚőúŰÜűü,http://www.example.com/?test=%C5%90%C3%9A%C5%91%C3%BA%C5%B0%C3%9C%C5%B1%C3%BC",
                    "http://www.example.com/?test=random word £500 bank $,"
                            + "http://www.example.com/?test=random%20word%20%C2%A3500%20bank%20$",
                    "http://www.aquincum.hu/wp-content/uploads/2015/06/Aquincumi-F%C3%BCzetek_14_2008.pdf,"
                            + "http://www.aquincum.hu/wp-content/uploads/2015/06/Aquincumi-F%C3%BCzetek_14_2008.pdf",
                    "http://www.aquincum.hu/wp-content/uploads/2015/06/Aquincumi-F%C3%BCzetek_14 _2008.pdf,"
                            + "http://www.aquincum.hu/wp-content/uploads/2015/06/Aquincumi-F%C3%BCzetek_14%20_2008.pdf"
            }
    )
    void testEncodeWhenUsingValidUrls(final String urlToEncode, final String expected) throws MalformedURLException {
        final Optional<URL> result = underTest.encode(urlToEncode);

        assertThat(result)
                .contains(new URL(expected));
    }

    @ParameterizedTest
    @CsvSource(
            value = {
                    "http://промкаталог.рф/PublicDocuments/05-0211-00.pdf"
            }
    )
    void testEncodeWhenUsingInvalidUrls(final String urlToEncode) {
        final Optional<URL> result = underTest.encode(urlToEncode);

        assertThat(result)
                .isEmpty();
    }
}

The only dependency it uses is the galimatias URL library.

Does anyone have any ideas on how I could remove the new URL(link) code fragment while keeping the functionality the same?

I tried various things, like using java.net.URI#create, but it did not produce exactly the same result as the previous solution. For example, URLs that contain non-encoded characters, like the space in http://www.example.com/?test=Hello world, resulted in an IllegalArgumentException. The URL class parsed these without giving an error (and my data contains a lot of them). Also, links that failed the URL conversion, like http://промкаталог.рф/PublicDocuments/05-0211-00.pdf, are converted to a URI successfully by URI.create.

Lakatos Gyula
  • @Hulk URLs that contain non-encoded characters like a space in "http://www.example.com/?test=Hello world". This was parsed by the URL class without giving an error (and my data contains a lot of these). Also, links that failed the URL conversion like "http://промкаталог.рф/PublicDocuments/05-0211-00.pdf" are converted to URI successfully with URI.create. – Lakatos Gyula Apr 02 '23 at 08:48
  • You are right. I updated my question. – Lakatos Gyula Apr 02 '23 at 19:16
  • How was the space encoded in the old method? I have sometimes seen %20 for a space in a URL. You could add that substitution yourself. Are there some parameters you need to set in the new method to cover more characters? – rossum Apr 02 '23 at 19:25
  • @rossum As seen in the test class. Some of them are already properly URL encoded, while others are not. :/ The URL's construction did not validate against URL encoding. Then the URI was created like this for further validation: `new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());` – Lakatos Gyula Apr 02 '23 at 20:04

1 Answer


The problem

The main problem seems to be that the UrlEncoder service is dealing with a mix of encoded, unencoded and partially encoded URLs. More than that, there isn't a good way to know which one is which.

This leads to ambiguity because certain characters can have different meanings when encoded vs unencoded. For instance, given a partially encoded URL, it isn't trivial to tell if a character such as '&' is part of a query parameter value (and thus should be encoded) or acting as a separator (and thus shouldn't be encoded):

https://www.example.com/test?firstQueryParam=hot%26cold&secondQueryParam=test
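
To make the ambiguity concrete, here is a minimal sketch that uses only the JDK's URLDecoder. The class and method names are mine and purely illustrative. It decodes the query string of the URL above and counts the ampersands before and after:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class AmbiguityDemo {

    // Decodes a raw query string; after this step "%26" and '&' look the same.
    static String decodeQuery(String rawQuery) {
        return URLDecoder.decode(rawQuery, StandardCharsets.UTF_8);
    }

    // Counts the '&' characters, i.e. the separators a naive parser would see.
    static long countAmpersands(String s) {
        return s.chars().filter(c -> c == '&').count();
    }

    public static void main(String[] args) {
        String rawQuery = "firstQueryParam=hot%26cold&secondQueryParam=test";
        String decoded = decodeQuery(rawQuery);
        System.out.println(decoded);
        System.out.println(countAmpersands(rawQuery) + " -> " + countAmpersands(decoded));
    }
}
```

Once decoded, nothing in the string says which ampersand was the separator and which one belonged to the value "hot&cold".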

To add insult to injury, Java's URI implementation deviates from RFC 3986 and RFC 3987 for historical and backwards-compatibility reasons. Here's an interesting read about some of URI's quirks: Updating URI support for RFC 3986 and RFC 3987 in the JDK.

"Fixing" incorrectly encoded URLs by re-encoding without proper knowledge about the original URL is not a trivial problem. Fixing incorrectly encoded URLs using encoders and decoders full of quirks is even harder. A good enough "best effort" heuristic would be my recommendation.

A simple best effort solution

So the good news is that I've managed to implement a solution that passes all of the above tests. The solution in question leverages Spring Web UriUtils and UriComponentsBuilder. The cherry on the cake is that you may not need galimatias anymore.

Here's the code:

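import org.springframework.web.util.UriComponentsBuilder;
import org.springframework.web.util.UriUtils;

import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class UrlEncoder {

    public Optional<URL> encode(final String link) {
        try {
            // parseServerAuthority() throws if the authority is not a valid [user-info@]host[:port].
            final URI validatedURI = reencode(link).parseServerAuthority();
            return Optional.of(validatedURI.toURL());
        } catch (MalformedURLException | URISyntaxException | IllegalArgumentException e) {
            // IllegalArgumentException covers invalid escape sequences and input that is not an HTTP(S) URL.
            return Optional.empty();
        }
    }

    private URI reencode(String url) { // best effort
        // Decode first so that already-encoded input is not encoded twice.
        final String decodedUrl = UriUtils.decode(url, StandardCharsets.UTF_8);
        return UriComponentsBuilder.fromHttpUrl(decodedUrl)
                .encode()
                .build()
                .toUri();
    }
}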

Here's the gist of it:

  • reencode → best attempt to "fix" URL encoding by decoding and re-encoding
  • parseServerAuthority() → replaces the former validateUrl(url) method.
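
To show in isolation what parseServerAuthority() adds, here is a small sketch that uses plain java.net.URI without the Spring re-encoding step. The class and helper names are hypothetical. URI.create accepts the Cyrillic host from the question, but only as a registry-based authority, so getHost() is null and parseServerAuthority() rejects it:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ServerAuthorityDemo {

    // Returns true when java.net.URI can parse the authority as [user-info@]host[:port].
    static boolean hasServerAuthority(String uri) {
        try {
            URI.create(uri).parseServerAuthority();
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String invalid = "http://промкаталог.рф/PublicDocuments/05-0211-00.pdf";
        // Parsing succeeds, but no host component is recognised.
        System.out.println(URI.create(invalid).getHost());
        System.out.println(hasServerAuthority(invalid));
        System.out.println(hasServerAuthority("http://www.example.com/?test=Hello%20world"));
    }
}
```

This is why the invalid-URL test still ends up with an empty Optional even though URI.create alone does not complain.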

Double encoding ampersands and other special characters

As previously stated, although the code above passes all tests, it is easy enough to come up with a "broken" test case. E.g., running the example URL from "The problem" section through the encoder would result in:

https://www.example.com/test?firstQueryParam=hot&cold&secondQueryParam=test

This is a perfectly valid URL, but likely not what one would be looking for.

We are now entering dangerous territory, but there are ways to implement a more "opinionated" re-encoding algorithm. E.g. the code below deals with ampersands by making sure that %26 isn't decoded:

private static final char PERCENT_SIGN = '%';
private static final String ENCODED_PERCENT_SIGN = "25";
private static final String[] CODES_TO_DOUBLE_ENCODE = new String[]{
        "26" // code for '&'
};

private URI reencode(String url) throws URISyntaxException {
    final String urlWithDoubleEncodedSpecialCharacters = doubleEncodeSpecialCharacters(url);
    final String decodedUrl = UriUtils.decode(urlWithDoubleEncodedSpecialCharacters, StandardCharsets.UTF_8);
    final String encodedUrl = UriComponentsBuilder.fromHttpUrl(decodedUrl).toUriString();
    final String encodedUrlWithSpecialCharacters = decodeDoubleEncodedSpecialCharacters(encodedUrl);

    return URI.create(encodedUrlWithSpecialCharacters);
}

private String doubleEncodeSpecialCharacters(String url) {
    final StringBuilder sb = new StringBuilder(url);
    for (String code : CODES_TO_DOUBLE_ENCODE) {
        final String codeString = PERCENT_SIGN + code;
        int index = sb.indexOf(codeString);
        while (index != -1) {
            sb.insert(index + 1, ENCODED_PERCENT_SIGN);
            index = sb.indexOf(codeString, index + 3);
        }
    }
    return sb.toString();
}

private String decodeDoubleEncodedSpecialCharacters(String url) {
    final StringBuilder sb = new StringBuilder(url);
    for (String code : CODES_TO_DOUBLE_ENCODE) {
        final String codeString = PERCENT_SIGN + ENCODED_PERCENT_SIGN + code;
        int index = sb.indexOf(codeString);
        while (index != -1) {
            sb.delete(index + 2, index + 4);
            // Resume right after the restored 3-character escape; skipping 5 would miss a nearby occurrence.
            index = sb.indexOf(codeString, index + 3);
        }
    }
    return sb.toString();
}

The solution above can be modified to deal with other escape sequences (e.g., to cover all of RFC 3986's reserved characters), as well as to use more sophisticated heuristics (e.g., to treat query parameters differently from, say, path segments).
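
For a quick look at why the double-encoding step keeps %26 intact, here is a rough, Spring-free sketch of just the "protect, then decode" half of the idea. It uses URLDecoder as a stand-in for UriUtils.decode (an assumption on my part; URLDecoder also turns '+' into a space, which UriUtils.decode does not), and the class and method names are illustrative only:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class ProtectAmpersandDemo {

    // Turns every "%26" into "%2526", so one decoding pass yields "%26" instead of '&'.
    static String protectEncodedAmpersands(String url) {
        return url.replace("%26", "%2526");
    }

    // Stand-in for Spring's UriUtils.decode (assumption: the input contains no '+').
    static String decode(String url) {
        return URLDecoder.decode(url, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String url = "https://www.example.com/test?firstQueryParam=hot%26cold&secondQueryParam=test";
        System.out.println(decode(protectEncodedAmpersands(url)));
    }
}
```

The decoded string still contains hot%26cold, so the later encoding and clean-up steps can restore the original escape. Protecting another reserved character would mean adding its code next to "26".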

Nevertheless, as someone who went down this rabbit hole before, I can tell you that once you know that you are dealing with incorrectly encoded URLs outside of your control, there simply isn't a perfect solution.

Anthony Accioly
  • Woahh, thanks a great deal! This is one of the best answers I got to my question since I'm on this site! I'll award you the 200 points bounty as soon as I can (in 24 hours). – Lakatos Gyula Apr 04 '23 at 08:51
  • Thanks for the kind words and for the bounty :). – Anthony Accioly Apr 04 '23 at 10:10
  • I have a similar problem: https://stackoverflow.com/questions/75966165/how-to-replace-the-deprecated-url-constructors-in-java-20 Please can you provide a solution without Spring? Jakarta EE has UriBuilder too: https://javadoc.io/doc/jakarta.platform/jakarta.jakartaee-api/latest/jakarta/ws/rs/core/UriBuilder.html – gouessej Apr 17 '23 at 08:12