I would like to validate a long list of URL strings, but some of them contain non-ASCII characters such as umlauts and accented letters, e.g. ä, à, è, ö.
Is there a way to configure the Apache Commons UrlValidator to accept these characters?
This test fails (notice the ã):
    @Test
    public void urlValidatorShouldPassWithUmlaut()
    {
        // Given
        org.apache.commons.validator.routines.UrlValidator validator;
        validator = new UrlValidator( new String[] { "http", "https" }, UrlValidator.ALLOW_ALL_SCHEMES );
        // When
        String url = "http://dbpedia.org/resource/São_Paulo";
        // Then
        assertThat( validator.isValid( url ), is( true ) );
    }
This test passes (ã replaced with a):
    @Test
    public void urlValidatorShouldPassWithUmlaut()
    {
        // Given
        org.apache.commons.validator.routines.UrlValidator validator;
        validator = new UrlValidator( new String[] { "http", "https" }, UrlValidator.ALLOW_ALL_SCHEMES );
        // When
        String url = "http://dbpedia.org/resource/Sao_Paulo";
        // Then
        assertThat( validator.isValid( url ), is( true ) );
    }
Software version:
    <dependency>
        <groupId>commons-validator</groupId>
        <artifactId>commons-validator</artifactId>
        <version>1.4.0</version>
    </dependency>
Update: validator.isValid( IDN.toASCII(url) ) also fails, because IDN.toASCII(url) does things that I don't yet understand, e.g. it converts http://dbpedia.org/resource/São_Paulo into http://dbpedia.xn--org/resource/so_paulo-w1b, which is still invalid according to UrlValidator.
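A likely explanation for that output: IDN.toASCII is meant for host names, not whole URLs. It splits its input on dots and Punycode-encodes every label that contains non-ASCII characters, so the "label" org/resource/São_Paulo gets the xn-- treatment. Below is a minimal sketch, using only the JDK, of converting the host and the path separately before validating. The class and method names (UrlAsciiSketch, toAsciiUrl) are made up for illustration, and it assumes the input is not already percent-encoded.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of Commons Validator.
class UrlAsciiSketch {

    // Returns an ASCII-only form of the given URL string.
    // Assumption: the input contains no percent-escapes yet; the multi-argument
    // URI constructor always quotes '%', so existing escapes would be double-encoded.
    static String toAsciiUrl(String raw) {
        try {
            // java.net.URL parses leniently and does not reject non-ASCII characters.
            URL parsed = new URL(raw);

            // Punycode applies to the host name only.
            String asciiHost = IDN.toASCII(parsed.getHost());

            // Rebuild the URI from its components; illegal characters are quoted here.
            URI uri = new URI(
                    parsed.getProtocol(),
                    parsed.getUserInfo(),
                    asciiHost,
                    parsed.getPort(),
                    parsed.getPath(),
                    parsed.getQuery(),
                    parsed.getRef());

            // toASCIIString() percent-encodes the remaining non-ASCII characters as UTF-8.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Not a parseable URL: " + raw, e);
        }
    }
}
```

With this, toAsciiUrl("http://dbpedia.org/resource/São_Paulo") gives http://dbpedia.org/resource/S%C3%A3o_Paulo, and that string could then be passed to validator.isValid(...). Whether UrlValidator 1.4.0 accepts every percent-encoded URL produced this way is something I have not verified across a large list.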