So I'm trying to scrape a grammar website that gives you conjugations of verbs, but I'm having trouble accessing the pages that require accents, such as the page for the verb "fág".

Here is my current code:

    String url = "http://www.teanglann.ie/en/gram/"+ URLEncoder.encode("fág","UTF-8");
    System.out.println(url);

I've tried this both with and without URLEncoder.encode(), and I keep getting a '?' in place of the 'á', so the request for the URL returns nothing. Is there something similar to Python's 'urllib.parse.quote_plus'? I've searched and tried many different approaches from StackOverflow, all to no avail. Any help would be greatly appreciated.

Eventually I'm going to replace the hard-coded string with a user-supplied argument; it's only there for testing at the moment.

Solution: It wasn't Java, but IntelliJ.

davissandefur
  • Your code works fine. What is the encoding of your "source" file? I feel the test is wrong. – Jayan May 16 '15 at 01:52
  • I'm using IntelliJ if that makes any difference. The result I get when I print is: `http://www.teanglann.ie/en/gram/f%EF%BF%BDg`, which definitely isn't the website I want to be going to. – davissandefur May 16 '15 at 02:00
  • I get `http://www.teanglann.ie/en/gram/f%C3%A1g`, which is the correct one. Are you running Java with the correct encoding, e.g. `java -Dfile.encoding=UTF-8`? – Jayan May 16 '15 at 02:05
  • IntelliJ - file encoding : http://blog.jetbrains.com/idea/2013/03/use-the-utf-8-luke-file-encodings-in-intellij-idea/ – Jayan May 16 '15 at 02:06
  • The issue was IntelliJ's file encoding. I feel really stupid now for not thinking of that. Thank you very much! – davissandefur May 16 '15 at 02:13
  • No need to feel bad. You learnt a lot :). I am providing this as answer for future reference. – Jayan May 16 '15 at 02:49

1 Answer

Summary from the comments

The test code works fine.

import java.io.UnsupportedEncodingException;
import static java.net.URLEncoder.encode;

public class MainApp {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String url = "http://www.teanglann.ie/en/gram/"+ encode("fág", "UTF-8");
        System.out.println(url);
    }
}

It prints:

http://www.teanglann.ie/en/gram/f%C3%A1g

which leads to the correct page.

The correct steps are:

  • Ensure that the source file encoding is correct. (IntelliJ may not guess it correctly.)
  • Run the program with the appropriate encoding (UTF-8 in this case).

(See What is the default encoding of the JVM? for a relevant discussion)
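To see where the `f%EF%BF%BDg` output reported in the comments can come from, here is a minimal sketch. It assumes the source file was stored in a single-byte encoding such as ISO-8859-1 but read as UTF-8; that is one plausible mismatch, not a confirmed diagnosis of the asker's setup. The class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {
    // Simulates text saved as ISO-8859-1 but decoded as UTF-8 (assumed mismatch).
    // The lone byte 0xE1 is not valid UTF-8, so it becomes U+FFFD.
    static String misread(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    // Wraps URLEncoder.encode so callers need not handle the checked exception.
    static String encodeUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // The \u00e1 escape keeps this literal independent of the file encoding.
        String verb = "f\u00e1g";
        System.out.println(encodeUtf8(verb));          // correctly decoded source
        System.out.println(encodeUtf8(misread(verb))); // mis-decoded source
    }
}
```

The first line printed is `f%C3%A1g` and the second is `f%EF%BF%BDg`, the percent-encoded form of the Unicode replacement character. When compiling outside the IDE, `javac -encoding UTF-8` states the source encoding explicitly.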

Edit from Wyzard's comment

The code above works only by accident, because the verb contains no whitespace (URLEncoder would encode a space as "+" rather than "%20"). A more correct way to get the encoded URL is:

    // requires: import java.net.URI; the constructor throws URISyntaxException
    String url = "http://www.teanglann.ie/en/gram/fág";
    System.out.println(new URI(url).toASCIIString());

This uses URI.toASCIIString(), which follows RFC 2396, "Uniform Resource Identifiers (URI): Generic Syntax".
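The difference only shows up once the input contains a space. The sketch below compares the two approaches; the phrase "fág amach" is an illustrative input only (the site may have no such page), and the class and helper names are hypothetical. It uses the multi-argument `URI` constructor, which quotes illegal characters such as spaces itself, whereas the single-argument `new URI(String)` throws `URISyntaxException` when given a raw space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

class PathEncodingDemo {
    // HTML form encoding: a space becomes "+", which is wrong inside a URL path.
    static String formEncode(String word) {
        try {
            return URLEncoder.encode(word, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // URI encoding: the multi-argument constructor quotes the space as %20,
    // and toASCIIString() percent-encodes non-ASCII characters as UTF-8.
    static String pathUrl(String word) {
        try {
            return new URI("http", "www.teanglann.ie", "/en/gram/" + word, null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String phrase = "f\u00e1g amach"; // illustrative input containing a space
        System.out.println(formEncode(phrase)); // f%C3%A1g+amach
        System.out.println(pathUrl(phrase));    // http://www.teanglann.ie/en/gram/f%C3%A1g%20amach
    }
}
```

For a user-supplied word this is probably the safer route, although a "/" in the input would still be treated as a path separator.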

Jayan
  • `URLEncoder`, despite its name, doesn't actually do URL encoding. It does HTML form encoding, which isn't what you want in a URL (except maybe as part of a query string, after a question mark). The key difference is that a space is encoded as "+" instead of "%20". For *actual* URL encoding, you want [`URI.toASCIIString`](https://docs.oracle.com/javase/8/docs/api/java/net/URI.html#toASCIIString--), which follows [RFC 2396](https://www.ietf.org/rfc/rfc2396.txt). – Wyzard May 16 '15 at 04:00
  • Thanks, I did not know about that. I added it to the answer. – Jayan May 16 '15 at 04:21