I've made a Java server that scrapes a website, but after a few requests (about 10 or so) I always get an ElementNotFoundException, although the element should be there. My program checks this website for info every few minutes, and after a few runs it throws that exception. This is my scraping code; I don't know why the element stops being found after a few runs.

final WebClient webClient = new WebClient();
try (final WebClient webClient1 = new WebClient()) {
    final HtmlPage page = webClient.getPage("http://b7rabin.iscool.co.il/מערכתשעות/tabid/217/language/he-IL/Default.aspx");

    WebResponse webResponse = page.getWebResponse();
    String content = webResponse.getContentAsString();
     //   System.out.println(content);


    HtmlSelect select = (HtmlSelect) page.getElementById("dnn_ctr914_TimeTableView_ClassesList");
    HtmlOption option = select.getOptionByValue("" + userClass);

    select.setSelectedAttribute(option, true);

    //String jscmnd = "javascript:__doPostBack('dnn$ctr914$TimeTableView$btnChangesTable','')";
    String jscmnd = "__doPostBack('dnn$ctr914$TimeTableView$btnChanges','')";

    ScriptResult result = page.executeJavaScript(jscmnd);

    HtmlPage page1 = (HtmlPage) result.getNewPage();

    String content1 = page1.getWebResponse().getContentAsString();
    //System.out.println(content1);
    System.out.println("-----");
    HtmlDivision getChanges = null;
    String changes = "";

    getChanges = page1.getHtmlElementById("dnn_ctr914_TimeTableView_PlaceHolder");   
    changes = getChanges.asText();
    changes = changes.replaceAll("\n", "").replaceAll("\r", "");

    System.out.println(changes);
}

The exception:

Exception in thread "Thread-0" com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[*] attributeName=[id] attributeValue=[dnn_ctr914_TimeTableView_PlaceHolder]
at com.gargoylesoftware.htmlunit.html.HtmlPage.getHtmlElementById(HtmlPage.java:1552)
at scrapper$1.run(scrapper.java:108)

I am really desperate to solve it, it's the only bottleneck in my project.

Ahmed Ashour
Eldar Azulay
  • Was my answer for your other question regarding HtmlUnit of any help? – RBRi Feb 12 '17 at 19:02
  • @RBRi Probably yeah, this is the only problem my program has now before I release it to my friends. – Eldar Azulay Feb 12 '17 at 19:08
  • It might be a good idea to vote for that answer before asking for more help. Please keep in mind we are all doing this in our spare time. Why do you call this __doPostBack thing? – RBRi Feb 12 '17 at 19:10
  • @RBRi I will! and I did that because the website I scraped is building elements by javascript commands, so I called the js command in order to get the element I want. I didn't name them that way anyway. – Eldar Azulay Feb 12 '17 at 20:12
  • HtmlUnit does all the js stuff for you, e.g. if you select an option from a select element the associated onchange handler will be processed automatically. – RBRi Feb 12 '17 at 22:40
  • @RBRi I used what you described, but only for selecting the desired class. The user also has to choose the tab he wants to see, like tab#1, tab#2, tab#3, and for that there are no handlers; the user has to click on each tab to open the table and the data it contains. – Eldar Azulay Feb 12 '17 at 22:45
  • Because I can't read the language of that page, I guess you want to click to switch to a different tab. To do this you have to find the anchor and click it (e.g. getElementById('...').click()). This will simulate exactly what a browser does when you click on that tab (including the execution of the javascript handlers). – RBRi Feb 12 '17 at 22:46
  • @RBRi I get what you are saying, but it's not the problem here. Note that my problem wasn't with the javascript execution, but with finding the element which contains the info that I want. – Eldar Azulay Feb 12 '17 at 22:51
  • One more point: it looks like the javascript does a server roundtrip. It might be a good idea to wait a bit to let the async js finish, and after that you can switch to the now refreshed page. – RBRi Feb 12 '17 at 22:51
  • @RBRi So how can I do what you are saying? I don't understand if it should be in the client side or server side. – Eldar Azulay Feb 13 '17 at 13:21

1 Answer

You just need to wait a little before manipulating the second page, as hinted in the comments, because the postback needs time to complete.

So a sleep() of 3 seconds makes it succeed every time.

HtmlPage page1 = (HtmlPage) result.getNewPage();

Thread.sleep(3_000); // sleep for 3 seconds

String content1 = page1.getWebResponse().getContentAsString();

Also, you don't need to instantiate two instances of WebClient.
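A fixed sleep either waits longer than necessary or, on a slow day, not long enough. As a hedged alternative, the same idea can be expressed as polling with a timeout. The sketch below is plain Java with no HtmlUnit dependency; the class name `PollUntilPresent`, the method `pollUntilPresent`, and the timeout values are hypothetical names chosen for illustration, and the lambda in `main` only stands in for a real element lookup.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class PollUntilPresent {

    /**
     * Calls lookup repeatedly until it returns a non-null value or the
     * timeout elapses. Returns Optional.empty() on timeout.
     */
    public static <T> Optional<T> pollUntilPresent(Supplier<T> lookup,
                                                   long timeoutMillis,
                                                   long intervalMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            T value = lookup.get();
            if (value != null) {
                return Optional.of(value);
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            Thread.sleep(intervalMillis); // back off before the next attempt
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the page: the "element" only appears on the third lookup.
        int[] calls = {0};
        Optional<String> found = pollUntilPresent(
                () -> ++calls[0] >= 3 ? "placeholder-div" : null, 2_000, 50);
        System.out.println(found.isPresent() + " after " + calls[0] + " lookups");
    }
}
```

With HtmlUnit, the lookup could be something like `() -> ((HtmlPage) webClient.getCurrentWindow().getEnclosedPage()).getElementById("dnn_ctr914_TimeTableView_PlaceHolder")`, since `getElementById` returns null for a missing element whereas `getHtmlElementById` throws. HtmlUnit also offers `webClient.waitForBackgroundJavaScript(timeoutMillis)`, which may be enough on its own; whether it covers this particular postback is untested here.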

Ahmed Ashour
  • Thank you very much!! It worked! Now I have a really minor and stupid problem. My program streams non-Latin characters to the client. When I run the server in Eclipse, the client gets the data fine, but when I run it by launching the jar file or from Windows cmd, the client gets just gibberish. I made the server save its output to a file to check whether the problem is in the server, and it is not: the data in the file is in the right language, but the client still gets gibberish. Do you know how I can make it work? – Eldar Azulay Feb 14 '17 at 14:40
  • Have a look at http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how/388500#388500 – Ahmed Ashour Feb 14 '17 at 15:03
  • @Ahmed Ashour I don't understand how to set this chcp 65001. And the console does show me the language that I want; it's just the client that gets the gibberish. – Eldar Azulay Feb 14 '17 at 15:20
  • It doesn't work... I tried everything and I don't know why I keep getting this problem. Maybe I can use the console that eclipse uses? Do you know how can I run it on eclipse's console without running eclipse? – Eldar Azulay Feb 14 '17 at 15:27
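One possible cause of the last symptom, assuming the server writes to the client through a Writer or `getBytes()` call that uses the platform default charset (the server code is not shown, so this is a guess): Eclipse often launches programs with a UTF-8 default, while a jar started from Windows cmd may pick up a legacy code page. The sketch below shows the effect of naming the charset explicitly; `ExplicitCharsetDemo` and `encode` are hypothetical names, and an in-memory stream stands in for the socket.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    /** Writes text to an in-memory stream using the given charset and returns the raw bytes. */
    public static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Hebrew "shalom", written with escapes so the source file encoding does not matter.
        String hebrew = "\u05E9\u05DC\u05D5\u05DD";
        byte[] wire = encode(hebrew, StandardCharsets.UTF_8);

        // The receiver must decode with the same charset the sender encoded with.
        String sameCharset = new String(wire, StandardCharsets.UTF_8);
        String wrongCharset = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println("default charset:      " + Charset.defaultCharset());
        System.out.println("same charset equal:   " + hebrew.equals(sameCharset));
        System.out.println("wrong charset equal:  " + hebrew.equals(wrongCharset));
    }
}
```

If this is the cause, passing `StandardCharsets.UTF_8` wherever the server wraps the socket's output stream (and the matching charset on the client's reader) should make the result independent of how the jar is launched. Starting the jar with `java -Dfile.encoding=UTF-8 -jar ...` is a quicker way to test the hypothesis.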