I am trying to scrape this website
https://www.dailystrength.org/search?query=aspirin&type=discussion
to gain a data-set for a project I have (using aspirin as placeholder search item).
I have decided to use Jsoup to make a crawler. But the problem is that the posts are dynamically brought with Ajax request. The request is made using Show more button
This button causes the problems
When the entire content is shown it should look like this with the text "All Messages Loaded"
import java.io.IOException;
import java.util.ArrayList;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.*;
import org.openqa.selenium.chrome.*;
/**
*
* @author Ahmed
*/
public class Crawler {
public static void main(String args[]) {
Document search_result;
String requested[] = new String[]{"aspirin"/*, "Fentanyl"*/};
ArrayList<Newsfeed_item> threads = new ArrayList();
String query = "https://www.dailystrength.org/search?query=";
try {
for (int i = 0; i < requested.length; i++) {
search_result = Jsoup.connect(query+requested[i]+"&type=discussion").get();
Elements posts = search_result.getElementsByClass("newsfeed__item");
for (Element item : posts) {
Elements link=item.getElementsByClass("newsfeed__btn-container posts__discuss-btn");
Newsfeed_item currentItem=new Newsfeed_item();
currentItem.replysLink=link.attr("abs:href");
Document reply_result=Jsoup.connect(currentItem.replysLink).get();
Elements description = reply_result.getElementsByClass("posts__content");
currentItem.description=description.text();
currentItem.subject=requested[i];
System.out.println(currentItem);
}
}
} catch (IOException ex) {
Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
This code gives me only the few posts that are shown and not the hidden posts. I understood that JSoup can't be used for this issue so I tried to find sources for selenium to show the full content and download it for crawling.
I can't find any sources, and the only code found to try for initial understanding from
https://www.youtube.com/watch?v=g1IbI_qYsDg
Gives me this error
Exception in thread "main" java.lang.IllegalStateException: The path to the driver executable must be set by the webdriver.gecko.driver system property; for more information, see https://github.com/mozilla/geckodriver. The latest version can be downloaded from https://github.com/mozilla/geckodriver/releases
at com.google.common.base.Preconditions.checkState(Preconditions.java:847)
at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:134)
at org.openqa.selenium.firefox.GeckoDriverService.access$100(GeckoDriverService.java:44)
at org.openqa.selenium.firefox.GeckoDriverService$Builder.findDefaultExecutable(GeckoDriverService.java:167)
at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:355)
at org.openqa.selenium.firefox.FirefoxDriver.toExecutor(FirefoxDriver.java:190)
at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:147)
at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:125)
at SeleniumTest.main(SeleniumTest.java:14)
C:\Users\Ahmed\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 0 seconds)
Any help or sample code or alternatives? I just need to get the full page and I use the crawler I have. Or make a completely new crawler but than I can't find code and I run into errors.