http://www.biletix.com/search/TURKIYE/en#!subcat_interval:12/12/15TO19/12/15

I want to get data from this website. When I use jsoup, it can't read the page because the content is generated by JavaScript. Despite all my efforts, I still couldn't manage it.

I only want to get the name and URL of each event. Then I can go to that URL and get the start and end times and the location.

I don't want to use headless browsers. Do you know any alternatives?

1 Answer

Sometimes JavaScript- and JSON-based web pages are easier to scrape than plain HTML ones.

If you carefully inspect the network traffic (for example, with the browser developer tools), you'll see that the page makes a GET request that returns a JSON string with all the data you need. You can parse that JSON with any JSON library.

The URL is:

http://www.biletix.com/solr/en/select/?start=0&rows=100&fq=end%3A[2015-12-12T00%3A00%3A00Z%20TO%202015-12-19T00%3A00%3A00Z%2B1DAY]&sort=vote%20desc,start%20asc&&wt=json

You can generate this URL in much the same way as you generate the URL in your question.

A fragment of the JSON you'll get is:
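
As a rough illustration of that string manipulation, here is a minimal sketch that turns the interval from the question's URL (`12/12/15TO19/12/15`) into the Solr URL above. The class name `SolrUrlBuilder`, the `build` method and the assumption that the page dates are in `dd/MM/yy` format are mine, not something the site documents, so treat it as a starting point only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: the name and structure are illustrative only.
final class SolrUrlBuilder {
    // Assumption: the interval in the page URL uses day/month/two-digit-year.
    private static final DateTimeFormatter PAGE_DATE = DateTimeFormatter.ofPattern("dd/MM/yy");

    /** Turns an interval such as "12/12/15TO19/12/15" into the Solr query URL. */
    static String build(String interval) {
        String[] parts = interval.split("TO");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected <date>TO<date>, got: " + interval);
        }
        LocalDate from = LocalDate.parse(parts[0], PAGE_DATE);
        LocalDate to = LocalDate.parse(parts[1], PAGE_DATE);

        // Range filter on the "end" field; LocalDate.toString() yields yyyy-MM-dd.
        String filter = "end:[" + from + "T00:00:00Z TO " + to + "T00:00:00Z+1DAY]";

        return "http://www.biletix.com/solr/en/select/?start=0&rows=100"
                + "&fq=" + encode(filter)
                + "&sort=" + encode("vote desc") + "," + encode("start asc")
                + "&wt=json";
    }

    private static String encode(String value) {
        try {
            // URLEncoder writes spaces as '+'; the observed URL uses %20 instead.
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

Calling `SolrUrlBuilder.build("12/12/15TO19/12/15")` should give the same query as the URL above, except that the brackets come out percent-encoded (`%5B`, `%5D`) and the doubled `&&` is a single `&`.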

....
 "id":"SZ683",
 "venuecount":"1",
 "category":"ART",
 "start":"2015-12-12T18:30:00Z",
 "subcategory":"tiyatro$ART",
 "name":"The Last Couple to Meet Online",
 "venuecode":"BT",
.....

There you can see the name, and the URL is easily generated from the id field (SZ683), for example: http://www.biletix.com/etkinlik/SZ683/TURKIYE/en
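
A small sketch of that step, assuming the link pattern is always `/etkinlik/<id>/<region>/<language>` as in the example above. The class `EventLinks` and its parameters are hypothetical names for illustration, and I have only seen the `TURKIYE/en` combination.

```java
// Hypothetical helper: builds an event page link from the "id" field of a JSON item.
final class EventLinks {
    private static final String BASE = "http://www.biletix.com/etkinlik/";

    /** Assumes the observed pattern BASE + id + "/" + region + "/" + language. */
    static String eventUrl(String id, String region, String language) {
        if (id == null || id.isEmpty()) {
            throw new IllegalArgumentException("id must not be empty");
        }
        return BASE + id + "/" + region + "/" + language;
    }
}
```

For the fragment above, `EventLinks.eventUrl("SZ683", "TURKIYE", "en")` gives the link shown in the example.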

------- EDIT -------

Getting the JSON data is more difficult than I initially thought. The server requires a cookie in order to return the correct data, so we need:

  • To do a first GET to fetch the cookie, then a second GET to obtain the JSON data. This is easy using Jsoup.
  • To parse the response using org.json.
This is a working example:

// Example only: please DON'T use this in production without error handling and more robust parsing.
// Note that the smallest change on the server will break this code!
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.json.JSONArray;
import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BiletixScraper {
    public static void main(String[] args) throws IOException {
        // Initial GET to retrieve the cookie, which the page sets from an inline script
        Document doc = Jsoup.connect("http://www.biletix.com/").get();
        // Needs error control: assumes the first <script> in <head> is the one that sets the cookie
        String script = doc.head().select("script").get(0).html();

        // Not the most robust way of doing it ...
        Pattern p = Pattern.compile("document\\.cookie\\s*=\\s*'(\\w+)=(.*?);");
        Matcher m = p.matcher(script);
        if (!m.find()) {
            throw new IOException("Cookie assignment not found in the first script");
        }
        String cookieName = m.group(1);
        String cookieValue = m.group(2);

        // I'm supposing the URL is already built.
        // Dropping the last URL parameter (json.wrf=jsonp1450136314484) makes the result easier to parse.
        String url = "http://www.biletix.com/solr/tr/select/?start=0&rows=100&q=subcategory:tiyatro$ART&qt=standard&fq=region:%22ISTANBUL%22&fq=end%3A%5B2015-12-15T00%3A00%3A00Z%20TO%202017-12-15T00%3A00%3A00Z%2B1DAY%5D&sort=start%20asc&&wt=json";

        String bodyText = Jsoup.connect(url)
                .cookie(cookieName, cookieValue) // sending the cookie is what gets us the correct results
                .ignoreContentType(true)         // the response is JSON, not HTML
                .execute()
                .body();

        // Parse the JSON and extract the data
        JSONObject jsonObject = new JSONObject(bodyText);
        JSONArray jsonArray = jsonObject.getJSONObject("response").getJSONArray("docs");
        for (int i = 0; i < jsonArray.length(); i++) {
            JSONObject item = jsonArray.getJSONObject(i);
            System.out.println("name = " + item.getString("name"));
            System.out.println("link = " + "http://www.biletix.com/etkinlik/" + item.getString("id") + "/TURKIYE/en");
            // Similarly, you can fetch more info ...
            System.out.println();
        }
    }
}
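
The cookie extraction is the most fragile part of the example, so here is a minimal sketch of it in isolation with the error control mentioned in the comments. It uses the same regular expression; the class name `CookieExtractor` and the sample script text (including the cookie name `BXID`) are made up for illustration and are not what the site actually sends.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: isolates the cookie extraction so a failed match can be handled.
final class CookieExtractor {
    // Same pattern as in the example: document.cookie = 'NAME=VALUE;...
    private static final Pattern COOKIE =
            Pattern.compile("document\\.cookie\\s*=\\s*'(\\w+)=(.*?);");

    /** Returns the cookie name and value, or an empty Optional if the script sets no cookie. */
    static Optional<Map.Entry<String, String>> extract(String script) {
        Matcher m = COOKIE.matcher(script);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new AbstractMap.SimpleImmutableEntry<>(m.group(1), m.group(2)));
    }

    public static void main(String[] args) {
        // Invented sample input; the real script content will differ.
        String sample = "document.cookie = 'BXID=abc123; path=/';";
        extract(sample).ifPresent(c -> System.out.println(c.getKey() + " -> " + c.getValue()));
    }
}
```

With this, the caller can decide what to do when the page changes and no cookie is found, instead of getting an exception from `m.group(1)`.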

I skipped the URL generation as I suppose you know how to generate it.

I hope the explanation is clear. English isn't my first language, so it is difficult for me to explain myself.

fonkap
  • Thank you. I am using Chrome to inspect, but I could not manage to find that URL. How did you find it? After creating that URL, I should use a JSON reader to extract the data, right? Please help me learn how to create that URL. –  Dec 12 '15 at 21:04
  • Any help? I can't create that link. –  Dec 13 '15 at 13:12
  • When I get some time I'll give you some advice, but it shouldn't be very difficult. The link is very similar to the one you posted; you only need some string manipulation and then URL encoding ... – fonkap Dec 13 '15 at 14:35
  • Should I try something like http://stackoverflow.com/questions/34251707/get-json-string-from-url, adding a cookie header, or can I get your URL just by inspecting the page? –  Dec 13 '15 at 14:37
  • I don't think cookies have anything to do with this. I really don't know where that URL comes from; I only saw it in the network traffic, so surely it was generated by JavaScript. You only need to generate it programmatically. If you look, you can see that this URL has nothing special, only a few parameters and two dates. In fact you could simply use this: `http://www.biletix.com/solr/en/select/?wt=json` (try it in a browser, it will print nice JSON) – fonkap Dec 13 '15 at 16:46
  • I have been looking at the network preview for an hour, but I could not find such a thing. I also need to add some event filters from the left-hand side: region and category. –  Dec 13 '15 at 17:31
  • Try the XHR filter in the Network tab and reload the page; you should see the URL there – fonkap Dec 13 '15 at 18:03
  • Sorry for the late answer, I was taking data from other websites that don't use JS. Thank you sir :D you are a real life saver. I was looking at the JS tab; I never thought to change that part. At last I could get a view like the one you posted. Now, should I use a JSON parser or regex to extract the data? Because for each event, the elements (id, venue etc.) are not in the same position. –  Dec 14 '15 at 23:13
  • Sir, I created this: http://www.biletix.com/solr/tr/select/?start=0&rows=100&q=subcategory:tiyatro$ART&qt=standard&fq=region:%22ISTANBUL%22&fq=end%3A%5B2015-12-15T00%3A00%3A00Z%20TO%202017-12-15T00%3A00%3A00Z%2B1DAY%5D&sort=start%20asc&&wt=json&json.wrf=jsonp1450136314484 But jsoup still cannot connect; I still get the JavaScript error. What am I doing wrong? –  Dec 14 '15 at 23:46
  • I also tried http://stackoverflow.com/a/4308662/5669287 with a JSON library, but when I connect with those methods the response is the same: "enable JavaScript" :( It seems I can't get the data? –  Dec 15 '15 at 00:12
  • You are right, getting the data is not as easy as I thought. I've updated my answer with more info and an example. – fonkap Dec 15 '15 at 16:37