I want to develop an app that displays the program schedule of a specific channel, taken from its website. I don't own the website. Are there techniques for retrieving specific data from a page, in my case the name of each program and its broadcast time? The website does not have an RSS feed either. Any ideas, please? Thank you very much.

androniennn
  • An idea other than parsing the HTML? – Nir Alfasi Jun 14 '12 at 19:22
  • I assume you are talking about [web scraping](http://en.wikipedia.org/wiki/Web_scraping)? A visit to the [jsoup](http://www.jsoup.org) homepage might help. – Jens Jun 14 '12 at 19:23
  • If there's no feed/XML for you to read, the only option left that I know of is downloading the HTML from the webpage, looking through it for specific elements/identifiers, and scraping the data you need directly from that. I've only done it with a phpBB forum crawling bot I made years ago, but it's neither pretty nor adaptable to many pages, and if those identifiers don't exist, you're completely out of luck. – Cruceo Jun 14 '12 at 19:25

2 Answers

Do you own the website? If not, you need to scrape the website for its data, and what you then do with that data may be subject to legal issues.

Scraping data is basically just ingesting the HTML and parsing out the fields in the page that contain the information you want. It can be fairly simple if the website is well structured. Perhaps you could use jsoup.
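To make "parsing out the fields" concrete, here is a minimal sketch in plain Java. The markup it expects (`li class="program"` with `time` and `name` spans) and the `ScheduleScraper` and `Entry` names are assumptions made up for illustration; the real page will look different. It uses a regular expression only so that it runs without extra jars. A real HTML parser such as jsoup would replace the pattern with a selector query and cope far better with markup changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScheduleScraper {

    /** One schedule row: broadcast time plus program name. */
    public static final class Entry {
        public final String time;
        public final String name;

        Entry(String time, String name) {
            this.time = time;
            this.name = name;
        }
    }

    // Hypothetical markup, one row per program:
    // <li class="program"><span class="time">19:00</span><span class="name">News</span></li>
    private static final Pattern ROW = Pattern.compile(
            "<li class=\"program\">\\s*"
            + "<span class=\"time\">([^<]*)</span>\\s*"
            + "<span class=\"name\">([^<]*)</span>\\s*</li>");

    /** Extracts every (time, name) pair that matches the assumed row markup. */
    public static List<Entry> parse(String html) {
        List<Entry> entries = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            entries.add(new Entry(m.group(1).trim(), m.group(2).trim()));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Inline sample standing in for the downloaded page.
        String html = "<ul>"
                + "<li class=\"program\"><span class=\"time\">19:00</span>"
                + "<span class=\"name\">Evening News</span></li>"
                + "<li class=\"program\"><span class=\"time\">20:30</span>"
                + "<span class=\"name\">Documentary</span></li>"
                + "</ul>";
        for (Entry e : parse(html)) {
            System.out.println(e.time + " " + e.name);
        }
    }
}
```

If the site changes its markup, `parse` simply returns an empty list, so an app built on this should treat "no entries" as a possible sign that the page layout changed.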

See this thread for more details

Matt Wolfe
  • So, that may not be legit? And if the information changes on the website, will it change in the app too? I've looked at jsoup and I guess it's not very complicated :/. I hope... – androniennn Jun 14 '12 at 19:27
  • Most websites don't care, but some may. Your app won't change until you've parsed the latest data. Most likely you'd want to do your parsing on a server somewhere, and have your mobile app connect to your server to get the latest data. This way you don't have all your Android users connecting to this webpage and scraping it. – Matt Wolfe Jun 14 '12 at 19:28
  • I actually like the idea of asking the site owner whether they publish a public API for their data, as libjup mentioned. This would be the most effective way, but they may not offer one. If they do offer an API, you may not need to run a server, as long as they are fine with the load all your users could add to their server. – Matt Wolfe Jun 14 '12 at 19:31
  • And just one other question: why the need to store the scraped data on a server (I don't have one :()? Does scraping take that much time? – androniennn Jun 14 '12 at 19:41
  • If you don't have a server, you'd need each of your clients to do it. The reason to do it from a server is that it requires far fewer resources, both for the server you are scraping and for your end users, since you could publish the data to them in a much simpler format. In the meantime you could just have each client scrape the data and see how well it works. – Matt Wolfe Jun 14 '12 at 19:48
  • Okay, I'll begin the project; hopefully I'll succeed in scraping the page. I'll start with jsoup.org (I guess I have to copy some jar libraries into the Android project), then I'll test whether it needs too many resources. – androniennn Jun 14 '12 at 19:55

You could check whether the site offers an API. If it does, you can usually connect to a REST service, which you access via a GET or POST request. The response is usually XML or JSON.
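As a rough sketch of the GET case, the following uses only `java.net.HttpURLConnection`. The `ScheduleApiClient` class name and the JSON `Accept` header are assumptions for illustration, and no particular endpoint is implied: the URL is taken from the command line, since whether the channel offers an API at all is unknown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ScheduleApiClient {

    /** Reads a whole response body as UTF-8 text. */
    public static String readBody(InputStream in) throws IOException {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                body.append(buf, 0, n);
            }
        }
        return body.toString();
    }

    /** Sends a GET request and returns the raw response body. */
    public static String get(String endpoint) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        try {
            conn.setRequestMethod("GET");
            // Assumption: the service can answer with JSON.
            conn.setRequestProperty("Accept", "application/json");
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            return readBody(conn.getInputStream());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // The endpoint is supplied by the caller; none is assumed here.
        if (args.length != 1) {
            System.out.println("usage: java ScheduleApiClient <endpoint-url>");
            return;
        }
        System.out.println(get(args[0]));
    }
}
```

On Android, a call like `get` has to run off the main thread, and the returned string still needs to be parsed as JSON or XML depending on what the service sends.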

Alternatively, if they do not provide an API, you can parse the HTML data manually. I would not recommend that, though, since most websites forbid it and it will stop working as soon as the page's elements change.

libjup