
I'm building a BitTorrent tracker/website similar to The Pirate Bay, Kickass.to, etc. I need to retrieve torrent statistics (seeders, leechers) for both the index and the torrent pages. Example:

http://kat.cr/ubuntu-15-04-vivid-vervet-desktop-amd64-iso-final-t10550003.html
Seeders: 3442 Leechers: 148

If a torrent uses my tracker, it's easy to retrieve the data for both pages quickly. However, if a torrent uses a different tracker, I need to scrape its statistics from that tracker (by making requests to it), which usually takes a few seconds per torrent; obviously, I can't make users wait that long to see the listing.

I made a script that scrapes the latest 90 torrents in the background, but I'm afraid that's not enough. The website will grow, and the total number of torrents will probably exceed 5,000. I don't think scraping that many torrents in the background will work.

How can I do this?

Encombe
Jesús León

2 Answers


When open trackers used HTTP, you could usually do a full scrape following the Tracker 'scrape' Convention.
Now that trackers use UDP instead, a full scrape is no longer possible.
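As a sketch of that convention (assuming the tracker supports it): the scrape URL is derived from the announce URL by replacing `announce` in the last path segment with `scrape`, and the raw 20-byte info hash is percent-encoded into the query string. The tracker names below are placeholders:

```python
from urllib.parse import quote

def announce_to_scrape(announce_url: str) -> str:
    """Per the convention: if the last path segment begins with 'announce',
    replacing that text with 'scrape' gives the scrape URL; otherwise the
    tracker does not support the convention."""
    head, _, tail = announce_url.rpartition('/')
    if not tail.startswith('announce'):
        raise ValueError('tracker does not support the scrape convention')
    return head + '/scrape' + tail[len('announce'):]

def scrape_url(announce_url: str, info_hash: bytes) -> str:
    # info_hash is the raw 20-byte SHA-1 of the torrent's info dict,
    # percent-encoded for the query string
    return announce_to_scrape(announce_url) + '?info_hash=' + quote(info_hash)

# e.g. scrape_url('http://tracker.example.org/announce', some_20_byte_hash)
```

Fetching that URL returns a bencoded dictionary with `complete` (seeders), `incomplete` (leechers) and `downloaded` counts per info hash.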

As an alternative, some open trackers publish full scrapes on their websites:

Other trackers may or may not give access to such a file upon request.

Encombe
  • Yes, scraping is not an issue. However, scraping tons of torrents to get the data is. – Jesús León Jul 05 '15 at 19:37
  • With a full scrape or downloading above links you get the scrape info for all the torrents currently announced to the tracker in one go. Then you only need to extract the info for the torrents you want. – Encombe Jul 05 '15 at 19:50

The following strategies to obtain statistics are available, listed in descending order of efficiency:

  1. Full scrape via the scrape interface - used to be common, less so today on large trackers due to the traffic it causes.
  2. Full scrape via custom export URLs - you'll have to ask the tracker admins; sometimes it's documented on their websites.
  3. UDP multi-scrape.
  4. HTTP multi-scrape via /scrape?info_hash=A&info_hash=B&info_hash=C - some trackers support it, some don't.
  5. HTTP single-scrape.
  6. DHT scrape.
  7. Joining the swarm and measuring via PEX.
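For options 3-5, the response maps each info hash to its stats in a bencoded dictionary. A minimal sketch of parsing one, using a hand-rolled bencode decoder and a hard-coded sample response (no network involved, and the info hash is made up):

```python
def bdecode(data: bytes, i: int = 0):
    """Minimal bencode decoder; returns (value, next_index)."""
    c = data[i:i + 1]
    if c == b'i':                       # integer: i<digits>e
        j = data.index(b'e', i)
        return int(data[i + 1:j]), j + 1
    if c in (b'l', b'd'):               # list / dict
        items, i = [], i + 1
        while data[i:i + 1] != b'e':
            value, i = bdecode(data, i)
            items.append(value)
        if c == b'd':
            return dict(zip(items[::2], items[1::2])), i + 1
        return items, i + 1
    j = data.index(b':', i)             # byte string: <length>:<bytes>
    n = int(data[i:j])
    return data[j + 1:j + 1 + n], j + 1 + n

# Hard-coded sample scrape response for one hypothetical info hash:
info_hash = b'\x01' * 20
sample = (b'd5:filesd20:' + info_hash +
          b'd8:completei5e10:downloadedi50e10:incompletei10eeee')

top, _ = bdecode(sample)
stats = top[b'files'][info_hash]
seeders, leechers = stats[b'complete'], stats[b'incomplete']
```

The `files` dictionary holds one entry per info hash you asked about, so a multi-scrape response is parsed exactly the same way, just with more keys.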
the8472
  • Interesting. Would you mind explaining more about #6 and #7? – Jesús León Jul 06 '15 at 09:49
  • They are at the end of the list for a reason, just there for completeness; they're unlikely to be useful for your case, but I added links. – the8472 Jul 06 '15 at 10:09
  • Hmm. I think it cannot be done, at least not easily, so I'll just recommend users to use the tracker. – Jesús León Jul 06 '15 at 10:26
  • You mention ~5000 torrents as your order of magnitude. That seems fairly manageable to me. You just shouldn't build the statistics on demand; you can fetch them in the background. But personally I wouldn't consider PHP the right tool for the job. – the8472 Jul 06 '15 at 10:30
  • I made a script that fetches them in the background. However, 5000 was just an example; I don't know how much it'll grow. So what you're recommending is: just scrape them in the background? – Jesús León Jul 06 '15 at 12:46
  • Scrape them in the background with the most efficient options available per tracker, and apply some heuristics for which ones to scrape more often and which less often. Users probably want "fresh" stats on new torrents, while day-old statistics might be good enough for older ones. – the8472 Jul 06 '15 at 13:17
  • That seems like the best choice. Got it! – Jesús León Jul 06 '15 at 13:34
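The scheduling idea from the comments above could be sketched like this; the exact thresholds are hypothetical, not anything the answer prescribes:

```python
from datetime import datetime, timedelta, timezone

def refresh_interval(added: datetime, now: datetime) -> timedelta:
    """Hypothetical heuristic: re-scrape a torrent after roughly 10% of
    its age has passed, clamped between 15 minutes and 24 hours, so new
    torrents get fresh stats while old ones are scraped rarely."""
    age = now - added
    return max(timedelta(minutes=15), min(age / 10, timedelta(hours=24)))
```

Under these made-up numbers, a torrent added an hour ago would be re-scraped every 15 minutes, while a month-old one only once a day.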