1

I'm trying to parse the injected data of the torrents' list on movies.io (for example, here).

I need to parse the whole array of torrents into an array of hashes (the injected data already has this structure) so that I can use it easily, but I can't figure out how to do this. I can delete the &quot; and & with gsub!, but that's all I've got for now.

The data I end up with looks like this:

  [
    {id: 18210, sha1: 13BB6A6F65EA6203ACE218E830395AE61427EDBD, name: Star Wars Episode IV A New   Hope 1977 1080p Bluray x264 anoXmous},
    {id: 3701, sha1: D3F3C5C237299B2B9F4EC84B7F46F6E9E0424574, name: Star Wars Episode IV A New Hope 1977 720p BRRiP XViD AC3 - IMAGi}
  ]
Simon
  • I tried adding those directly to an array (a = [], then appending the data) or to a hash (I could work with several hashes individually), but I get errors like this: SyntaxError: (irb):2: syntax error, unexpected tCONSTANT, expecting '}' ...5EA6203ACE218E830395AE61427EDBD, name: Star Wars Episode IV ... ... ^ (irb):2: syntax error, unexpected tINTEGER, expecting $end ...isode IV A New Hope 1977 1080p Bluray x264 anoXmous} ... ^ from /Users/****/.rvm/rubies/ruby-1.9.3-p0/bin/irb:16:in `' – Simon Jul 15 '12 at 21:03
  • Well you have unquoted strings there. The "name" key's value in particular. – Andrew Marshall Jul 15 '12 at 21:08
  • That's the raw data I get (after gsub). Do you see a simple way to quote the values? – Simon Jul 15 '12 at 21:18
  • No. You should provide the code you're using to parse, since it's where the problem is. – Andrew Marshall Jul 15 '12 at 21:42

2 Answers

4

We also have a proper API endpoint for sources such as torrents, netflix, etc.

For example, http://movies.io/m/1R/sources.json

We're working on a real API with documentation, but it's not ready yet!
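As a rough sketch of how the endpoint above might be consumed from Ruby: the helper names `parse_sources` and `fetch_sources` are made up for illustration, and since the response format is not documented here, the code only decodes whatever JSON comes back without assuming any particular keys. Redirects are not followed.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical helper: decode a JSON response body into Ruby arrays/hashes.
def parse_sources(body)
  JSON.parse(body)
end

# Hypothetical helper: download the endpoint and decode the body.
def fetch_sources(url)
  response = Net::HTTP.get_response(URI(url))
  response.value # raises an exception unless the response status is 2xx
  parse_sources(response.body)
end

# Usage (performs a network request, so it is left commented out):
#   sources = fetch_sources('http://movies.io/m/1R/sources.json')
#   p sources
```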

fasterthanlime
1

So what's happening is: the data-injected attribute you are scraping is in fact just JSON, but it's encoded in HTML. After the browser parses it, it's in the DOM as ordinary JSON.

In fact, you can easily see how it's handled by opening the Scripts panel in Chrome's developer tools and clicking Pretty Print to keep your sanity. You will see the page assign the attribute to f and later use it in f ? u($.parseJSON(f)) : ....

Since you are presumably using an HTML parser, you probably have the real original JSON there somewhere. In any case, some component in your system needs to stop stripping out the HTML entities that originally supplied the quotes; once the quotes are preserved, you can just feed the string to a JSON parser.
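A minimal sketch of that idea, if you only have the raw attribute text: decode the entities rather than deleting them, then parse. The `raw` string is a hypothetical sample modelled on the question's data; the real attribute may contain more fields, and the quoted key names are an assumption.

```ruby
require 'cgi'
require 'json'

# Hypothetical sample of the attribute value as it appears in the page
# source, with the quotes still HTML-encoded as &quot;.
raw = '[{&quot;id&quot;: 18210, ' \
      '&quot;sha1&quot;: &quot;13BB6A6F65EA6203ACE218E830395AE61427EDBD&quot;, ' \
      '&quot;name&quot;: &quot;Star Wars Episode IV A New Hope 1977 1080p Bluray x264 anoXmous&quot;}]'

# Decode the entities instead of removing them, so the quotes survive.
json = CGI.unescapeHTML(raw)

# Parse into an array of hashes; symbolize_names yields :id, :sha1, :name keys.
torrents = JSON.parse(json, symbolize_names: true)

torrents.each do |t|
  puts "#{t[:id]} #{t[:sha1]} #{t[:name]}"
end
```

If you read the attribute through an HTML parser instead, the value it returns should already be entity-decoded, in which case the `CGI.unescapeHTML` step can be dropped.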

DigitalRoss