Getting Actor IDs and biographies from the data dumps or Freebase API

Question

Does anyone know the best way of getting Actor Ids from Freebase data dumps, and later on getting the IMDB ids and biographies from the Freebase API?

What have you tried so far? Why get one set of IDs from the dump and the other from the API? – Tom Morris, Jul 10 '13 at 12:57
@Tom Morris I need to update many actor records in my DB, which is why I thought of using the data dumps. I need each actor's Freebase ID, IMDb ID, Wikipedia ID, biography and image. Later on I will need to update the records on a regular basis, which is why I thought of the API as well. Can you please guide me? – Gidi, Jul 11 '13 at 10:17

Tom Morris · Answer 1 · 2013-07-15T18:20:15.620

Actors will have the type /film/actor and look like this in the dump:

ns:m.010q36     rdf:type        ns:film.actor.

You can find them all in a few minutes from the compressed dump with a simple grep:

zgrep $'rdf:type\tns:film.actor.' freebase-rdf-<date of dump>.gz | cut -f 1 | cut -d ':' -f 2 > actor-mids.txt

This will generate a list of MIDs in the form m.010q36, which represents the MID /m/010q36.

Using the list of MIDs, look for all lines that have one of those MIDs in the first column and one of your desired properties in the second. You could do this using Python, grep, or the tool/language of your choice. Of course, if you're using a programming language like Python, you could roll the initial search into the same program.
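
As a rough illustration of that second pass, here is a minimal Python sketch. It assumes the dump is tab-separated with the subject in the first column and the property in the second, as in the sample lines in this answer, and that `actor-mids.txt` holds one MID per line. The function names and the `WANTED_PREDICATES` set are made up for this example; adjust the set to the properties you actually need.

```python
import gzip

# Assumption: the properties of interest. Only the key property from this
# answer is listed here; add others as needed.
WANTED_PREDICATES = {"ns:type.object.key"}


def parse_triple(line):
    """Split one tab-separated dump line into (subject, predicate, object)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        return None
    subject, predicate, obj = parts[0], parts[1], parts[2]
    # In the sample lines the object carries the statement's trailing period.
    if obj.endswith("."):
        obj = obj[:-1]
    return subject, predicate, obj


def filter_triples(lines, mids, predicates):
    """Yield (mid, predicate, object) for lines whose subject is a wanted MID."""
    for line in lines:
        triple = parse_triple(line)
        if triple is None:
            continue
        subject, predicate, obj = triple
        mid = subject.split(":", 1)[-1]  # "ns:m.010q36" -> "m.010q36"
        if mid in mids and predicate in predicates:
            yield mid, predicate, obj


def load_mids(path):
    """Read one MID per line into a set for fast membership tests."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def scan_dump(dump_path, mids_path, predicates=WANTED_PREDICATES):
    """Stream the compressed dump once without unpacking it to disk."""
    mids = load_mids(mids_path)
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        yield from filter_triples(dump, mids, predicates)
```

Keeping the MIDs in a set means each dump line costs one hash lookup, so the whole file can be handled in a single streaming pass.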

Wikipedia and IMDB IDs are stored as what Freebase calls keys and look like this (MusicBrainz & Netflix included too):

ns:m.010q36     ns:type.object.key      "/wikipedia/en/Mr$002ERodgers".
ns:m.010q36     ns:type.object.key      "/authority/imdb/name/nm0736872".
ns:m.010q36     ns:type.object.key      "/authority/musicbrainz/87467525-3724-412d-ad3e-595ecb6a3bfd".
ns:m.010q36     ns:type.object.key      "/authority/netflix/role/30006685".

Keys may be encoded (like the Wikipedia key above). You can find documentation on the Freebase wiki on how to deal with them.
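
As a rough illustration of decoding such keys, here is a minimal Python sketch. It assumes each escape is a `$` followed by exactly four hexadecimal digits giving the character's code point, which matches the `$002E` (a period) in the Wikipedia key above. `decode_key` is a name made up for this example, and the wiki documentation remains the authority on the exact rules.

```python
import re

# Assumption: an escape is "$" plus four hex digits, e.g. "$002E" for ".".
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def decode_key(key):
    """Replace each $XXXX escape with the character it encodes."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), key)


print(decode_key("/wikipedia/en/Mr$002ERodgers"))
```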

edited Jul 15 '13 at 18:20

answered Jul 11 '13 at 15:19


I tried using Cygwin with your zgrep command, but the result was empty. I would prefer to use the API, but it is limited to 100,000 queries per day. I am getting really frustrated. What I have is millions of actors' IMDb IDs that I would like to get info on (Freebase ID, Wikipedia ID, the actor's biography and the actor's image). How can I do that, please? – Gidi Jul 15 '13 at 16:07
That command was cut from a Cygwin window where I tested it, so it should work. The API isn't really intended for bulk downloads. If you've got actor IDs to start with, I would search that way. Something like `zgrep "/authority/imdb/name" freebase-rdf-2013-06-30-00-00.gz | cut -f 1,3` will get you a list of MIDs and their corresponding IMDB IDs. – Tom Morris Jul 15 '13 at 18:41
Assuming my gz file is in D:\work, this is what I tried, and I got a "No such file or directory" error: `zgrep "/authority/imdb/name" /d/work/freebase-rdf-2013-04-07-00-00.gz | cut -f 1,3 | cut -d ':' -f 2 > actor-mids.txt` and also this: `zgrep $'rdf:type\tns:film.actor.' /d/work/freebase-rdf-2013-04-07-00-00.gz | cut -f 1 | cut -d ':' -f 2 > actor-mids.txt` – Gidi Jul 16 '13 at 11:40
I'd use `cd` to switch to the directory containing the file first to make sure you're in the right place. On my system the drive root would be d: (which Cygwin accepts as a special case) or /cygdrive/d – Tom Morris Jul 16 '13 at 13:48
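
For the case raised in the comments above, where the starting point is a list of IMDb IDs rather than MIDs, one pass over the dump can build an IMDb ID to MID mapping. The sketch below is only an illustration: it assumes the tab-separated layout and the `/authority/imdb/name/` key prefix shown in the answer, and the names `imdb_to_mid` and `imdb_to_mid_from_dump` are made up for this example.

```python
import gzip

KEY_PREDICATE = "ns:type.object.key"
IMDB_PREFIX = "/authority/imdb/name/"


def imdb_to_mid(lines, wanted_imdb_ids):
    """Map each wanted IMDb ID to the MID that carries it as a key."""
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3 or parts[1] != KEY_PREDICATE:
            continue
        value = parts[2]
        # Drop the statement's trailing period, then the surrounding quotes.
        if value.endswith("."):
            value = value[:-1]
        value = value.strip('"')
        if not value.startswith(IMDB_PREFIX):
            continue
        imdb_id = value[len(IMDB_PREFIX):]
        if imdb_id in wanted_imdb_ids:
            mapping[imdb_id] = parts[0].split(":", 1)[-1]  # drop the "ns:" prefix
    return mapping


def imdb_to_mid_from_dump(dump_path, wanted_imdb_ids):
    """Run the same scan directly over the compressed dump file."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
        return imdb_to_mid(dump, wanted_imdb_ids)
```

The resulting MIDs can then drive a second scan for the remaining properties, which avoids the API's daily query limit for the bulk load.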
