
I need a .txt file containing the title of every topic/item, with each title on its own line.

How can I make such a file, given that I have already downloaded a Freebase RDF dump?

If possible, I also need a separate text file with each topic's/item's description, each description on its own line.

How can I do that?

I would greatly appreciate it if someone could help me make either of these files from a Freebase rdf dump.

Thanks in Advance!

Django Johnson

1 Answer

Filter the RDF dump on the predicate/property ns:type.object.name. If you only want a particular language, also filter by that language tag, e.g. @en.
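
For instance, here is a minimal sketch of that filter (mine, not part of the original answer). It assumes the dump's layout of one tab-separated triple per line, with name literals ending in @en. as the regex further down also relies on; the output file name freebase-titles.txt is just a placeholder:

# Minimal sketch: keep English name triples, drop the subject and
# predicate columns, and strip the literal quoting, leaving one
# title per line.
zegrep $'\tns:type\\.object\\.name\t.*@en\\.$' freebase-rdf-2013-06-30-00-00.gz \
  | cut -f3 \
  | sed -e 's/^"//' -e 's/"@en\.$//' \
  > freebase-titles.txt

The same pipeline with ns:common.topic.description in place of ns:type.object.name produces the descriptions file, assuming the dump keeps one triple per line (newlines inside literals escaped as \n), as the regexes in this thread do.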

EDIT: I missed the second part asking for descriptions as well. Here's a three-part regex which will get you all the lines with:

  1. English names
  2. English descriptions
  3. a type of /common/topic

Combining the three is left as an exercise for the reader (one possible approach is sketched after the command below).

zegrep $'\tns:(((type\\.object\\.name|common\\.topic\\.description)\t.*@en)|type\\.object\\.type\tns:common\\.topic)\\.$' freebase-rdf-2013-06-30-00-00.gz | gzip > freebase-rdf-2013-06-30-00-00-names-descriptions.gz
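
As one hedged take on that combining exercise (my sketch, not part of the answer): a two-pass awk join over the filtered file, which avoids assuming that a subject's type triple appears before its name triple in the dump. The intermediate and output file names are placeholders.

# Pass 1: collect the MIDs typed ns:common.topic (the only type triples
# the filter above keeps are the ns:common.topic ones).
zcat freebase-rdf-2013-06-30-00-00-names-descriptions.gz \
  | awk -F'\t' '$2 == "ns:type.object.type" { print $1 }' \
  | sort -u > topic-mids.txt

# Pass 2: keep only the name triples whose subject is a known topic,
# then strip the literal quoting, leaving one title per line.
zcat freebase-rdf-2013-06-30-00-00-names-descriptions.gz \
  | awk -F'\t' 'NR == FNR { topic[$1] = 1; next }
                $2 == "ns:type.object.name" && ($1 in topic) { print $3 }' \
        topic-mids.txt - \
  | sed -e 's/^"//' -e 's/"@en\.$//' \
  > topic-titles.txt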

It seems to have a performance issue, though. A simple grep of the entire file takes ~11 minutes on my laptop, but this has been running for several times that long. I'll have to look into it later...

Tom Morris
  • I did something like this with grep, but I got a lot of non-topics in the output file. How would you filter by ns:type.object.name and English-only topics? Could you please show me an example using grep or anything else that filters and puts one title per line in a text file? I'd greatly appreciate that! – Django Johnson Aug 16 '13 at 16:07
  • @DjangoJohnson If you're getting RDF, it's really a much better idea to work with it as RDF and to query it using SPARQL, rather than something that's specific to the serialization format (e.g., [trying to process RDF/XML with XPath](http://stackoverflow.com/a/17052385/1281433)). – Joshua Taylor Aug 16 '13 at 17:11
  • @JoshuaTaylor Can you recommend a streaming SPARQL query tool which will process this 20GB gzipped file (1.9 billion triples) in a few minutes on a laptop? grep can do that. – Tom Morris Aug 16 '13 at 20:36
  • I wish that I could :). Current triple stores should be able to *hold* that much without much problem, and to query fast enough, but that doesn't say anything about streaming support. In that case, a solution like `grep` *is* probably much better. As a hybrid, @TomMorris, I'd suggest using `grep` to extract just the lines of interest and loading that into an RDF model to query with SPARQL. The Turtle format won't guarantee that this is possible (since triple patterns can span multiple lines), whereas N-Triples would, but the sample data makes it _look_ like the dump uses one triple per line. – Joshua Taylor Aug 16 '13 at 20:56
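
To make the hybrid idea in that last comment concrete, here is a hedged sketch (mine, not from the thread). It assumes Apache Jena's command-line sparql tool is installed and on the PATH, and that the grepped excerpt gets the ns: prefix declaration prepended before parsing; the dump is not guaranteed to be strictly valid Turtle, so a parser may still reject some lines.

# Hypothetical hybrid: load the grep-filtered subset into an RDF model
# and query it with SPARQL via Apache Jena's command-line tools.
{ echo '@prefix ns: <http://rdf.freebase.com/ns/> .'
  zcat freebase-rdf-2013-06-30-00-00-names-descriptions.gz
} > subset.ttl

# English names of entities typed ns:common.topic.
cat > topic-names.rq <<'EOF'
PREFIX ns: <http://rdf.freebase.com/ns/>
SELECT ?name WHERE {
  ?topic ns:type.object.type ns:common.topic .
  ?topic ns:type.object.name ?name .
  FILTER(lang(?name) = "en")
}
EOF

sparql --data=subset.ttl --query=topic-names.rq

Whether this beats plain grep depends on how small the grepped subset is; as the thread notes, streaming the full 1.9-billion-triple dump through a SPARQL engine on a laptop is not realistic.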