
I want to use the Web-scale Parallel Inference Engine (WebPIE) reasoner on top of Hadoop. I have already set up a Hadoop cluster with two Ubuntu virtual machines and it is working well. When I try to use WebPIE to reason over RDF files, the process fails because the input must be in Sequence File format. The WebPIE tutorial mentions nothing about the Sequence File format as a prerequisite for reasoning in Hadoop. To produce the Sequence File format I wrote the following code:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public static void main(String[] args) {

    FileInputStream fis = null;
    SequenceFile.Writer swriter = null;
    try {

        Configuration conf = new Configuration();

        File outputDirectory = new File("output");
        File inputDirectory = new File("input");
        File[] files = inputDirectory.listFiles();

        for (File inputFile : files) {

            //Input
            fis = new FileInputStream(inputFile);

            byte[] content = new byte[(int) inputFile.length()];
            fis.read(content);

            Text key = new Text(inputFile.getName());
            BytesWritable value = new BytesWritable(content);

            //Output
            Path outputPath = new Path(outputDirectory.getAbsolutePath()+"/"+inputFile.getName());

            FileSystem hdfs = outputPath.getFileSystem(conf);

            FSDataOutputStream dos = hdfs.create(outputPath);

            swriter = SequenceFile.createWriter(conf, dos, Text.class,
                    BytesWritable.class, SequenceFile.CompressionType.BLOCK, new DefaultCodec());

            swriter.append(key, value);

        }

        fis.close();
        swriter.close();

    } catch (IOException e) {

        System.out.println(e.getMessage());
    }

}

This code produces a correct Sequence File for some RDF files, but it doesn't work 100% of the time and sometimes produces corrupted files. Is there a way to avoid writing this code in the first place, and if not, how can I improve it so that it works correctly with any RDF file as input?
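
For reference, here is a minimal sketch of the same conversion that opens and closes one SequenceFile.Writer per input file, so every output file is flushed and finalized before the next one is started (leaving writers unclosed is a common cause of truncated or corrupted SequenceFiles). It uses the same old-style Hadoop SequenceFile.createWriter API as the code above and assumes Java 7+ for Files.readAllBytes; the class name RdfToSequenceFile is only illustrative, and, as the answers below explain, WebPIE's own import job may make this conversion unnecessary:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class RdfToSequenceFile {  // illustrative name, not part of WebPIE

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        File inputDirectory = new File("input");
        File outputDirectory = new File("output");

        for (File inputFile : inputDirectory.listFiles()) {
            // Read the whole file; a single InputStream.read() call may
            // return fewer bytes than the file actually contains.
            byte[] content = Files.readAllBytes(inputFile.toPath());

            Path outputPath = new Path(outputDirectory.getAbsolutePath()
                    + "/" + inputFile.getName());
            FileSystem fs = outputPath.getFileSystem(conf);

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, outputPath, Text.class, BytesWritable.class,
                    SequenceFile.CompressionType.BLOCK, new DefaultCodec());
            try {
                writer.append(new Text(inputFile.getName()),
                        new BytesWritable(content));
            } finally {
                // Close inside the loop: each SequenceFile gets its header,
                // sync markers, and final block written out.
                writer.close();
            }
        }
    }
}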

  • Can you tell us any more about the error that you're encountering? As you say, the WebPIE tutorial doesn't mention Sequence Files. Can you do everything described in the tutorial as it is written? Do you run into any problems with the tutorial? The Hadoop wiki does talk about [sequence files](http://wiki.apache.org/hadoop/SequenceFile), and might be a useful resource. – Joshua Taylor Jun 23 '13 at 03:53

2 Answers


The tutorial is based on running WebPIE on Amazon EC2, so there may be some differences in configuration. However, note that, according to the tutorial, the inputs are not plain RDF files but “gzipped compressed files of triples in the N-Triples format” (emphasis in original):

Before we launch the reasoner, we need to upload the input data to the HDFS filesystem and compress it in a suitable format. The input data must consist of gzipped compressed files of triples in N-Triples format. Try to keep files to similar sizes and have more files than cpu cores since each file will be processed by one machine.
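
In other words, rather than hand-building SequenceFiles, the plain N-Triples files just need to be gzipped before they are uploaded. As a rough sketch only (the directory names triples and triples-gz and the class name GzipTriples are made up, and this is plain java.util.zip usage, nothing specific to WebPIE), compressing a directory of .nt files could look like this:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTriples {  // illustrative name

    public static void main(String[] args) throws IOException {
        File inputDirectory = new File("triples");      // plain .nt files (made-up path)
        File outputDirectory = new File("triples-gz");  // gzipped output  (made-up path)
        outputDirectory.mkdirs();

        for (File ntFile : inputDirectory.listFiles()) {
            File gzFile = new File(outputDirectory, ntFile.getName() + ".gz");
            // Stream-copy each N-Triples file through a GZIPOutputStream.
            try (InputStream in = new FileInputStream(ntFile);
                 OutputStream out = new GZIPOutputStream(new FileOutputStream(gzFile))) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            }
        }
    }
}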

The second section of that tutorial, “2nd step: upload input data on the cluster”, describes how to actually get the data into the system, and it looks like it should apply to your own Hadoop installation as well as to Amazon EC2. I don't want to simply quote that section in its entirety here, but the sequence of commands they give is:

$ ./cmd-hadoop-cluster login webpie
$ hadoop fs -ls /
$ hadoop fs -mkdir /input
$ ./cmd-hadoop-cluster push webpie input_triples.tar.gz

That only gets the data into HDFS, though. In “3rd step: compress the input data”:

The reasoner works with the data in a compressed format. We compress the data with the command:

hadoop jar webpie.jar jobs.FilesImportTriples /input /tmp /pool --maptasks 4 --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000

… The above command can be read as: launch the compression and split the job between 4 map tasks and 2 reduce tasks, sample the input using a 10% of the data and mark as popular all the resources which appear more than 1000 times in this sample.

After this job is finished, we have in the directory /pool the compressed input data and we can proceed to the reasoning.

The remaining sections discuss reasoning, getting the data back out, and so on, which shouldn't be a problem once you've got the data in, I expect.

Joshua Taylor

The input data must consist of gzip-compressed files of triples in N-Triples format, for example triplePart1.gz, triplePart2.gz, and so on. So we start from input_triples.tar.gz, which contains the compressed N-Triples files (triplePart1.gz, triplePart2.gz, ...).

  1. Uncompress the tar file and copy the contents to HDFS

    ---/hadoop$ tar zxvf /tmp/input_triples.tar.gz /tmp/input_triples .

    ---/hadoop$ bin/hadoop fs -copyFromLocal /tmp/input-files /input .

  2. Compress the input data

    ---/hadoop$ bin/hadoop jar webpie.jar jobs.FilesImportTriples /input /tmp /pool --maptasks 4 --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000

  3. Reasoning

    ---/hadoop$ bin/hadoop jar webpie.jar jobs.Reasoner /pool --fragment owl --rulesStrategy fixed --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000

to be continued here :-)

A.KAMMOUN
  • if this is not the complete solution, please edit your answer and add more details until it's useful without your link :-) – kleopatra Sep 16 '13 at 15:52
  • I have mentioned the solution; in fact, you just have to compress the N-Triples files in **.gz** format. The problem with the sequence file appears when we compress the input data, so for the rest you just have to follow the tutorial. – A.KAMMOUN Sep 17 '13 at 11:17