What is the best way to find common elements of multiple text files with Java?

Question

I have a program that creates multiple text files of RDF triples. I need to compare the triples, and do it fast. What is the best way to do this? I thought of putting the triples into an array and comparing them, but there could be hundreds of thousands of triples per file, and that would take forever. I need it to be as close to real time as possible, since the triples will be generated constantly across the files. Any help would be great. The files are also in AllegroGraph repositories, if it's easier to compare them there somehow.

A thought: if I stored the triples in Excel (one triple per row, one sheet per repository):

A: How could I find the duplicates among the sheets? B: Would it be fast? C: How could I automate that from Java?

How do you need to compare them? Are you looking for a complete set of all triples, or do you want to find duplicates? — Keppil, Jun 28 '12 at 14:00
I want to find the triples that are common to 2 or more files and perhaps store them in another file for later reference. — cHam, Jun 28 '12 at 14:28

score 2 · Accepted Answer · edited May 23 '17 at 10:09

You need to build a master index that stores each triple, the number of files it appears in, and the exact file name and location of the triple within each file. You can then search the master index to answer queries in real time.

As you update, delete, or create RDF files, you need to update the master index.

You need to store the master index so that it can be updated and searched efficiently.
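
To make the shape of such an index concrete, here is a minimal sketch in Java. It is an illustration only, not the answerer's code: the names MasterIndex, Occurrence, add, removeFile, fileCount and commonTriples are made up, a triple is assumed to be a single line of text, and the "location" is assumed to be a line number. The HashMap only models the structure; the answer itself recommends keeping the real index in a database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of the master index; all names are illustrative. */
public class MasterIndex {

    /** One place where a triple was seen (assumption: location = line number). */
    public static final class Occurrence {
        public final String fileName;
        public final int lineNumber;

        Occurrence(String fileName, int lineNumber) {
            this.fileName = fileName;
            this.lineNumber = lineNumber;
        }
    }

    // triple text -> every place it was seen
    private final Map<String, List<Occurrence>> index = new HashMap<>();

    /** Records that a triple appears in the given file at the given line. */
    public void add(String triple, String fileName, int lineNumber) {
        index.computeIfAbsent(triple, k -> new ArrayList<>())
             .add(new Occurrence(fileName, lineNumber));
    }

    /** Drops all entries for one file, e.g. before re-indexing it after a change. */
    public void removeFile(String fileName) {
        for (List<Occurrence> occurrences : index.values()) {
            occurrences.removeIf(o -> o.fileName.equals(fileName));
        }
        index.values().removeIf(List::isEmpty);
    }

    /** Number of distinct files that contain the triple. */
    public int fileCount(String triple) {
        Set<String> files = new HashSet<>();
        for (Occurrence o : index.getOrDefault(triple, Collections.<Occurrence>emptyList())) {
            files.add(o.fileName);
        }
        return files.size();
    }

    /** Triples that appear in at least minFiles distinct files, in sorted order. */
    public Set<String> commonTriples(int minFiles) {
        Set<String> result = new TreeSet<>();
        for (String triple : index.keySet()) {
            if (fileCount(triple) >= minFiles) {
                result.add(triple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        MasterIndex idx = new MasterIndex();
        idx.add("<a> <p> <b> .", "one.nt", 1); // file names are placeholders
        idx.add("<a> <p> <b> .", "two.nt", 7);
        idx.add("<c> <p> <d> .", "one.nt", 2);
        System.out.println(idx.commonTriples(2));
    }
}
```

Calling removeFile and then add again for a changed file is one way to keep the index in step with the files; a database table would do the same job with DELETE and INSERT statements.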

A simple choice is to use a relational database (such as MySQL) to store the master index. It can answer your queries, such as finding the triples common to two or more files, with a simple select statement: select * from rdfindex where triplecount >= 2.

EDIT: You cannot store hundreds of thousands of triples in memory using a HashMap or a similar data structure. That's why I suggested using a database, which can store the data and respond to your queries efficiently. You can look at an embedded database such as SQLite to store the data.

Read up on these topics:

How to create an SQLite database, create tables, and access them. Create a simple table to store triple, triplecount, and filenames.

Convert all your Excel files to CSV files. You can use opencsv to parse the file in Java (check out the samples that come with opencsv).

Parse the CSV files and load the data into SQLite. If the triple is already in the database, just update the count; if not, insert the triple.
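
The load step above can be sketched roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: it assumes one triple per line, reads with the standard library rather than opencsv, and uses a HashMap as a stand-in for the SQLite table so that it runs without a JDBC driver. The names TripleLoader, loadLines, loadFile, common and writeCommon are hypothetical. With a real database, the two branches in loadLines would become an UPDATE and an INSERT.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical loader; the map stands in for a table of (triple, triplecount). */
public class TripleLoader {

    // triple text -> number of files that contain it
    private final Map<String, Integer> tripleCount = new HashMap<>();

    /** Loads the lines of ONE file, applying "update the count, else insert". */
    public void loadLines(Iterable<String> lines) {
        // Count each triple at most once per file, even if the file repeats it.
        Set<String> seenInThisFile = new HashSet<>();
        for (String line : lines) {
            String triple = line.trim();
            if (triple.isEmpty() || !seenInThisFile.add(triple)) {
                continue;
            }
            Integer count = tripleCount.get(triple);
            if (count != null) {
                tripleCount.put(triple, count + 1); // already indexed: update the count
            } else {
                tripleCount.put(triple, 1);         // first sighting: insert
            }
        }
    }

    /** Streams one file from disk instead of reading it all into memory first. */
    public void loadFile(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            // Stream.iterator() lets the lazy line stream be used as an Iterable.
            loadLines(reader.lines()::iterator);
        }
    }

    /** Triples seen in at least minFiles files, sorted for stable output. */
    public List<String> common(int minFiles) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : tripleCount.entrySet()) {
            if (entry.getValue() >= minFiles) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    /** Writes the common triples to another file, one per line. */
    public void writeCommon(Path out, int minFiles) throws IOException {
        Files.write(out, common(minFiles), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        TripleLoader loader = new TripleLoader();
        loader.loadLines(Arrays.asList("<a> <p> <b> .", "<c> <p> <d> ."));
        loader.loadLines(Arrays.asList("<a> <p> <b> .", "<e> <p> <f> ."));
        System.out.println(loader.common(2));
    }
}
```

Calling loadFile once per file and then writeCommon with minFiles set to 2 would write out the triples found in two or more files, which is what the question asks for.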

That sounds great but I have no idea how to do it. I am entirely new to this and learning as I go. Any advice on how to get started? — cHam, Jun 28 '12 at 15:36

score 0 · Answer 2 · answered Jun 28 '12 at 14:59

As far as I know, AllegroGraph has a function to delete duplicate entries; this may be an option if all the triples come from there.

pgras
