
I am attempting to use Duke to match records from one database to another. One db has song titles and writers; I am trying to match it against another db to find duplicates and corresponding records.

I have gotten Duke to run and I can see some of the records getting matched. But no matter what I do, Correct links found is always 0% and I just can't write to the link file.

This is what I have done currently:

<duke>
<schema>
    <threshold>0.79</threshold>
    <maybe-threshold>0.70</maybe-threshold>
    <path>test</path>

    <property type="id">
        <name>PublishingID</name>
    </property>

    <property type="id">
        <name>AmgID</name>
    </property>

    <property>
        <name>NAME</name> 
        <comparator>no.priv.garshol.duke.comparators.JaroWinkler</comparator>
        <low>0.12</low>
        <high>0.61</high>
    </property>
    <property>
        <name>TITLE</name> 
        <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
        <low>0.09</low>
        <high>0.93</high>
    </property>
</schema>

<group>
    <jdbc>
        <param name="driver-class" value="com.mysql.jdbc.Driver"/>
        <param name="connection-string" value="jdbc:mysql://127.0.0.1"/>
        <param name="user-name" value="root"/>
        <param name="password" value="root"/>
        <param name="query" value="
            SELECT pSongs.song_id, pSongs.songtitle, pSongs.publisher_id, pWriters.first_name AS writer_first_name, pWriters.last_name AS writer_last_name 
            FROM    devel_matching.publisher_songs AS pSongs
            INNER JOIN devel_matching.publisher_writers as pWriters ON pWriters.publisher_id = pSongs.publisher_id AND pWriters.song_id = pSongs.song_id
            WHERE pSongs.writers LIKE '%LENNON, JOHN%'
            LIMIT 20000;"/>
        <column name="song_id" property="PublishingID"/>
        <column name="songtitle" property="TITLE" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
        <column name="writer_first_name" property="NAME" cleaner = "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
    </jdbc>
</group>

<group>
    <jdbc>
        <param name="driver-class" value="com.mysql.jdbc.Driver"/>
        <param name="connection-string" value="jdbc:mysql://127.0.0.1"/>
        <param name="user-name" value="root"/>
        <param name="password" value="root"/>
        <param name="query" value="
            SELECT amgSong.id, amgSong.track, SUBSTRING_INDEX(SUBSTRING_INDEX(amgSong.composer, '/', numbers.n), '/', -1) composer
            FROM 
                devel_matching.numbers INNER JOIN devel_matching.track as amgSong
                ON CHAR_LENGTH(amgSong.composer) - CHAR_LENGTH(REPLACE(amgSong.composer, '/', '')) >= numbers.n - 1
            WHERE amgSong.composer like '%lennon%'
            LIMIT 5000;"/>
        <column name="id" property = "AmgID"/>
        <column name="track" property="TITLE" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
        <column name="composer" property="NAME" cleaner = "no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner"/>
    </jdbc>
</group>
</duke>

Output:

  • Total records: 5000
  • Total matches: 8284
  • Total non-matches: 1587
  • Correct links found: 0 / 0 (0.0%)
  • Wrong links found: 0 / 0 (0.0%)
  • Unknown links found: 8284
  • Percent of links correct 0.0%, wrong 0.0%, unknown 100.0%
  • Precision 0.0%, recall NaN%, f-number 0.0

Running on Spring STS: program arguments = --progress --verbose --testfile=linked.txt --testdebug --showmatches duke.xml

It's not writing to linked.txt or finding any correct links. I'm not sure what I am doing wrong here. Any help would be awesome.

1 Answer


Actually, it is finding 8284 links. --testfile is for giving Duke a file containing known correct links, basically test data. What you want is --linkfile, which writes the links you've found into that file.
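
For example, keeping the other flags from the run configuration in the question, the program arguments would become something like: --progress --verbose --showmatches --linkfile=linked.txt duke.xml (--testdebug can be dropped, since it only applies when a test file is given).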

I guess I should add code which warns against an empty test file, since that very likely indicates a user error.

You'd probably be better off asking this question on the Duke mailing list, btw.

  • I read a lot more (found your presentations) and ended up writing a custom listener that keeps the matches in memory before outputting them to a CSV file (a rough sketch is included after these comments). That worked just as well. I will be joining that mailing list. If I may ask: we have Solr integrated; is it possible to use Solr as a data source and skip the indexing of records? It's been fun working with Duke btw :) – 1000Suns Mar 05 '15 at 21:14
  • Uh, yes, I'm me. :) There has been some work on using ElasticSearch directly, but none on Solr, if I remember correctly. Anyway, it's totally possible to do. There's an issue for it, so you could watch that. https://github.com/larsga/Duke/issues/132 – larsga Mar 06 '15 at 22:10
  • Thanks for replying. I will definitely watch the issue. Final question: is there a way to persist the Lucene indices from one run to another? For instance, we run a deduplication process for datasource X. If new records are added to the datasource, the second deduplication process has to index all records again, if I understand correctly? – 1000Suns Mar 09 '15 at 16:18
  • It's perfectly possible to reuse the index. There's an overwrite flag in the API you can use to control this. However, note that the second time you run, you need to use a data source which only returns the new records; otherwise everything winds up being matched all over again. – larsga Mar 09 '15 at 21:15
  • Mr. Larsga, thank you again for replying. I have been investigating Duke to see if it's a good fit. I am curious about record linkage mode. The wiki says it will look for only one match in the other data source. Does this mean that if the data source has duplicates, running Duke in record linkage mode will not resolve the duplicates? I am dealing with a problem where both data sources most likely have duplicates. Is deduplication followed by record linkage the correct approach? – 1000Suns Mar 19 '15 at 19:28
  • It's actually a little more complicated. You can run record linkage and match multiple records in the other data source, but if both sources have duplicates it may be better to just do deduplication across multiple sources directly. You can do this by just listing several data sources in the same config. – larsga Mar 20 '15 at 12:31
  • I am trying this project and have a question; please help me: why do we use --testfile, given that we need to find duplicates in another db's records? – Kishore Oct 07 '15 at 11:52
  • Kishore, please write to the mailing list or post a new question. Make sure to explain a little more, because I don't understand what you mean. – larsga Oct 08 '15 at 06:30
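
For reference, the custom listener mentioned in the comments above could look roughly like the sketch below. It is only an illustration against Duke's MatchListener API (registered with something like processor.addMatchListener(...) before running the linkage); the class name and CSV layout are made up here, and the property names are taken from the question's config.

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import no.priv.garshol.duke.Record;
import no.priv.garshol.duke.matchers.AbstractMatchListener;

// Rough sketch: collect matches in memory, then dump them to a CSV file when
// Duke finishes processing. It assumes r1 comes from the publisher group and
// r2 from the AMG group; in practice you may need to check which group each
// record belongs to before reading its id property.
public class InMemoryCsvListener extends AbstractMatchListener {
    private final List<String> rows = new ArrayList<String>();
    private final String outfile;

    public InMemoryCsvListener(String outfile) {
        this.outfile = outfile;
    }

    @Override
    public void matches(Record r1, Record r2, double confidence) {
        // Property names follow the question's schema (PublishingID, AmgID, TITLE).
        rows.add(r1.getValue("PublishingID") + "," +
                 r2.getValue("AmgID") + "," +
                 r1.getValue("TITLE") + "," + confidence);
    }

    @Override
    public void endProcessing() {
        // Write everything out once Duke has finished.
        try (FileWriter out = new FileWriter(outfile)) {
            out.write("publishing_id,amg_id,title,confidence\n");
            for (String row : rows)
                out.write(row + "\n");
        } catch (IOException e) {
            throw new RuntimeException("could not write " + outfile, e);
        }
    }
}

Keeping everything in memory is fine for runs of a few thousand matches like the one above; for larger datasets it would make more sense to stream each row to the file as it arrives.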