
I'm using Duke for record linkage, and in a basic test I get a java.lang.ArrayIndexOutOfBoundsException: 1000 thrown from CSVReader.

This is my Java class:

    Configuration config = ConfigLoader.load("resources/dukeConfiguration.xml");
    Processor proc = new Processor(config);
    proc.addMatchListener(new PrintMatchListener(true, true, true, false,
                                                 config.getProperties(),
                                                 true));
    proc.link();
    proc.close();

And this is the configuration file:

<duke>

<schema>
    <threshold>0.7</threshold>

    <property type="id">
        <name>ID</name>
    </property>

    <property>
        <name>TITLE</name>
        <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
        <low>0.09</low>
        <high>0.93</high>
    </property>
    <property>
        <name>ARTIST</name>
        <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
        <low>0.04</low>
        <high>0.73</high>
    </property>
</schema>

<group>
    <jdbc>
        <param name="driver-class" value="com.mysql.jdbc.Driver" />
        <param name="connection-string" value="jdbc:mysql://localhost:3306/digitalmusic" />
        <param name="user-name" value="root" />
        <param name="password" value="root" />
        <param name="query" value="select * from inventory" />

        <column name="idsong" property="ID" />
        <column name="title" property="TITLE" />
        <column name="artist" property="ARTIST" />
    </jdbc>
</group>

<group>
    <csv>
        <param name="input-file" value="/home/mongo.csv" />
        <param name="header-line" value="false" />

        <column name="1" property="ID" />
        <column name="2" property="TITLE" />
        <column name="3" property="ARTIST" />
    </csv>
</group>

</duke>

Does anyone know where the problem is?

Stacktrace:

Records: 0

Records: 40000

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1000
    at no.priv.garshol.duke.utils.CSVReader.next(CSVReader.java:70)
    at no.priv.garshol.duke.datasources.CSVDataSource$CSVRecordIterator.findNextRecord(CSVDataSource.java:170)
    at no.priv.garshol.duke.datasources.CSVDataSource$CSVRecordIterator.next(CSVDataSource.java:198)
    at no.priv.garshol.duke.datasources.CSVDataSource$CSVRecordIterator.next(CSVDataSource.java:111)
    at no.priv.garshol.duke.Processor.linkRecords(Processor.java:362)
    at no.priv.garshol.duke.Processor.link(Processor.java:319)
    at no.priv.garshol.duke.Processor.link(Processor.java:298)
    at no.priv.garshol.duke.Processor.link(Processor.java:285)
    at duke.DukeCollecting.main(DukeCollecting.java:20)
randrade86
Edoardo Basili

1 Answer


OK, here is your problem.

According to the latest source on GitHub, this is what happens when a new CSVReader is instantiated:

public CSVReader(Reader in, int buflen, String file) throws IOException {
    this.buf = new char[buflen];
    this.pos = 0;
    this.len = in.read(buf, 0, buf.length);
    this.tmp = new String[1000];
    this.in = in;
    this.separator = ','; // default
    this.file = file;

}

According to your stacktrace, the error is happening in this block:

    if (escaped_quote)
      tmp[colno++] = unescape(new String(buf, prev + 1, pos - prev - 1));
    else
      tmp[colno++] = new String(buf, prev + 1, pos - prev - 1);

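The problem is that colno in CSVReader has grown past the capacity of the tmp array, which is fixed at 1000 entries, so the write to tmp[colno++] throws a java.lang.ArrayIndexOutOfBoundsException. This suggests the reader is seeing more than 1000 column values in what it treats as a single row of your file.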
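To make the failure mode concrete, here is a minimal sketch of the same pattern. It is not Duke's code: the class FixedBufferDemo and the method splitIntoFixedBuffer are hypothetical names, and the parsing is reduced to a plain split on commas with no quote handling. It only shows that writing field number capacity + 1 into a fixed-size array throws the exception you are seeing.

```java
class FixedBufferDemo {
    // Hypothetical stand-in for a parser that collects the columns of one row
    // into a preallocated array. This is an illustration, not Duke's CSVReader.
    static String[] splitIntoFixedBuffer(String row, int capacity) {
        String[] tmp = new String[capacity];
        int colno = 0;
        int prev = -1;
        for (int pos = 0; pos <= row.length(); pos++) {
            if (pos == row.length() || row.charAt(pos) == ',') {
                // Throws ArrayIndexOutOfBoundsException once colno == capacity.
                tmp[colno++] = row.substring(prev + 1, pos);
                prev = pos;
            }
        }
        String[] result = new String[colno];
        System.arraycopy(tmp, 0, result, 0, colno);
        return result;
    }

    public static void main(String[] args) {
        // A normal three-column row fits easily.
        System.out.println(splitIntoFixedBuffer("1,Song,Artist", 1000).length);

        // A row with 1001 fields overflows a 1000-slot buffer.
        StringBuilder wide = new StringBuilder("x");
        for (int i = 0; i < 1000; i++) {
            wide.append(",x");
        }
        try {
            splitIntoFixedBuffer(wide.toString(), 1000);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Overflow: " + e.getMessage());
        }
    }
}
```

If your data really has only three columns, a row this wide usually points to malformed input rather than to a buffer that is too small.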

These are your options, IMHO:

  • Option 1: Fork the project, increase the size of the tmp buffer until your program runs without errors, and recompile; or

  • Option 2: Check the GitHub project page for open issues about this problem (or open a new one), and find out whether malformed data in your files could be causing the array overflow.

I recommend Option 2 unless you are in a hurry.
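For the "malformed data" part of Option 2, a rough check like the one below might help you find suspicious rows before involving Duke at all. This is only a sketch under several assumptions: CsvColumnCheck and its methods are hypothetical names, the expected count of 3 is taken from the three columns mapped in your configuration, the file is assumed to be UTF-8, and the check works line by line, so a quoted value that legitimately contains a line break would be reported as a false positive.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class CsvColumnCheck {
    // Counts comma-separated fields, ignoring commas inside double quotes.
    static int countFields(String line) {
        int fields = 1;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, so the state is preserved.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields++;
            }
        }
        return fields;
    }

    // Returns the 1-based numbers of the lines whose field count differs from 'expected'.
    static List<Integer> suspiciousLines(Path file, int expected) throws IOException {
        List<Integer> bad = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (countFields(line) != expected) {
                    bad.add(lineNo);
                }
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        // The path defaults to the input-file from the question's configuration.
        Path file = Paths.get(args.length > 0 ? args[0] : "/home/mongo.csv");
        for (int lineNo : suspiciousLines(file, 3)) {
            System.out.println("Check line " + lineNo);
        }
    }
}
```

Lines it reports are worth inspecting by hand for stray quotes, unexpected separators, or unusual line endings.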

Good luck!

randrade86