As part of my DSpace instance, I have a Solr repository containing 12 million usage statistics records. Some records have been migrated through multiple Solr upgrades and do not conform to the current schema. Five million of these records are missing the unique id field specified in my schema.
The DSpace system provides a mechanism to move older usage statistics records into a separate Solr shard, using the following code.
DSPACE SHARD LOGIC:
    for (File tempCsv : filesToUpload) {
        // Upload the data in the csv files to our new solr core
        ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
        contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
        contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");
        statisticsYearServer.request(contentStreamUpdateRequest);
    }
    statisticsYearServer.commit(true, true);
When I attempted to run this process, I received an error message for each record missing the unique id field, and the process dropped all 5 million of those records.
To force the creation of a unique id field on each record, I have attempted to replace these 5 million records. Here is the code that I am running to trigger that update. The query myQuery selects the records in batches of several thousand.
MY RECORD REPAIR PROCESS:
    ArrayList<SolrInputDocument> idocs = new ArrayList<SolrInputDocument>();
    SolrQuery sq = new SolrQuery();
    sq.setQuery(myQuery);
    sq.setRows(MAX);
    sq.setSort("time", ORDER.asc);
    QueryResponse resp = server.query(sq);
    SolrDocumentList list = resp.getResults();
    if (list.size() > 0) {
        for (int i = 0; i < list.size(); i++) {
            SolrDocument doc = list.get(i);
            SolrInputDocument idoc = ClientUtils.toSolrInputDocument(doc);
            idocs.add(idoc);
        }
    }
    server.add(idocs);
    server.commit(true, true);
    server.deleteByQuery(myQuery);
    server.commit(true, true);
After running this process, all of the records in the repository have a unique id assigned. The records that I have touched also have a _version_ field present.
When I attempt to re-run the sharding process that I included above, I receive an error related to the _version_ field value and the process terminates. If I attempt to set the _version_ field explicitly, I receive the same error.
Here is the error message that I am encountering when I invoke the shard process:
Exception: version conflict for e8b7ba64-8c1e-4963-8bcb-f36b33216d69 expected=1484794833191043072 actual=-1
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for e8b7ba64-8c1e-4963-8bcb-f36b33216d69 expected=1484794833191043072 actual=-1
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
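One possible explanation, offered as an assumption rather than a confirmed diagnosis: Solr's documented optimistic concurrency treats a positive _version_ value on an incoming document as a check against the target index. A document that carries the _version_ copied from the source core would then be rejected by a core that does not yet contain it, and actual=-1 in the message would mean "not found". The sketch below is a simplified stand-alone model of that rule, not Solr's own code; the class name, the method names, and the uid field name are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionCheckSketch {

    /**
     * Simplified model of the _version_ rules described in the Solr
     * documentation. storedVersion == null means the document is not
     * present in the target core.
     */
    public static boolean accepts(Long incomingVersion, Long storedVersion) {
        if (incomingVersion == null || incomingVersion == 0L) {
            return true;                      // no version check requested
        }
        if (incomingVersion < 0L) {
            return storedVersion == null;     // document must NOT exist
        }
        if (incomingVersion == 1L) {
            return storedVersion != null;     // document must exist
        }
        // Any value greater than 1 must match the stored version exactly.
        return incomingVersion.equals(storedVersion);
    }

    /** Returns a copy of the document without the _version_ field. */
    public static Map<String, Object> stripVersion(Map<String, Object> doc) {
        Map<String, Object> copy = new HashMap<>(doc);
        copy.remove("_version_");
        return copy;
    }

    public static void main(String[] args) {
        // "uid" is an assumed name for the unique id field.
        Map<String, Object> doc = new HashMap<>();
        doc.put("uid", "e8b7ba64-8c1e-4963-8bcb-f36b33216d69");
        doc.put("_version_", 1484794833191043072L);

        // The new shard core is empty, so there is no stored version.
        Long stored = null;

        // With the copied _version_, the modeled check rejects the document.
        System.out.println(accepts((Long) doc.get("_version_"), stored));

        // Without _version_, no check is made and the document is accepted.
        System.out.println(accepts((Long) stripVersion(doc).get("_version_"), stored));
    }
}
```

If this is the cause, removing the field before re-adding the records (for example with idoc.removeField("_version_") inside the repair loop), or leaving _version_ out of the CSV that the shard process uploads, may avoid the conflict. Neither option has been verified against this DSpace setup.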
My goal is to repair my records so that I can run the shard process provided by DSpace. Can you recommend any additional action that I should take to repair these records?