
As part of my DSpace instance, I have a Solr repository containing 12 million usage statistics records. Some records have migrated through multiple Solr upgrades and do not conform to the current schema. 5 million of these records are missing the unique id field specified in my schema.

The DSpace system provides a mechanism to shard older usage statistics records into a separate Solr shard, using the following code.

DSPACE SHARD LOGIC:

        for (File tempCsv : filesToUpload) {
            //Upload the data in the csv files to our new solr core
            ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
            contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
            contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");

            statisticsYearServer.request(contentStreamUpdateRequest);
        }
        statisticsYearServer.commit(true, true);

When I attempted to run this process, I received an error message for each of my records missing the unique id field and the 5 million records were dropped by the process.

I have attempted to replace these 5 million records in order to force the creation of a unique id field on each record. Here is the code that I am running to trigger that update. The query myQuery iterates over batches of several thousand records.

MY RECORD REPAIR PROCESS:

    ArrayList<SolrInputDocument> idocs = new ArrayList<SolrInputDocument>();
    SolrQuery sq = new SolrQuery();
    sq.setQuery(myQuery);
    sq.setRows(MAX);
    sq.setSort("time", ORDER.asc);

    QueryResponse resp = server.query(sq);
    SolrDocumentList list = resp.getResults();

    // Convert each retrieved document back into an input document
    for (SolrDocument doc : list) {
        idocs.add(ClientUtils.toSolrInputDocument(doc));
    }

    // Re-add the copies, then delete the originals
    server.add(idocs);
    server.commit(true, true);
    server.deleteByQuery(myQuery);
    server.commit(true, true);

After running this process, all of the records in the repository have a unique id assigned. The records that I have touched also have a _version_ field present.

When I attempt to re-run the sharding process that I included above, I receive an error related to the _version_ field value and the process terminates. If I attempt to set the version field explicitly, I receive the same error.

Here is the error message that I am encountering when I invoke the shard process:

Exception: version conflict for e8b7ba64-8c1e-4963-8bcb-f36b33216d69 expected=1484794833191043072 actual=-1
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for e8b7ba64-8c1e-4963-8bcb-f36b33216d69 expected=1484794833191043072 actual=-1
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)

My goal is to repair my records so that I can run the shard process provided by DSpace. Can you recommend any additional action that I should take to repair these records?

terrywb
    Not a full answer but maybe this can help: the uid field got added as part of DSpace 3, together with the introduction of search and workflow stats cfr history of: https://github.com/DSpace/DSpace/blob/master/dspace/solr/statistics/conf/schema.xml#L308 So I imagine some process in 1.8->3.0 upgrading must take care of uids. Looking at solrconfig.xml, adding the uid seems to be part of an updateprocessor chain: https://github.com/DSpace/DSpace/blob/master/dspace/solr/statistics/conf/solrconfig.xml#L1828 But I didn't find any specific info as to where the uid gets generated for older stats. – Bram Luyten Nov 16 '14 at 11:34
    See https://jira.duraspace.org/browse/DS-2212 for a continuation of this discussion. – terrywb Jan 21 '15 at 20:05

3 Answers


It may be easier to modify the generated CSV.

Try adding the id to the CSV directly, by calling a method that does that before the upload loop.

    FileUtils.copyInputStreamToFile(csvInputstream, csvFile);

    // <- a method call to a function that reopens the csv file
    //    and adds the mandatory id to each line

    filesToUpload.add(csvFile);
    //Add 10000 & start over again
    yearQueryParams.put(CommonParams.START, String.valueOf((i + 10000)));
    }

    for (File tempCsv : filesToUpload) {
        (...)
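A minimal sketch of the helper described above follows. The class name, the choice of `uid` as the column name, the assumption that the first CSV line is a header, and the use of `UUID.randomUUID()` are all assumptions for illustration; a real fix would need to match the statistics schema's id field and the actual layout of the exported CSV.

```java
import java.util.UUID;

// Hypothetical sketch, not DSpace code: prepend a generated uid to each
// data line of an exported CSV. Assumes the first line is a header and
// that no id column exists yet.
public class CsvIdFixer {
    public static String addIds(String csv) {
        String[] lines = csv.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip trailing empty line from the split
            }
            if (i == 0) {
                // header row: name the new leading column
                out.append("uid,").append(lines[i]).append('\n');
            } else {
                // data row: prepend a freshly generated unique id
                out.append(UUID.randomUUID()).append(',')
                   .append(lines[i]).append('\n');
            }
        }
        return out.toString();
    }
}
```

In practice this would run between `FileUtils.copyInputStreamToFile(...)` and `filesToUpload.add(csvFile)`, reading and rewriting the temporary file on disk.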

Adán
  • Adán, thank you for this suggestion. Before I posted my question, I had some success by manipulating the CSV files. Ideally, I would like to resolve this issue in the section that I labeled as "MY RECORD REPAIR PROCESS". I presume that I am doing something incorrectly in that process that is causing the errors related to the "_version_" field. – terrywb Nov 18 '14 at 17:03

The sharding code in SolrLogger copies records into a new, empty core. The problem is that DSpace usage statistics documents from about DSpace 3 onwards contain a _version_ field, and this field is included in the copy during sharding.

When documents containing a _version_ field are added to a Solr index, this triggers Solr's optimistic concurrency functionality, which checks for an existing document with the same unique ID in the index. The logic goes roughly like this (see http://yonik.com/solr/optimistic-concurrency/):

  • _version_ > 1: Document version must exactly match
  • _version_ = 1: Document must exist
  • _version_ < 0: Document must not exist
  • _version_ = 0: Don't care (normal overwrite if exists)

The usage statistics documents containing a _version_ value > 1 thus make Solr look for an existing document with the same unique ID in the newly created year shard; however, clearly there is no such document at that point, hence the version conflict.
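These rules can be restated as a small pure-Java sketch; the `addWouldSucceed` helper and its signature are purely illustrative (Solr performs this check server-side, and this is not SolrJ API). Under these rules, the error in the question (`expected=1484794833191043072 actual=-1`) is the first case: the document carries a _version_ > 1, but no document with that unique ID exists yet in the new core.

```java
// Illustrative restatement of Solr's optimistic concurrency rules.
// Hypothetical helper, NOT SolrJ API.
public class VersionCheck {
    /**
     * Decides whether an add carrying the given _version_ would succeed.
     * existingVersion is the stored version of the document with the same
     * unique ID, or null if no such document exists in the index.
     */
    public static boolean addWouldSucceed(long version, Long existingVersion) {
        if (version > 1) {
            // _version_ > 1: stored version must match exactly
            return existingVersion != null && existingVersion == version;
        }
        if (version == 1) {
            // _version_ = 1: document must exist (any version)
            return existingVersion != null;
        }
        if (version < 0) {
            // _version_ < 0: document must not exist
            return existingVersion == null;
        }
        // _version_ = 0: don't care, normal overwrite
        return true;
    }
}
```

For example, adding a copied document whose _version_ is 1484794833191043072 into an empty year core fails the check, which matches the reported version conflict.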

The copy process during the sharding creates temporary CSV files that are then imported into the new core. Luckily, Solr's CSV update handler can be told to exclude specific fields from the import, using the skip parameter: https://wiki.apache.org/solr/UpdateCSV#skip

Changing the sharding code like so

//Upload the data in the csv files to our new solr core
ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest("/update/csv");
contentStreamUpdateRequest.setParam("stream.contentType", "text/plain;charset=utf-8");
+ contentStreamUpdateRequest.setParam("skip", "_version_");
contentStreamUpdateRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
contentStreamUpdateRequest.addFile(tempCsv, "text/plain;charset=utf-8");

skips the _version_ field, which in turn disables the optimistic concurrency check.

This is discussed in https://jira.duraspace.org/browse/DS-2212 with a pull request at https://github.com/DSpace/DSpace/pull/893; hopefully this will be included in DSpace 5.2.

schweerelos

I was trying to upgrade 1.8.3 to 4.2 with 4 million records, all missing uid and version. I wrote a script to read from Solr (in batches of 10,000), write copies back in, and finally delete the originals. The results looked good until I tried sharding, when I saw the same issue reported here.

The CSV files contained correct version numbers. The exception reported was

Exception: version conflict for 38dbd4db-240e-4c9b-a927-271fee5db750 expected=1490271991641407488 actual=-1
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: version conflict for 38dbd4db-240e-4c9b-a927-271fee5db750 expected=1490271991641407488 actual=-1

The first record in temp/temp.2012.0.csv begins

38dbd4db-240e-4c9b-a927-271fee5db750,1490271991641407488, ...

Havenless