3

I need to import data from one Solr instance to another: a full data and index import.

I have searched and spent some time on Google, but I did not find a proper solution. This link has a similar question, but I could not find a proper answer there.

I am new to Solr, so I hope someone can help.

I have one live instance running on a remote box, and I need the same data set in another instance, so I am thinking a full data import should be possible.

My questions are:

  • Does Solr already support a full data set import, or are there any tools for it? Or
  • Do I need to write a custom data handler for this purpose?

Thanks in advance for any kind of help or information.

Gautam
  • Do you have to keep the two instances in sync daily, or just create another index? Because you can just copy the whole core (with its index) onto the new server and run `solr start -s CORELOC`; that should do it. This is for version 5 and later. I have never used anything older, so I don't know about those. – darthsidious May 12 '16 at 14:09
  • I just need to copy from one server to other server. – Gautam May 13 '16 at 09:18
  • Did you try to copy the whole index from one server to the other and use that index for your new instance? – darthsidious May 17 '16 at 13:39
  • I have not tried that. Will that work? It is still not working. – Gautam May 18 '16 at 08:17

3 Answers

3

I had a similar issue, where I had to make a copy from production to our QA environment. We faced two problems:

  1. Firewall blocking all http(s) traffic between QA and production
  2. Snapshots are impossible due to heavy writes and Zookeeper setup timing out

So I created a solution that simply retrieves all documents on the production server via the select handler and dumps them into XML files. I then copy the files to the QA server and put them in a location where the import can pick them up. Getting this to work took me way too much time, partly because of my lack of knowledge of SOLR and partly because most examples on the interwebs are wrong and everybody is just copying each other. Hence I'm sharing my solution here.

My script to dump the documents:

#!/bin/bash
# Dump every document of each listed index into chunked XML files.
SOURCE_SOLR_HOST='your.source.host'
SOLR_CHUNK_SIZE=10000
DUMP_DIR='/tmp/'

indexesfile='solr-indexes.txt'
# The chunk files are written to ${DUMP_DIR}solr/, so make sure it exists.
mkdir -p "${DUMP_DIR}solr"
for index in $(cat "$indexesfile"); do
  solrurl="http://${SOURCE_SOLR_HOST}:8983/solr/$index/select?indent=on&q=*:*&wt=xml"
  # Small probe query, only used to read numFound from the response.
  curl "${solrurl}&rows=10" -o "/tmp/$index.xml"
  numfound=$(grep -i numfound "/tmp/$index.xml" | sed -e 's/.*numFound="\([0-9]*\)".*/\1/')
  chunks=$(( numfound / SOLR_CHUNK_SIZE ))
  # Fetch chunks 0..chunks, each starting at chunk * SOLR_CHUNK_SIZE.
  for (( chunk = 0; chunk <= chunks; chunk++ )); do
    start_at=$(( chunk * SOLR_CHUNK_SIZE ))
    curl "${solrurl}&rows=${SOLR_CHUNK_SIZE}&start=${start_at}" -o "${DUMP_DIR}solr/${index}_${chunk}.xml"
  done
  rm "/tmp/$index.xml"
done

It's reading the indexes to dump from the solr-indexes.txt file, so you can define all indexes in there.
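The `start` offsets used by the dump script follow directly from `numFound` and the chunk size. As a minimal sketch of that arithmetic (the helper name `chunk_starts` is hypothetical and not part of the original script), the offsets can be generated like this. Unlike the `0..numFound/chunk_size` loop, this variant does not request a trailing empty chunk when `numFound` is an exact multiple of the chunk size:

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print the `start`
# offset of every chunk needed to page through num_found documents.
chunk_starts() {
  local num_found=$1 chunk_size=$2
  local start=0
  while [ "$start" -lt "$num_found" ]; do
    echo "$start"
    start=$(( start + chunk_size ))
  done
}
```

For example, `chunk_starts 25000 10000` prints 0, 10000 and 20000, one per line, so three chunk files would be written.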

During one of my searches I ended up on this question and the answers here helped me a bit with the import, but not entirely. You see, the examples by Duvo and Segfaulter don't work if you copy-paste them into SOLR. For instance the requestHandler tag is ignored by SOLR if you don't use the correct case.

This is the correct format of what I added to the solrconfig:

  <lib dir="${solr.install.dir:../../../..}/dist" regex="solr-dataimporthandler-7.5.0.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist" regex="solr-dataimporthandler-extras-7.5.0.jar" />
  <requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler" name="/dataimport">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

For the data-config.xml I used something similar to this:

<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity
      name="yourindexhere"
      processor="FileListEntityProcessor"
      baseDir="/solr-import/"
      fileName="yourindexhere_.*"
      preImportDeleteQuery="*:*"
      recursive="false"
      rootEntity="false"
      dataSource="null">
      <entity
        name="file"
        processor="XPathEntityProcessor"
        url="${yourindexhere.fileAbsolutePath}"
        xsl="xslt/updateXml.xsl"
        useSolrAddSchema="true"
        stream="true">
      </entity>
    </entity>
  </document>
</dataConfig>

I copied all dumps into the /solr-import/ directory and applied the above configurations to each and every index config. Via the UI I initiated the full-import, but you could also trigger this via the dataimport request.
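Triggering the import via the dataimport request could look roughly like the sketch below. It assumes the handler is registered as `/dataimport` (as in the solrconfig above) and that Solr listens on port 8983. The function names and the host argument are placeholders, not something from my actual setup:

```shell
#!/bin/bash
# Build the DataImportHandler URL for one index.
# host and core are placeholders supplied by the caller.
dataimport_url() {
  local host=$1 core=$2 command=$3
  echo "http://${host}:8983/solr/${core}/dataimport?command=${command}"
}

# Start a full import for every index listed in the given file
# (same format as solr-indexes.txt: one index name per line).
# Nothing runs until this function is called explicitly.
trigger_imports() {
  local host=$1 indexesfile=$2 index
  while read -r index; do
    [ -n "$index" ] || continue
    curl -s "$(dataimport_url "$host" "$index" full-import)"
  done < "$indexesfile"
}
```

Running `trigger_imports your.qa.host solr-indexes.txt` would then start all imports. Progress can be checked with `curl "$(dataimport_url your.qa.host yourindexhere status)"`.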

The xsl transformation is performed by the default updateXml.xsl, so it will understand the dumped output created by SOLR and translate this automatically to the index schema. At least, that is if the schema between production and QA is the same. ;)

Also the FileListEntityProcessor is using a regex to be able to ingest multiple files. This was necessary as some of our indexes contain millions of items, and if you try to transform all of them at once the Java process will quickly run out of memory. So I chunked them to 10000 rows per file, which I found delivered the best performance.

2

You can use Solr DataImportHandler to import data from one Solr instance to another.

  1. Update the solrconfig.xml to configure DataImportHandler settings

    <requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler" name="/dataimport">
      <lst name="defaults">
        <str name="config">solr-data-config.xml</str>
      </lst>
    </requestHandler>

  2. Enter the following in solr-data-config.xml (the file named in the handler configuration above).

    <dataConfig>
    <document>
       <entity name="solr_doc" processor="SolrEntityProcessor" 
        query="mimeType:pdf" 
        url="http://your.solr.server:8983/solr/your-core">
       </entity>
    </document>
    </dataConfig>
    
  3. Go to destination Solr admin console, click on DataImport, select "solr_doc" from the Entity drop down, and click on Execute.
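Note that the example query `mimeType:pdf` only imports matching documents. For a full copy, as asked in the question, the query would presumably be `*:*`. The following is a sketch that writes such a config file. The function name, the output path and the `rows` batch size of 1000 are assumptions for illustration, and the source URL is the same placeholder as above:

```shell
#!/bin/bash
# Hypothetical helper: write a DataImportHandler config that copies every
# document from the source core ($1) into the config file path ($2).
write_full_copy_config() {
  local source_url=$1 out_file=$2
  cat > "$out_file" <<EOF
<dataConfig>
  <document>
    <entity name="solr_doc" processor="SolrEntityProcessor"
            query="*:*"
            rows="1000"
            url="${source_url}">
    </entity>
  </document>
</dataConfig>
EOF
}
```

It could be called as `write_full_copy_config "http://your.solr.server:8983/solr/your-core" solr-data-config.xml`, with the resulting file placed in the conf directory of the destination core.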

I found the following links useful

http://blog.trifork.com/2011/11/08/importing-data-from-another-solr/
https://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

duvo
1

From my research, this is possible. You can use the DataImportHandler to pull data from one SOLR instance into another. Having said that, it can only index the fields that are stored in the source index.

For more details, you can read the following blog post: http://blog.trifork.com/2011/11/08/importing-data-from-another-solr/

This can be done using the XPathEntityProcessor in the DataImportHandler.

Shivam
segFaulter
  • Welcome to StackOverflow! Please pull out any relevant content from the link and add it to your answer. Links are fine, but your answer should still be useful without them in case the page gets removed. – Aaron Mar 06 '17 at 18:58