I had a similar issue, where I had to copy data from production to our QA environment. We faced two problems:
- Firewall blocking all http(s) traffic between QA and production
- Snapshots are impossible due to heavy writes and Zookeeper setup timing out
So I created a solution: simply retrieve all documents on the production server via the select handler, dump them into XML files, copy the files to the QA server, and put them in a location where the import can pick them up. Getting this to work took me way too much time, partly because of my lack of knowledge of SOLR and partly because most examples on the interwebs are wrong and everybody is just copying each other. Hence I'm sharing my solution here.
My script to dump the documents:
#!/bin/bash
# Dumps every index listed in solr-indexes.txt from the source Solr host
# into chunked XML files under ${DUMP_DIR}solr/.
SOURCE_SOLR_HOST='your.source.host'
SOLR_CHUNK_SIZE=10000
DUMP_DIR='/tmp/'
indexesfile='solr-indexes.txt'

mkdir -p "${DUMP_DIR}solr"

for index in $(cat "$indexesfile"); do
    solrurl="http://${SOURCE_SOLR_HOST}:8983/solr/${index}/select?indent=on&q=*:*&wt=xml"

    # Small initial query, only used to read the total numFound for this index
    curl "${solrurl}&rows=10" -o "/tmp/${index}.xml"
    numfound=$(grep -i numfound "/tmp/${index}.xml" | sed -e 's/.*numFound="\([0-9]*\)".*/\1/')
    chunks=$((numfound / SOLR_CHUNK_SIZE))

    # Fetch the actual documents in chunks of SOLR_CHUNK_SIZE rows
    for chunk in $(seq 0 "$chunks"); do
        start_at=$((chunk * SOLR_CHUNK_SIZE))
        curl "${solrurl}&rows=${SOLR_CHUNK_SIZE}&start=${start_at}" -o "${DUMP_DIR}solr/${index}_${chunk}.xml"
    done

    rm "/tmp/${index}.xml"
done
It reads the indexes to dump from the solr-indexes.txt file, so you can list all the indexes you need in there.
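For reference, solr-indexes.txt is nothing more than one index name per line; the names below are placeholders:

yourindexhere
anotherindexhere

Once the dump has finished, the files just need to end up on the QA server. Assuming ssh is allowed between the environments (the firewall rule above only mentions http(s)), a plain scp of the chunked dumps would do; user, host and target path are placeholders:

scp /tmp/solr/*.xml youruser@your.qa.host:/solr-import/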
During one of my searches I ended up on this question, and the answers here helped me a bit with the import, but not entirely. You see, the examples by Duvo and Segfaulter don't work if you simply copy-paste them into SOLR. For instance, the requestHandler tag is ignored by SOLR if you don't use the correct case.
This is the correct format of what I added to solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/dist" regex="solr-dataimporthandler-7.5.0.jar" />
<lib dir="${solr.install.dir:../../../..}/dist" regex="solr-dataimporthandler-extras-7.5.0.jar" />
<requestHandler class="org.apache.solr.handler.dataimport.DataImportHandler" name="/dataimport">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
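To check that the jars are picked up and the handler is actually registered, you can ask it for its status (host and index name are placeholders); if everything loaded, it should answer with an idle status:

curl "http://your.qa.host:8983/solr/yourindexhere/dataimport?command=status"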
For the data-config.xml I used something similar to this:
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity
        name="yourindexhere"
        processor="FileListEntityProcessor"
        baseDir="/solr-import/"
        fileName="yourindexhere_.*"
        preImportDeleteQuery="*:*"
        recursive="false"
        rootEntity="false"
        dataSource="null">
      <entity
          name="file"
          processor="XPathEntityProcessor"
          url="${yourindexhere.fileAbsolutePath}"
          xsl="xslt/updateXml.xsl"
          useSolrAddSchema="true"
          stream="true">
      </entity>
    </entity>
  </document>
</dataConfig>
I copied all the dumps into the /solr-import/ directory and applied the above configuration to each and every index config. I initiated the full-import via the UI, but you could also trigger it via the dataimport request.
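If you want to script that last step too, the dataimport request is just an HTTP call (host and index name are placeholders), and the command=status call shown earlier lets you poll its progress:

curl "http://your.qa.host:8983/solr/yourindexhere/dataimport?command=full-import"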
The XSL transformation is performed by the default updateXml.xsl, so it understands the dumped output created by SOLR and translates it automatically into documents matching the index schema. At least, it does if the schemas of production and QA are the same. ;)
The FileListEntityProcessor also uses a regex so it can ingest multiple files. This was necessary because some of our indexes contain millions of items, and if you try to transform all of them at once the Java process will quickly run out of memory. So I chunked them into 10,000 rows per file, which I found gave the best performance.