I understand that in SimplePostTool (post.jar), there is this command to automatically detect content types in a folder, and recursively scan it for documents for indexing into a collection:
bin/post -c gettingstarted afolder/

This has been useful for mass indexing of all the files in the folder. Now that I'm moving to production, I plan to use SolrJ for the indexing, as it can do more, such as robustness checks and retries for documents that fail to index.

However, I can't seem to find a way to do the same in SolrJ. Is this possible in SolrJ? I'm using Solr 5.3.0.

Thank you.

Regards,
Edwin

Edwin Yeo

1 Answer

If you're looking to submit content to an extracting request handler (for indexing PDFs and similar rich documents), you can use the ContentStreamUpdateRequest class, as shown in "Uploading data with SolrJ":

SolrClient server = new HttpSolrClient("http://localhost:8983/solr/my_collection");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
// In recent SolrJ versions (including 5.x), addFile also takes the file's content type.
req.addFile(new File("my-file.pdf"), "application/pdf");
server.request(req);

To iterate through a directory structure recursively in Java, see Best way to iterate through a directory in Java.
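As a rough sketch of how the recursive scan and the retries mentioned in the question could fit together, the helper below walks a folder with `Files.walk` (this assumes Java 8 or later) and hands each regular file to a callback. `FolderIndexer`, `FileSubmitter`, `listFiles` and `indexAll` are hypothetical names made up for this illustration, not part of SolrJ; the callback is where the SolrJ request would go.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of SolrJ): walks a folder and submits each file. */
class FolderIndexer {

    /** Stand-in for the indexing call; an implementation would wrap the SolrJ request. */
    interface FileSubmitter {
        void submit(Path file) throws Exception;
    }

    /** Collects every regular file under root, descending into subfolders. */
    static List<Path> listFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /**
     * Submits each file, trying up to maxAttempts times per file.
     * Returns the files that still failed after the last attempt.
     */
    static List<Path> indexAll(Path root, FileSubmitter submitter, int maxAttempts)
            throws IOException {
        List<Path> failed = new ArrayList<>();
        for (Path file : listFiles(root)) {
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                try {
                    submitter.submit(file);
                    ok = true;
                } catch (Exception e) {
                    // Assumption: every failure is retryable. Real code would
                    // log the error and probably wait before the next attempt.
                }
            }
            if (!ok) {
                failed.add(file);
            }
        }
        return failed;
    }
}
```

A call could look like `FolderIndexer.indexAll(Paths.get("afolder"), file -> { /* build and send the ContentStreamUpdateRequest for file.toFile() */ }, 3)`. The returned list holds the files that never went through, which can be logged or retried later.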

If you're planning to index plain content (and not use the request handler), you can do that by creating the documents in SolrJ itself and then submitting the documents to the server - there's no need to write them to a temporary file in between.

MatsLindh
  • Thank you MatsLindh. Yes, this works. But do you know what to do if there are non-English characters (e.g. Chinese) in the filename? Currently, they are all read as a series of '???'. – Edwin Yeo Oct 16 '15 at 07:50
  • @EdwinYeo You might have to do some work to convert it into proper Unicode, depending on the underlying file system: see http://stackoverflow.com/questions/3072376/how-can-i-open-files-containing-accents-in-java for possible solutions - it does however seem to be an issue that can be caused at a number of different levels in the code. – MatsLindh Oct 16 '15 at 15:37
  • Thank you. I've managed to get it to read the Chinese characters in Eclipse. However, when I index the Chinese characters in Solr using URLEncoder with UTF-8 encoding, it indexes something like "%E7%AB%8B%E9" instead of the Chinese characters. What could be the reason? – Edwin Yeo Oct 19 '15 at 03:23
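
On that last comment: `URLEncoder` produces percent-escapes by design, so a value passed through it before indexing is stored in its escaped form. The indexing code isn't shown, so this is only a guess, but a likely cause is that the value was encoded by hand even though SolrJ already encodes the HTTP request itself; in that case the plain string should be passed as-is. The sketch below uses only the standard library to show the effect; `EncodingDemo` is a hypothetical name for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical demo class: shows what URLEncoder does to a non-ASCII value. */
class EncodingDemo {

    /** Percent-encodes the value using UTF-8, as described in the comment. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Reverses encode(): turns the percent-escapes back into characters. */
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

`EncodingDemo.encode("\u7acb")` (the character 立, written as an escape to sidestep source-file encoding) returns `"%E7%AB%8B"`, which matches the start of the string seen in the index. `decode` turns it back into the original character, which suggests the stored value was simply encoded once too often.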