My core c2020 is running and available when I visit http://localhost:8983/solr/#/c2020/query.

Locally, when I try to run:

solr-7.7.2> java -jar -Dc=c2020 example\exampledocs\post.jar "C:\temp\path_to\a_doc.pdf"

I get:

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/c2020/update using content-type application/xml...
POSTing file A Half Century of Macro Momentum_vf.pdf to [base]
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/c2020/update
SimplePostTool: WARNING: Response: <?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">400</int>
  <int name="QTime">6</int>
</lst>
<lst name="error">
  <lst name="metadata">
    <str name="error-class">org.apache.solr.common.SolrException</str>
    <str name="root-error-class">java.io.CharConversionException</str>
  </lst>
  <str name="msg">Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)</str>
  <int name="code">400</int>
</lst>
</response>
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/c2020/update
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/c2020/update...
Time spent: 0:00:00.310

Now, if I run:

java -Durl=http://localhost:8983/solr/c2020/update/extract -jar example\exampledocs\post.jar "C:\temp\path_to\a_doc.pdf"

It works:

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/c2020/update/extract using content-type application/xml...
POSTing file a_doc.pdf to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/c2020/update/extract...
Time spent: 0:00:14.647

But it does not send all the fields I want to see, such as the file path or the file name.

I'd like to get the raw post doc to work if anyone could advise.
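
One possible way to attach the path and file name is to pass them as `literal.*` request parameters to `/update/extract`. The sketch below only builds such a URL and sends nothing. The helper name `build_extract_url` and the field `filename_s` are assumptions for illustration; `filename_s` would only be accepted if the schema has a matching field or a `*_s` dynamic field.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlencode


def build_extract_url(base_url, core, file_path, extra_fields=None):
    """Build an /update/extract URL carrying the file path and name as parameters.

    Hypothetical helper: it only assembles the URL and does not contact Solr.
    """
    path = PureWindowsPath(file_path)
    params = {
        "literal.id": str(path),     # full path stored as the document id
        "resource.name": path.name,  # file name hint for content-type detection
        "commit": "true",
    }
    # Each extra field is sent as literal.<field>; the field names are assumptions.
    for field, value in (extra_fields or {}).items():
        params["literal." + field] = value
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983/solr",
    "c2020",
    r"C:\temp\path_to\a_doc.pdf",
    {"filename_s": "a_doc.pdf"},
)
print(url)
```

The same key/value pairs could presumably also be handed to post.jar through its `-Dparams` property rather than being appended to `-Durl` by hand.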

jason m
  • When you post the PDF file directly to `/update`, you're effectively trying to send PDF content as an XML- or JSON-formatted update. That is not a valid Solr update request; the UTF-8 error is telling you that Solr wasn't able to decode the request in order to look at it properly. The simple post tool you're using _should_ include the full path in the `id` field when the document is submitted. The file path is not metadata that is available inside the document itself, so it has to be submitted as additional information by the client. – MatsLindh Apr 29 '20 at 21:20
  • Since posting, it seems `C:\solr-8.5.0> java -jar -Dc=techproducts -Dauto example\exampledocs\post.jar example\exampledocs\*` does the trick. I guess -Dauto was the flag I was missing? – jason m Apr 29 '20 at 21:26
  • @MatsLindh Since you seem to know a good bit: how would I go about indexing `.ipynb` files? Is there a way to post these and let Solr know that they are just JSON files, as in automap ipynb->json? – jason m Apr 29 '20 at 21:41
  • You can use the `update/json/docs` endpoint to index something close to arbitrary JSON. You won't use the `/extract` endpoint in that case, since I guess you want to do some custom processing on the content. See [Transforming and Indexing Custom JSON](https://lucene.apache.org/solr/guide/8_5/transforming-and-indexing-custom-json.html). However, if you only need a simple "index everything I give you", you might have success by explicitly setting the content type to `application/json` with `-Dtype=....` instead of using `-Dauto` (which guesses based on the file extension, iirc). – MatsLindh Apr 30 '20 at 05:26
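
As a rough sketch of the custom-processing route for `.ipynb` files mentioned in the last comment: a notebook is a JSON document with a `cells` list, so it could be flattened into one Solr document and sent to `update/json/docs`. The helper names, the sample notebook, the path `a_notebook.ipynb`, and the fields `code_txt` and `markdown_txt` are all assumptions for illustration; the field names rely on a `*_txt` dynamic field being present in the schema. The request is built but not sent.

```python
import json
from urllib.request import Request


def notebook_to_doc(notebook_text, file_path):
    """Flatten a notebook's cells into a single Solr document (hypothetical layout)."""
    nb = json.loads(notebook_text)
    code, markdown = [], []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # A cell's source may be stored as a list of lines or as one string.
        if isinstance(source, list):
            source = "".join(source)
        if cell.get("cell_type") == "code":
            code.append(source)
        elif cell.get("cell_type") == "markdown":
            markdown.append(source)
    return {"id": file_path, "code_txt": code, "markdown_txt": markdown}


def build_json_docs_request(base_url, core, doc):
    """Prepare (but do not send) a POST to the update/json/docs endpoint."""
    url = f"{base_url}/{core}/update/json/docs?commit=true"
    body = json.dumps(doc).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Tiny made-up notebook, used only to exercise the helpers.
sample = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "text"]},
        {"cell_type": "code", "source": "print(1)"},
    ]
})
doc = notebook_to_doc(sample, r"C:\temp\path_to\a_notebook.ipynb")
req = build_json_docs_request("http://localhost:8983/solr", "c2020", doc)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually submit it. Keeping the path in `id` mirrors what the comments say the post tool should do for other file types.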
