
I have set up FSCrawler with Elasticsearch, with the REST service enabled. I can post a file to FSCrawler, and the text is correctly extracted and stored in the Elasticsearch index. I have verified this with Kibana.

However, I am not able to get the extracted text in the response.

I tried several setups in _settings.yaml, but I don't get the text back in the response unless I add debug=true as a query parameter when calling the FSCrawler endpoint:

http://localhost:8080/_document?debug=true

The endpoint is called directly from Postman.

Here is my _settings.yaml:

---
name: "idx"
fs:
  indexed_chars: 100%
  lang_detect: true
  continue_on_error: true
  logging: ERROR

  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "auto"
elasticsearch:
  nodes:
    - url: "https://elasticsearch:9200"
  username: "elastic"
  password: "Test123"
  ssl_verification: false
  store_source: true
  index_content: true
rest:
  url: "http://fscrawler:8080"

My FSCrawler image:

dadoonet/fscrawler:2.10-SNAPSHOT

Elastic Stack version: 8.6.2

Response:

{
    "ok": true,
    "filename": "JAVASCRIPT.pdf",
    "url": "https://elasticsearch:9200/idx/_doc/337d3e366ce4b765f650c5a87011e117"
}

I have found no way to get the extracted text in the response other than, as mentioned, setting ?debug=true.

Ralle Mc Black

1 Answer

You can either call Elasticsearch to get the indexed document:

curl https://localhost:9200/idx/_doc/337d3e366ce4b765f650c5a87011e117

Or call the simulate API of FSCrawler.
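For the first option, here is a minimal Python sketch of the GET by id, using only the standard library. It assumes the extracted text is stored under `_source.content` in the indexed document, and that the `url` field of the upload response is reachable from wherever the script runs. The function names `fetch_extracted_text` and `content_from_document` are hypothetical helpers, not part of FSCrawler or Elasticsearch.

```python
import base64
import json
import ssl
import urllib.request


def content_from_document(document):
    """Return the extracted text from an Elasticsearch GET-by-id reply.

    Assumption: FSCrawler stores the text under _source.content.
    Returns None when the field is missing.
    """
    return document.get("_source", {}).get("content")


def fetch_extracted_text(upload_response, username, password, verify_tls=True):
    """Fetch the document named in an FSCrawler upload response and return its text."""
    # The upload response carries the full Elasticsearch URL of the new document.
    request = urllib.request.Request(upload_response["url"])
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")

    context = ssl.create_default_context()
    if not verify_tls:
        # Mirrors ssl_verification: false; only sensible for local test clusters.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

    with urllib.request.urlopen(request, context=context) as reply:
        document = json.load(reply)
    return content_from_document(document)
```

Note that the returned `url` uses the host name FSCrawler itself sees (here `elasticsearch`, a Docker service name), so a client running outside that network may need to swap in a host it can resolve, such as `localhost`.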

dadoonet
  • Thank you for answering. I needed the file to be indexed as well, and the simulate API doesn't do that. So it seems to me that getting the text both returned and indexed is only possible with debug=true. I want to avoid calling Elasticsearch again to get the text. – Ralle Mc Black Mar 05 '23 at 17:05
  • Right. You need to call Elasticsearch. A GET by id should be super fast. – dadoonet Mar 05 '23 at 20:37
  • The only way today is indeed to use debug=true. – dadoonet Apr 03 '23 at 15:39
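If you go the debug=true route, the text can be read straight from the upload response. The sketch below assumes that the debug response adds a `doc` object with a `content` field next to `ok`, `filename`, and `url`. That shape is an assumption to check against your own response, and `text_from_upload_response` is a hypothetical helper name.

```python
def text_from_upload_response(response):
    """Return the extracted text from an FSCrawler upload response.

    Assumption: with debug=true the response contains doc.content.
    Returns None for a plain (non-debug) response.
    """
    doc = response.get("doc") or {}
    return doc.get("content")


# Plain response, as shown in the question: no text available.
plain = {
    "ok": True,
    "filename": "JAVASCRIPT.pdf",
    "url": "https://elasticsearch:9200/idx/_doc/337d3e366ce4b765f650c5a87011e117",
}

# Hypothetical debug response; the content value is made up for illustration.
debug = dict(plain, doc={"content": "sample extracted text"})

print(text_from_upload_response(plain))
print(text_from_upload_response(debug))
```

Returning None for a non-debug response lets the caller fall back to a GET by id when the field is absent.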