Questions tagged [fscrawler]

For everything related to the FSCrawler project.

40 questions
2
votes
0 answers

How to ingest .doc / .docx files in elasticsearch?

I'm trying to index word documents in my elasticsearch environment. I tried using the elasticsearch ingest-attachment plugin, but it seems like it's only possible to ingest base64 encoded data. My goal is to index whole directories with word files.…
2
votes
1 answer

Index pdf files to AWS Elasticsearch service using Elasticsearch File System Crawler

I can index PDF files to a local Elasticsearch using Elasticsearch File System Crawler. The default fscrawler settings have port, host and scheme parameters as shown below. { "name" : "job_name2", "fs" : { "url" : "/tmp/es", "update_rate" :…
Fisseha Berhane
  • 2,533
  • 4
  • 30
  • 48
1
vote
1 answer

Dockerized elasticsearch and fscrawler: failed to create elasticsearch client, disabling crawler… Connection refused

I received the following error when attempting to connect Dockerized fscrawler to Dockerized elasticsearch: [f.p.e.c.f.c.ElasticsearchClientManager] failed to create elasticsearch client, disabling crawler… [f.p.e.c.f.FsCrawler] Fatal error…
user2514157
  • 545
  • 6
  • 24
1
vote
1 answer

Is there a way to check which pdf strategy FSCrawler will use?

I am using FSCrawler's REST feature to scan PDFs as they are uploaded. I'm currently using the ocr_and_text pdf strategy; however, OCR takes too long for the user to wait for a response. I would like to send the pdf to fscrawler synchronously to use…
koopmac
  • 936
  • 10
  • 27
1
vote
0 answers

FScrawler: perform OCR selectively only on PDF files that do not have text

I'm using FScrawler (2.7) to load text from PDFs into Elasticsearch (7.6.X). Most of the PDF files have text, but some contain images of scanned text and need to be OCRed. Is there a way to configure FScrawler so that it performs OCR…
Paul
  • 11
  • 3
1
vote
1 answer

Indexing 7TB of data with elasticsearch. FScrawler stops after some time

I am using fscrawler to index more than 7TB of data. Indexing starts fine but then stops when the index size reaches 2.6GB. I believe this is a memory issue; how do I configure the memory? My machine has 40GB of memory and I have assigned 12GB…
Denn
  • 447
  • 1
  • 6
  • 27
1
vote
1 answer

fscrawler 2.3 with elasticsearch 5.5 getting error string index out of range

I have Elasticsearch 5.5 with x-pack working without any issue. But when I try to use fscrawler 2.3 on a folder, I get this error: WARN [f.p.e.c.f.FsCrawlerImpl] Error while crawling c:/tmp/es: String index out of range: -1 What am I doing wrong?
Batrevenge
  • 11
  • 2
0
votes
1 answer

The Elasticsearch client version [7] is not compatible with the Elasticsearch cluster version [8.8.2]

I have upgraded Elasticsearch from 7.17.11 to 8.8.2. # curl localhost:9200 { "name" : "test.example.com", "cluster_name" : "es_master01", "cluster_uuid" : "U4n0aCHtTdinDZSH5jEcdg", "version" : { "number" : "8.8.2", "build_flavor" :…
Manoj Agarwal
  • 365
  • 2
  • 17
0
votes
0 answers

Fscrawler logs in Kubernetes and logstash

I have a Kubernetes Fscrawler deployment with several instances. The logs are mapped to a Persistent Volume. I also have Elastic Stack 8 with Logstash. What I would like to do is send the logs from the different Fscrawler instances to Logstash to have a…
Ralle Mc Black
  • 1,065
  • 1
  • 8
  • 16
0
votes
1 answer

Push custom fields to metadata of PDF using fscrawler

I am using fscrawler to index PDF documents using the following command: /usr/bin/fscrawler --config_dir /home/user1/conf test_index --restart --loop 1 The PDF metadata is indexed. I want to add custom fields to the PDF metadata and…
Manoj Agarwal
  • 365
  • 2
  • 17
0
votes
1 answer

Using fallback font 'LiberationSans' for 'CourierNew,Italic' warning with fscrawler v2.9

I am running fscrawler on two different CentOS 7.8 machines. On one machine, I get the following warning when running fscrawler: 13:03:28,449 WARN [o.a.p.p.f.PDTrueTypeFont] Using fallback font 'LiberationSans' for 'CourierNew,Italic' Whereas on…
Manoj Agarwal
  • 365
  • 2
  • 17
0
votes
1 answer

No SLF4J providers were found warning with fscrawler 2.10

I have upgraded fscrawler from 2.9 to 2.10. I ran the same indexing command that I used with the older version: /usr/bin/fscrawler --config_dir /home/user1/conf test_index --restart --loop 1 I see the following warning about SLF4J: SLF4J:…
Manoj Agarwal
  • 365
  • 2
  • 17
0
votes
0 answers

Fscrawler configuration

Hi, I am launching Fscrawler with Elasticsearch and Kibana inside Docker containers and I am getting the following error: fscrawler | Exception in thread "main" java.util.NoSuchElementException fscrawler | at…
0
votes
1 answer

fscrawler get extracted text in restapi response

I implemented fscrawler with elasticsearch. REST is enabled. I can post a file to fscrawler and the text is correctly extracted and put in the elasticsearch index. I can verify that with Kibana. I'm not able to get the extracted text in the…
Ralle Mc Black
  • 1,065
  • 1
  • 8
  • 16
0
votes
1 answer

FSCrawler docker-compose NoSuchElementException

I am trying to run FSCrawler via docker-compose following the steps described in https://fscrawler.readthedocs.io/en/fscrawler-2.9/installation.html#using-docker-compose. ELASTIC_VERSION = "7.17.8" FSCRAWLER_VERSION = "2.9" PWD = "" I verified that…
Fried
  • 41
  • 2