Questions tagged [manifoldcf]

Apache ManifoldCF is an open source connector framework for website and enterprise search engines.

Apache ManifoldCF is an effort to provide an open source framework for connecting source content repositories, such as Microsoft SharePoint and EMC Documentum, to target repositories or indexes, such as Apache Solr, OpenSearchServer, or ElasticSearch. Apache ManifoldCF also defines a security model for target repositories that permits them to enforce source-repository security policies.

Currently included connectors support FileNet P8 (IBM), Documentum (EMC), LiveLink (OpenText), Meridio (Autonomy), Windows shares (Microsoft), and SharePoint (Microsoft). Also included are a general CMIS connector, a generic file system connector, a general JDBC connector, an RSS feed connector, a Wiki connector, a DropBox connector, an email connector, and a general web connector. Currently supported targets include Apache Solr, QBase (formerly MetaCarta) GTS, OpenSearchServer, and ElasticSearch.

30 questions
18
votes
3 answers

How to crawl a website that has SAML authentication using ManifoldCF or Nutch?

I am trying to use ManifoldCF to crawl a website, more specifically a Google Site that has SAML authentication, and index the crawled data into Apache Solr. But when I crawl the URL, it gives me a 302 redirect to the login page and then says…
Saurabh Chaturvedi
8
votes
2 answers

Apache ManifoldCF. Unable to create repository connection to FileNet

I am trying to connect to FileNet from ManifoldCF without any success. The error I got is Connection status: Connection temporarily failed: Connection refused to host: 127.0.0.1; nested exception is: java.net.ConnectException: Connection refused:…
duvo
2
votes
1 answer

SessionException occurs when crawling with SolrCloud

I am using SolrCloud 6.1.0 and trying to crawl with ManifoldCF 2.4, but it does not work. The execution environment is as follows. Java: 1.8 (however, it is 1.7 when installing ManifoldCF); ZooKeeper: 3.4.9. If I start a job with ManifoldCF, I can crawl the…
bunji
2
votes
1 answer

How to maintain lastAccessTime using ManifoldCF

I am using the ManifoldCF Windows file share connector to crawl files, but ManifoldCF also updates the lastAccessTime of every file that it reads. I want to read all files without updating their lastAccessTime. Which files in ManifoldCF do I need to…
praddy
2
votes
2 answers

Is ManifoldCF a good option for Google Drive indexing?

I am using the Apache ManifoldCF open source project to index documents from Google Drive into my Solr. I have often seen that it is quite inconsistent in indexing the data. It also takes time for even a small number of documents to show up in Solr. Do you…
Saurabh Chaturvedi
1
vote
0 answers

Do I need to configure Authorities in ManifoldCF?

On Apache ManifoldCF I have configured a CMIS Repository Connector which accesses only one document repository. During the configuration phase, I provided the administrator user and password. I use this CMIS Repository Connector in two Jobs…
user9038848
1
vote
0 answers

Word / PDF document snippet rendering in search

I'm interested in building a software system which will connect to various document sources, extract the content from the documents contained within each source, and make the extracted content available to a search engine such as Elastic or Solr.…
user2245766
1
vote
1 answer

Apache ManifoldCF TIKA

I am trying to extract the text content of a PDF using the Apache Tika integration in Apache ManifoldCF, in order to ingest some PDF files from my laptop into an Elasticsearch server. After properly creating the Tika Transformer and configuring it…
Valerio Storch
1
vote
1 answer

Crawling Jira with ManifoldCF and Solr - String index out of range

I am using ManifoldCF v2.7.1 and Solr v5.2.1 and trying to crawl Jira using the Jira connector, and I am getting the following error in ManifoldCF: Error: Repeated service interruptions - failure processing document: Error from server at…
pinkninja
1
vote
1 answer

ManifoldCF job works, but freezes after a few seconds

I have installed ManifoldCF, its connectors, and Postgres. I have 2 jobs on my ManifoldCF: a LocalFile job to an external Solr in production, and a JCIFS job to a local Solr. In this image, you can see the issue. I can start the jobs and they index…
MaxenceS
1
vote
1 answer

Extract file content with ManifoldCF

I'm trying to use ManifoldCF with the File System Connector. It works like a charm: with the Tika content extractor implemented, I get all the expected metadata from my documents. But... how do I configure ManifoldCF in order to get the equivalent…
GoUeDaRd
1
vote
0 answers

Indexing ACL ManifoldCF + ElasticSearch + CMIS

I need to index ACLs in ElasticSearch using ManifoldCF and the CMIS connector. I have added a CMIS authority connector with these parameters: Name: EVERYONE_AUTHORITY Description: Authority type: CMISAuthorityConnector Max connections: 10 Authority…
Pawel Czech
1
vote
1 answer

Add a custom parameter to Solr while using Spring Data Solr

Is it possible to add an additional parameter to a Solr query using Spring Data Solr that generates the following request? "params": { "indent": "true", "q": "*.*", "_": "1430295713114", "wt": "java", "AuthenticatedUserName":…
virgium03
1
vote
2 answers

Is there an Amazon S3 connector available for ManifoldCF?

I would like to crawl an Amazon S3 bucket using ManifoldCF and relay the crawl to OpenSearchServer. I've seen other products carry an Amazon S3 connector and I'm just wondering if there is a publicly available one for ManifoldCF.
Mdalz
1
vote
1 answer

How to get "Document status" data through REST API with Apache ManifoldCF

We're using Apache ManifoldCF. In the Admin UI there's a report at Status Reports -> Document Status. Is it possible to get that content through ManifoldCF's RESTful API? The closest thing I've found is org.apache.manifoldcf.crawler.RunDocumentStatus…
Touko