3

I have configured a multi-core SolrCloud setup and created a collection with 2 shards and no replication.

[Screenshot: Cloud view in the Solr UI]

Through the Solr UI at 192.168.1.56:8983, I can run a query and get results.

I want to do the same with pysolr, so I tried running the following:

import pysolr
zookeeper = pysolr.ZooKeeper("192.168.1.56:2181,192.168.1.55:2182")
solr = pysolr.SolrCloud(zookeeper, "random_collection")

The last line cannot find the collection even though it exists. Below is the error trace:

---------------------------------------------------------------------------
SolrError                                 Traceback (most recent call last)
<ipython-input-43-9f03eca3b645> in <module>()
----> 1 solr = pysolr.SolrCloud(zookeeper, "patent_colllection")

/usr/local/lib/python2.7/dist-packages/pysolr.pyc in __init__(self, zookeeper, collection, decoder, timeout, retry_timeout, *args, **kwargs)
   1176 
   1177     def __init__(self, zookeeper, collection, decoder=None, timeout=60, retry_timeout=0.2, *args, **kwargs):
-> 1178         url = zookeeper.getRandomURL(collection)
   1179 
   1180         super(SolrCloud, self).__init__(url, decoder=decoder, timeout=timeout, *args, **kwargs)

/usr/local/lib/python2.7/dist-packages/pysolr.pyc in getRandomURL(self, collname, only_leader)
   1315 
   1316     def getRandomURL(self, collname, only_leader=False):
-> 1317         hosts = self.getHosts(collname, only_leader=only_leader)
   1318         if not hosts:
   1319             raise SolrError('ZooKeeper returned no active shards!')

/usr/local/lib/python2.7/dist-packages/pysolr.pyc in getHosts(self, collname, only_leader, seen_aliases)
   1281         hosts = []
   1282         if collname not in self.collections:
-> 1283             raise SolrError("Unknown collection: %s", collname)
   1284         collection = self.collections[collname]
   1285         shards = collection[ZooKeeper.SHARDS]

SolrError: (u'Unknown collection: %s', 'random_collection')

The Solr version is 6.6.2 and the ZooKeeper version is 3.4.10.

How do I create a connection to a SolrCloud collection?

Pramod Patil
  • Seems pysolr is looking at clusterstate.json - as far as I know, Solr now uses a separate state.json in each collection dir instead. If you browse your zookeeper nodes manually, my guess is that there is no clusterstate.json present, and only state.json for each collection. – MatsLindh Nov 13 '17 at 11:59
  • state.json is there for each collection and it is correct, but clusterstate.json is empty. What should I write in it? @MatsLindh – Pramod Patil Nov 13 '17 at 12:09
  • You shouldn't write anything in it - pysolr should be upgraded to support the new cluster info format. Until pysolr does that, you could drop the Zookeeper integration and make a regular HTTP request to one of the nodes directly and let Solr handle the distribution and knowledge of the cluster state for you. – MatsLindh Nov 13 '17 at 12:22
  • The problem is that clusterstate.json is empty even in the web UI, whereas state.json shows 2 active nodes and their corresponding information. Will making HTTP requests solve the problem, so that pysolr works after some requests? – Pramod Patil Nov 13 '17 at 12:27
  • No, `clusterstate.json` is no longer used. `pysolr` should add support for reading `state.json` for each collection instead. If you ignore the zookeeper-support in pysolr and use the regular http interface, Solr will route the request for you internally. – MatsLindh Nov 13 '17 at 13:31
  • yes, I got it. thanks @MatsLindh – Pramod Patil Nov 14 '17 at 04:40
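
To check the point made in the comments above (an empty `clusterstate.json` next to a populated `state.json` per collection), here is a minimal sketch that summarizes both locations. The function name `summarize_cluster_state` is made up for illustration, and `zk` is assumed to be a connected kazoo-style client that provides `exists`, `get`, and `get_children`.

```python
import json


def summarize_cluster_state(zk):
    """Report where collection state lives in ZooKeeper.

    `zk` is assumed to be a connected kazoo-style client exposing
    exists(path), get(path) -> (bytes, stat) and get_children(path).
    """
    summary = {"clusterstate_collections": [], "per_collection_state": []}

    # Legacy location: one shared file holding every collection.
    if zk.exists("/clusterstate.json"):
        data, _stat = zk.get("/clusterstate.json")
        legacy = json.loads(data.decode("utf-8")) if data else {}
        summary["clusterstate_collections"] = sorted(legacy)

    # Newer location: one state.json under each collection node.
    if zk.exists("/collections"):
        for name in sorted(zk.get_children("/collections")):
            if zk.exists("/collections/{}/state.json".format(name)):
                summary["per_collection_state"].append(name)

    return summary


# Possible usage with kazoo (needs a reachable ZooKeeper):
#   from kazoo.client import KazooClient
#   zk = KazooClient(hosts="192.168.1.56:2181")
#   zk.start()
#   print(summarize_cluster_state(zk))
#   zk.stop()
```

If the first list comes back empty while the second one contains your collection, you are most likely in the situation described here: the state exists, just not where pysolr looks for it.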

3 Answers

3

Pysolr currently does not support an external ZooKeeper cluster. It looks for collections in clusterstate.json, which Solr has replaced with a separate state.json for each collection; clusterstate.json is left empty.

To solve your problem for a single collection, you can hard-code the ZooKeeper.CLUSTER_STATE variable in pysolr.py as follows:

ZooKeeper.CLUSTER_STATE = '/collections/random_collection/state.json'

pysolr.py can be found in /usr/local/lib/python2.7/dist-packages. After editing it, you may need to reinstall it, for example with

pip install -e /usr/local/lib/python2.7/dist-packages/pysolr.py
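
The same override might also be applied at runtime instead of editing the installed file. This is a sketch under one assumption: that pysolr reads the `ZooKeeper.CLUSTER_STATE` class attribute when a `ZooKeeper` object is created, so the value has to be set before that. The helper names `state_json_path` and `connect_solr_cloud` are hypothetical.

```python
def state_json_path(collection):
    """Return the per-collection state node path used by newer Solr versions."""
    return "/collections/{}/state.json".format(collection)


def connect_solr_cloud(zk_hosts, collection):
    """Open a pysolr.SolrCloud connection for a single collection.

    Assumption: pysolr looks up ZooKeeper.CLUSTER_STATE while the
    ZooKeeper object is being constructed, so the override comes first.
    """
    import pysolr  # imported lazily; requires pysolr and kazoo to be installed

    # Point pysolr at this collection's state.json instead of /clusterstate.json.
    pysolr.ZooKeeper.CLUSTER_STATE = state_json_path(collection)
    zookeeper = pysolr.ZooKeeper(zk_hosts)
    return pysolr.SolrCloud(zookeeper, collection)


# Possible usage (needs a reachable ZooKeeper and Solr):
#   solr = connect_solr_cloud("192.168.1.56:2181,192.168.1.55:2182", "random_collection")
#   results = solr.search("*:*")
```

Because the override is set on the class, it only covers one collection per process, the same limitation as the hard-coded version.
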
Ashutosh
  • pysolr supports an external zookeeper cluster, doesn't it? The IPs you give it may not be the same as the Solr servers? The only change is that Solr has migrated to a per collection cluster state, which your hack neatly solves :-) – MatsLindh Nov 14 '17 at 08:25
1

A regular HTTP client works well even with SolrCloud.

This was tested with Solr 7.5 and PySolr 3.9.0:

import pysolr

solr_url="https://my.solr.url"
collection = "my_collection"
solr_connection = pysolr.Solr("{}/solr/{}".format(solr_url, collection), timeout=10)
results = solr_connection.search(...)

print(results.docs)
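
If you want to skip pysolr entirely, the same request can be sketched with the standard library alone. This assumes Python 3 and the default `/select` handler; the names `build_select_url` and `search` are made up for illustration, and the host in the usage comment is the one from the question.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, collection, query, rows=10):
    """Build a URL for the collection's /select handler on any one node."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "{}/solr/{}/select?{}".format(base_url.rstrip("/"), collection, params)


def search(base_url, collection, query, rows=10, timeout=10):
    """Send the query to one node; Solr routes it across the shards itself."""
    url = build_select_url(base_url, collection, query, rows)
    with urlopen(url, timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return payload["response"]["docs"]


# Possible usage (needs a reachable Solr node):
#   docs = search("http://192.168.1.56:8983", "random_collection", "*:*")
```
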
arghtype
0

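A better hack is to load the state of every collection in a generic way:

import json

import pysolr

zookeeper = pysolr.ZooKeeper("ZK_STRING")  # placeholder: your ZooKeeper connection string

# Merge each collection's state.json into a single dict keyed by collection name.
collections = {}
for c in zookeeper.zk.get_children("/collections"):
    data, _stat = zookeeper.zk.get("/collections/{}/state.json".format(c))
    collections.update(json.loads(data.decode("utf-8")))

# Replace the (empty) collection map that pysolr built from clusterstate.json.
zookeeper.collections = collections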
lsalamon