60

I would like to do the equivalent of this SQL but with Solr as my data store.

SELECT
   DISTINCT txt
FROM
   my_table;

What syntax would force Solr to only give me distinct values?

http://localhost:8983/solr/select?q=txt:?????&fl=txt

EDIT: So faceted searching seems to fit, but as I investigated it, I realized I had only detailed half of the problem.

My SQL query should have read...

SELECT
   DISTINCT SUBSTR(txt,0,3)
FROM
   my_table;

Any possibility of this with Solr?

dacracot
  • 22,002
  • 26
  • 104
  • 152

6 Answers6

79

Faceting would get you a results set that contains distinct values for a field.

E.g.

http://localhost:8983/solr/select/?q=*%3A*&rows=0&facet=on&facet.field=txt

You should get something back like this:

<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="4" start="0"/>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="txt">
        <int name="value">100</int>
        <int name="value1">80</int>
        <int name="value2">5</int>
        <int name="value3">2</int>
        <int name="value4">1</int>
  </lst>
 </lst>
</lst>
</response>

Check out the wiki for more information. Faceting is a really cool part of solr. Enjoy :)

http://wiki.apache.org/solr/SimpleFacetParameters#Facet_Fields

Note: Faceting will show the indexed value, I.e. after all the filters have been applied. One way to get around this is to use the copyfield method, so that you can create a facet version of the txt field. THis way your results will show the original value.

Lots of documentation on faceting available on the wiki. Or I did write some with screen shots.. which you can check out here:

http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html

starball
  • 20,030
  • 7
  • 43
  • 238
CraftyFella
  • 7,520
  • 7
  • 47
  • 61
  • I got to this answer googling for solr distinct. Need multiple fields though. Something like distinct country, user. – HMR Sep 10 '15 at 07:44
  • Solved the multi field by indexing a multi field (field1+"."+field2). Now I need to know how many distinct field1, field2 there are. There are over 2 milion field1.field2. The following gave me a hint:http://yonik.com/solr-count-distinct/ not sure how to do this without json but it would look something like: `q=type:1&start=0&rows=0&json.facet={uniquecount:"unique(field1field2_s)"}` – HMR Sep 11 '15 at 02:45
  • 1
    How to retrieve all (i.e. >100) the unique values in a field? – RanonKahn May 30 '18 at 18:13
  • This won't work if we facet on a field that is been tokenized. – Scott Chu Apr 01 '21 at 09:07
25

For the DISTINCT part of your question, I think you may be looking for Solr's field collapsing / grouping functions. It will enable you to specify a field you want unique results from, create a group on those unique values and it will show you how many documents are that group.

You can then use the same substr stored in a separate field, and collapse on that.

Antony Stubbs
  • 13,161
  • 5
  • 35
  • 39
  • Depending on the field type, you might be able to use (left-anchored) wildcards. But as is often the case, the best solution with any index technology is to anticipate the question. I.E., if you know you are routinely going to search for SUBSTR(txt,0,3), then you create a field in the index populated by that substring. If instead the "txt" and substring length are variable, then n-gram or other processing is the best route. – Joe Atzberger Nov 18 '13 at 21:18
  • 3
    Group or field collapsing can't be used on multivalued field as I understand! – Scott Chu Jun 07 '14 at 10:28
9

Use the StatsComponent with parameter stats.calcdistinct to get a list of distinct values for a certain field:

Solr 7 https://lucene.apache.org/solr/guide/7_7/the-stats-component.html

Solr 6 https://cwiki.apache.org/confluence/display/solr/The+Stats+Component

It will also give you the count of distinct values. stats.calcdistinct is probably available since 4.7.

http://wiki.apache.org/solr/StatsComponent is outdated as it does not cover stats.calcdistinct

Example

/select?stats=on&stats.field=region&rows=0&stats.calcdistinct=true

"stats":{
  "stats_fields":{
    "region":{
      "min":"GB",
      "max":"GB",
      "count":20276,
      "missing":0,
      "distinctValues":["GB"],
      "countDistinct":1}}}}

Difference to Facets

In case of facets you need to know the count to request all, or you set the facet.limit to something really high and count the result yourself. Also, you need a string field for making facets work the way you need it here.

Risadinha
  • 16,058
  • 2
  • 88
  • 91
4

I would store the substring in a different field (let's call in txt_substring), then facet on txt_substring as CraftyFella showed.

Normally I'd use the n-gram tokenizer, but I don't think you can facet on that.

Mauricio Scheffer
  • 98,863
  • 23
  • 192
  • 275
3

Solr 5.1 and later has the new Facet Module that has integrated support for finding the number of unique values in a field. You can even find the number of unique values in a field for each bucket of a facet, and sort by that value to find the highest or lowest number of unique values.

Number of unique values in "myfield": json.facet={x:'unique(myfield)'}

Facet by "category" field, and for each category, show the number of unique values in "color":

json.facet={
  cat_breakdown : { terms : {  // group results by unique values of "category"
    field : category,
    facet : {
      x : "unique(color)",  // for each category, find the number of unique colors
      y : "avg(price)"      // for each category, find the average price
    }
  }}
}

This is in Solr 5.1 and later. More facet functions like "unique" are shown at http://yonik.com/solr-facet-functions/

Yonik
  • 2,341
  • 1
  • 18
  • 14
1

Best way to find the number of unique values in "myfield", using the JSON API:

http://YourCollectionAddress/select?json
={query:'\*:\*',limit:0,facet:{distinctCount:'unique(myfield)'}}
Abhinav Saxena
  • 1,990
  • 2
  • 24
  • 55
Sharif Shahriar
  • 114
  • 1
  • 5
  • this works for counting how many different field values there are but it does not return them – Suzana May 05 '20 at 18:14