
Let's take the traffic camera case as the example. Say I have a huge dataset of traffic camera records, which look like: plate_no | camera_id | city | location | direction | timestamp | etc.

I'd like to get the results that meet all of the following (i.e. condition 1 combined with each of the three time windows):

  1. location like '%George Street'
  2. Oct 1 < timestamp < Oct 3
  3. Oct 4 < timestamp < Oct 5
  4. Oct 15 < timestamp < Oct 20

The approach we are taking now is to index the data in SolrCloud, get three result sets (1&2, 1&3, 1&4), load them as RDDs with Spark SQL, and at the end use Spark to intersect the three result sets. Finally, you get the cars that showed up around George Street in Oct 1-3, Oct 4-5 and Oct 15-20. But the performance is not ideal: it takes a lot of time for Spark SQL to query Solr and form the RDDs. We have only tested with around 0.5 billion records, and such a query costs 1-2 minutes. It gets worse as we grow the dataset. We are looking for something around 30 seconds.
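Roughly, the pipeline looks like the sketch below. This is a minimal sketch assuming the LucidWorks spark-solr connector on a Spark 1.x shell; the zkhost, collection name, field names and the October 2015 dates are illustrative, not our exact setup.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Load one time window from SolrCloud via the spark-solr connector.
// zkhost, collection and field names here are illustrative only;
// sqlContext is the one provided by the spark-shell.
def loadWindow(sqlContext: SQLContext, from: String, to: String): DataFrame = {
  // Leading-wildcard match on location plus a timestamp range filter;
  // "\\ " escapes the space in the location term for the Solr parser.
  val query = "location:*George\\ Street AND timestamp:[" + from + " TO " + to + "]"
  sqlContext.read.format("solr")
    .option("zkhost", "zk1:2181/solr")        // illustrative ZK ensemble
    .option("collection", "traffic_cameras")  // illustrative collection
    .option("query", query)
    .load()
}

val w1 = loadWindow(sqlContext, "2015-10-01T00:00:00Z", "2015-10-03T00:00:00Z") // 1 & 2
val w2 = loadWindow(sqlContext, "2015-10-04T00:00:00Z", "2015-10-05T00:00:00Z") // 1 & 3
val w3 = loadWindow(sqlContext, "2015-10-15T00:00:00Z", "2015-10-20T00:00:00Z") // 1 & 4

// Cars that showed up around George Street in all three windows.
val plates = w1.select("plate_no")
  .intersect(w2.select("plate_no"))
  .intersect(w3.select("plate_no"))
```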

Just wondering what would be the right tool for this task, or whether there is a better approach than this.

Shengjie
  • Are you using [spark-solr connector](https://github.com/LucidWorks/spark-solr)? How big are the resulting dataframes from each individual date? How long to get those dataframes (date & location)? The final intersection should be cheap. – KrisP Jan 10 '16 at 18:50
  • @KrisP yes, we are using spark-solr; approximately 0.2-0.5 million records from each individual date. The final intersection is fast enough. Going from the Solr query to the formed RDD takes almost 30 secs~ – Shengjie Jan 11 '16 at 02:19
  • Are you using leading wildcards? http://stackoverflow.com/questions/11766351/understanding-lucene-leading-wildcard-performance – Karsten R. Jan 11 '16 at 15:35
  • @KarstenR. yes, wildcard. – Shengjie Jan 11 '16 at 17:28
  • In this case you have to index and search with ReversedWildcardFilter: http://stackoverflow.com/questions/30597947/how-does-reversedwildcardfilterfactory-speed-up-wildcard-searches – Karsten R. Jan 11 '16 at 17:40

1 Answer


Finally, we found a reasonable solution for this: we stored our Hive table in Parquet + Snappy format, then used Spark SQL to query the different time periods and fetch the different result sets (1&2, 1&3, 1&4). At the end, we did an intersect on the result sets.
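Writing the table down that way can look like this (a minimal sketch; `records` is an assumed DataFrame holding the raw data, and the table name is illustrative):

```scala
// Snappy-compressed Parquet for the Hive table; "records" and the
// table name are illustrative.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
records.write.format("parquet").saveAsTable("traffic_records")
```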

We got rid of Solr completely in this case. The performance is quite good in the end.

e.g. Spark SQL does support LIKE queries and INTERSECT, as sketched below.
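A sketch of the final query (column and table names are illustrative, and the dates assume October 2015):

```scala
// Plates matching the location pattern in all three time windows.
val plates = sqlContext.sql("""
  SELECT plate_no FROM traffic_records
  WHERE location LIKE '%George Street'
    AND `timestamp` > '2015-10-01' AND `timestamp` < '2015-10-03'
  INTERSECT
  SELECT plate_no FROM traffic_records
  WHERE location LIKE '%George Street'
    AND `timestamp` > '2015-10-04' AND `timestamp` < '2015-10-05'
  INTERSECT
  SELECT plate_no FROM traffic_records
  WHERE location LIKE '%George Street'
    AND `timestamp` > '2015-10-15' AND `timestamp` < '2015-10-20'
""")
```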

Shengjie