Questions tagged [shark-sql]

Shark has been subsumed by Spark SQL. It was an open source distributed SQL query engine for Hadoop data that brought state-of-the-art performance and advanced analytics to Hive users.

59 questions
32
votes
5 answers

How to make shark/spark clear the cache?

When I run my Shark queries, memory gets hoarded in main memory. This is my top command result: Mem: 74237344k total, 70080492k used, 4156852k free, 399544k buffers Swap: 4194288k total, 480k used, 4193808k free, 65965904k…
venkat
  • 335
  • 1
  • 3
  • 7
15
votes
3 answers

Is LIMIT clause in HIVE really random?

The Hive documentation notes that the LIMIT clause returns rows chosen at random. I have been running a SELECT on a table with more than 800,000 records with LIMIT 1, but it always returns the same record. I'm using the Shark distribution,…
visakh
  • 2,503
  • 8
  • 29
  • 55
15
votes
2 answers

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

I would like to hear your thoughts and experiences on the usage of CQL and the in-memory query engine Spark/Shark. From what I know, the CQL processor runs inside the Cassandra JVM on each node. The Shark/Spark query processor attached with a Cassandra…
Minh Do
  • 329
  • 3
  • 7
9
votes
1 answer

UDF not working in Spark SQL

I'm trying to calculate the Jaccard index in Spark SQL. My table in Hive has the following data: hive> select * from test_1; 1 ["rock","pop"] 2 ["metal","rock"] Table DDL: create table test_1 (id int, val array<string>); I'm using the UDF from…
visakh
  • 2,503
  • 8
  • 29
  • 55
7
votes
6 answers

Connect to Spark SQL via ODBC

According to this page: https://spark.apache.org/sql/ you can connect existing BI tools to Spark SQL via ODBC or JDBC: I don't mean Shark as this is basically EOL: It is for this reason that we are ending development in Shark as a separate project…
Chris Matta
  • 3,263
  • 3
  • 35
  • 48
7
votes
1 answer

Spark Streaming historical state

I am building real-time processing to detect fraudulent ATM card transactions. To detect fraud efficiently, the logic requires the last transaction date by card and the sum of transaction amounts by day (or last 24 hrs). One use case is if card…
Jigar Parekh
  • 6,163
  • 7
  • 44
  • 64
4
votes
1 answer

Is it possible to run Shark queries over Spark Streaming data?

Is it possible to run Shark queries over the data contained in the DStreams of a Spark Streaming application (for instance, inside a foreachRDD call)? Is there a specific API to do that? Thanks.
gprivitera
  • 933
  • 1
  • 8
  • 22
4
votes
1 answer

shark/spark throws NPE when querying a table

The development part of the shark/spark wiki is really brief, so I tried to put together some code to query a table programmatically. Here it is ... object Test extends App { val master = "spark://localhost.localdomain:8084" val jobName =…
3
votes
1 answer

Accessing Shark tables (Hive) from Scala (shark-shell)

I have shark-0.8.0 which runs on hive-0.9.0. I am able to program on Hive by invoking shark. I created a few tables and loaded them with data. Now, I am trying to access the data from these tables using Scala. I invoked the Scala shell using…
visakh
  • 2,503
  • 8
  • 29
  • 55
2
votes
2 answers

Datastax DSE Cassandra, Spark, Shark, Standalone Program

I use Datastax Enterprise 4.5. I hope I did the config right; I followed the explanation on the Datastax website. I can write into the Cassandra DB with a Windows service, and this works, but I can't query with Spark using the where function. I start the…
richie676
  • 21
  • 3
2
votes
1 answer

Improving write performance in Hive

I am performing various calculations (using UDFs) on Hive. The computations are fast enough, but I am hitting a roadblock with the write performance in Hive. My result set is close to ten million records, and it takes a few minutes to write…
visakh
  • 2,503
  • 8
  • 29
  • 55
2
votes
1 answer

Loading multiple JSON records from one file to HIVE

I am trying to load JSON files into Hive using JSON Serde. I am able to get it working for one JSON file at a time, but I was wondering whether it's possible to have more than one record in a JSON file at a time and get them loaded in one shot. To…
visakh
  • 2,503
  • 8
  • 29
  • 55
2
votes
2 answers

How many Shark servers are necessary in relation to Spark?

I'm new to Spark/Shark and have spun up a cluster with three Spark workers. I started installing Shark on the same three servers but I'm coming to the conclusion that maybe that's not needed and only one Shark server is necessary -- I can't find…
Bill
  • 347
  • 4
  • 13
2
votes
1 answer

Integrating Cassandra and Shark

I am trying to get Shark working on Cassandra, so that I can pull the data from Cassandra into Shark and run queries. I used the CASH open source storage handler; it seems to work when I run Shark locally, but in distributed mode it looks like the Spark slaves…
2
votes
1 answer

Has anyone been successful running Apache Spark & Shark on Cassandra?

I am trying to configure a 5-node Cassandra cluster to run Spark/Shark to test out some Hive queries. I have installed Spark, Scala, and Shark, and configured them according to Amplab [Running Shark on a cluster] …
kwasbob
  • 47
  • 2
  • 6