Questions tagged [qubole]

Qubole Data Service (QDS) is a cloud Big Data service running on an elastic Hadoop-based cluster.

Source: Creators of Facebook’s Big Data infrastructure and Apache Hive have leveraged their experience to deliver Qubole Data Service (QDS), a cloud Big Data service offering the same advanced capabilities used by Big Data savvy organizations.

Minimize operational interaction and provide your data analysts with an easy to use graphical interface, built-in connectors, and seamless, elastic cloud infrastructure.

Your Hadoop cluster is ready within minutes post signup, letting you focus on building sophisticated data pipelines, running queries, scheduling jobs and monetizing your big data.

An auto-scaling cluster, improved I/O optimization, faster queries and support for hybrid pricing - realize cost savings of as much as 50%-60% in total, while accomplishing tasks faster.

87 questions
6
votes
1 answer

Stratified Sampling in Hive

The following returns a 10% sample of the A and X columns stratified by the values of X. select A, X from( select A, count(*) over (partition by X) as cnt, rank() over (partition by X order by rand()) as rnk from my_table)…
Amelio Vazquez-Reina
  • 91,494
  • 132
  • 359
  • 564
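
A minimal sketch of the same idea outside Hive, assuming rows are plain Python dicts: group by the stratum column, shuffle each group, and keep a fixed fraction of it. The function name `stratified_sample`, the `seed` parameter, and the sample data are hypothetical and not taken from the question; the truncated part of the query is not reconstructed here.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Return roughly `fraction` of the rows from each stratum of `key`."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    sample = []
    for members in groups.values():
        shuffled = members[:]          # copy so the input is not mutated
        rng.shuffle(shuffled)          # plays the role of ORDER BY rand()
        keep = int(len(shuffled) * fraction)
        sample.extend(shuffled[:keep])
    return sample


# Hypothetical data: 100 rows in stratum "a", 50 rows in stratum "b".
rows = [{"A": i, "X": "a" if i < 100 else "b"} for i in range(150)]
sample = stratified_sample(rows, key="X", fraction=0.10)
```

Each stratum contributes in proportion to its own size, which is what partitioning by X achieves in the windowed query.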
5
votes
1 answer

How to kill hadoop job gracefully/intercept `hadoop job -kill`

My Java application runs on a mapper and creates child processes using the Qubole API. The application stores the child Qubole query IDs. I need to intercept the kill signal and shut down the child processes before exit. hadoop job -kill jobId and yarn application -kill…
leftjoin
  • 36,950
  • 8
  • 57
  • 116
4
votes
1 answer

Divide Spark DataFrame data into separate files

I have the following DataFrame input from an S3 file and need to transform the data into the following desired output. I am using Spark version 1.5.1 with Scala, but could change to Spark with Python. Any suggestions are welcome. DataFrame…
satoukum
  • 1,188
  • 1
  • 21
  • 31
3
votes
0 answers

Fetch all Column Statistics using Single Query Hive

I understand that all the column statistics can be computed for a Hive table using the command- ANALYZE TABLE Table1 COMPUTE STATISTICS; Then Specific column level stats can be fetched through the command - DESCRIBE FORMATTED…
Abhi Nandan
  • 195
  • 3
  • 11
3
votes
1 answer

Insert into ElasticSearch using Hive/Qubole

I am trying to insert data into Elasticsearch from a Hive table. CREATE EXTERNAL TABLE IF NOT EXISTS es_temp_table ( dt STRING, text STRING ) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' …
stogers
  • 259
  • 2
  • 12
2
votes
1 answer

How do you write a presto query to split a string into its own column

Trying to split a string into multiple columns in Qubole using a Presto query. {"field0":[{"startdate":"2022-07-13","lastnightdate":"2022-07-16","adultguests":5,"childguests":0,"pets":null}]} Would like startdate,lastnightdate,adultguests,childguests…
Abe
  • 23
  • 3
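
To make the target shape concrete, here is a small sketch in Python rather than Presto: parse the JSON string and pull the named keys out of the first element of `field0`. The helper name `to_columns` is hypothetical, and only the four column names visible in the excerpt are used, since the rest of the list is truncated.

```python
import json

# The sample string from the question excerpt.
raw = (
    '{"field0":[{"startdate":"2022-07-13","lastnightdate":"2022-07-16",'
    '"adultguests":5,"childguests":0,"pets":null}]}'
)

# Only the columns visible in the excerpt; the original list is truncated.
COLUMNS = ["startdate", "lastnightdate", "adultguests", "childguests"]


def to_columns(payload, columns=COLUMNS):
    """Map the first object under "field0" to one value per column name."""
    first = json.loads(payload)["field0"][0]
    return {name: first.get(name) for name in columns}


row = to_columns(raw)
print(row)
```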
2
votes
1 answer

need regexp_extract help, beginner

I have a string column "49b8b35e-b62c-4a42-9d73-192d131d127a,03c8a7e0-5153-11ec-873a-0242ac11000a,eec8aee4-0500-4940-b319-15924cc2d248". This string column has 3 values separated by "," (value1,value2,value3). There are no guarantees that value2 and…
ajk
  • 21
  • 1
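
A sketch of one way to express this with a regular expression, written in Python rather than as a Hive or Presto `regexp_extract` call. It assumes the truncated sentence means that the later values may be missing; the pattern and the helper name `split_values` are illustrative only.

```python
import re

# One mandatory value followed by up to two optional comma-separated values.
PATTERN = re.compile(r"^([^,]*)(?:,([^,]*))?(?:,([^,]*))?$")


def split_values(text):
    """Return (value1, value2, value3); absent values come back as None."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError("expected at most three comma-separated values")
    return match.groups()


full = (
    "49b8b35e-b62c-4a42-9d73-192d131d127a,"
    "03c8a7e0-5153-11ec-873a-0242ac11000a,"
    "eec8aee4-0500-4940-b319-15924cc2d248"
)
print(split_values(full))
print(split_values("49b8b35e-b62c-4a42-9d73-192d131d127a"))
```

Making the second and third groups optional is what lets a single pattern cover rows with one, two, or three values.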
2
votes
1 answer

Data comparisons in Qubole

I am very new to Qubole. We recently migrated Oracle Ebiz data to Salesforce. We have both Ebiz and Salesforce data in the Qubole Data Lake. There are some discrepancies between Ebiz and Salesforce. What is the technology I can use on Qubole to find…
user2280352
  • 145
  • 11
2
votes
1 answer

Pyspark Logging: Printing information at the wrong log level

Thanks for your time! I'd like to create and print legible summaries of my (hefty) data to my output when debugging my code, but stop creating and printing those summaries once finished to speed things up. I was advised to use logging, which I…
Amit
  • 41
  • 2
  • 6
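
A minimal sketch of the pattern the question is after, using only the standard `logging` module and no Spark: build the costly summary only when the logger is actually at DEBUG level. The names `expensive_summary`, `process`, and `summary_calls` are hypothetical stand-ins for the asker's own code.

```python
import logging

logger = logging.getLogger("summary_demo")
summary_calls = 0  # counts how often the costly summary is actually built


def expensive_summary(data):
    """Stand-in for a slow summary of a large dataset."""
    global summary_calls
    summary_calls += 1
    return {"rows": len(data), "total": sum(data)}


def process(data):
    # Guard the expensive call: it runs only if DEBUG records would be emitted.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("summary: %s", expensive_summary(data))
    return sum(data)


logger.setLevel(logging.INFO)
process([1, 2, 3])    # summary skipped at INFO level
logger.setLevel(logging.DEBUG)
process([1, 2, 3])    # summary built once at DEBUG level
```

The guard matters because passing `expensive_summary(data)` directly to `logger.debug` would still evaluate it, even when the message is later discarded.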
2
votes
1 answer

How to create external tables from parquet files in s3 using hive 1.2?

I have created an external table in Qubole (Hive) which reads Parquet (compressed: snappy) files from S3, but on performing a SELECT * FROM table_name I am getting null values for all columns except the partitioned column. I tried using different…
S.Mehra
  • 56
  • 1
  • 6
2
votes
1 answer

Debug failed shuffles in hadoop map reduces

As the size of the input file increases, I am seeing the number of failed shuffles increase and the job completion time increase non-linearly, e.g. 75GB took 1h, 86GB took 5h. I also see the average shuffle time increase 10-fold, e.g. 75GB 4min, 85GB 41min. Can someone point me…
Jal
  • 2,174
  • 1
  • 18
  • 37
2
votes
2 answers

Fixing java.lang.NoSuchMethodError: com.amazonaws.util.StringUtils.trim

Consider the following error: 2018-07-12 22:46:36,087 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchMethodError: com.amazonaws.util.StringUtils.trim(Ljava/lang/String;)Ljava/lang/String; at…
Jal
  • 2,174
  • 1
  • 18
  • 37
2
votes
1 answer

java.io.FileNotFound exception while writing to apache spark in qubole

I have code in Apache Spark 1.6.3 running on Qubole which writes data to multiple tables (Parquet format) on S3. At the time of writing to the tables I keep getting a java.io.FileNotFound exception. I am even setting:…
2
votes
0 answers

Kafka Connect Hive Integration issue

I am using Kafka Connect for Hive integration to create Hive tables along with partitions on S3. After starting the Connect distributed process and making a POST call to listen to a topic, as soon as there is some data in the topic, I can see in the…
2
votes
1 answer

Median value from table with number:count format

Given a table +------------+-----------+ | Number | Count | +------------+-----------+ | 0 | 7 | +------------+-----------+ | 1 | 1 | +------------+-----------+ | 2 | 3 …
Lenix
  • 23
  • 3
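
A sketch of the underlying computation in Python rather than SQL: walk the numbers in sorted order, accumulate their counts, and stop at the middle position(s) without expanding the table into individual rows. The function name `median_from_counts` is hypothetical, and the example uses only the three rows visible in the excerpt, since the rest of the table is truncated.

```python
def median_from_counts(counts):
    """Median of a multiset given as {number: count}, without expanding it."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")

    # 0-based positions of the middle element(s) in the sorted expansion.
    lower_pos = (total - 1) // 2
    upper_pos = total // 2

    running = 0
    lower = upper = None
    for number in sorted(counts):
        running += counts[number]
        if lower is None and running > lower_pos:
            lower = number
        if running > upper_pos:
            upper = number
            break
    return (lower + upper) / 2


# Only the rows visible in the excerpt: 0 occurs 7 times, 1 once, 2 three times.
visible_rows = {0: 7, 1: 1, 2: 3}
print(median_from_counts(visible_rows))
```

For an even total the two middle values are averaged, which is why both positions are tracked.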