Questions tagged [apache-crunch]

Simple and Efficient MapReduce Pipelines

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

http://crunch.apache.org/
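For orientation, the classic Crunch word count (roughly as in the project's getting-started guide) shows the flavor of the API; this is a sketch that assumes the Crunch and Hadoop jars are on the classpath, and input/output paths come from the command line:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) {
    // Pipeline backed by Hadoop MapReduce; MemPipeline (testing) and
    // SparkPipeline are drop-in alternatives.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into words.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() groups and counts in one step, yielding a PTable<String, Long>.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```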

52 questions
5
votes
2 answers

Is there a generic way of converting PCollection to PTable in Apache Crunch?

I have these methods in a util class which convert a specific PCollection to a specific PTable. public static PTable getPTableForCASegments(PCollection
Vivek Rai
  • 73
  • 5
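The two usual generic routes here are Crunch's own PCollection#by (key any collection with a function) and PTables.asPTable (for collections that already hold pairs). A sketch, with illustrative class and method names:

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.PTables;
import org.apache.crunch.types.avro.Avros;

public class TableConversions {
  // Generic: key any PCollection<V> by a function, avoiding one
  // hand-written conversion method per record type.
  public static <V> PTable<String, V> keyBy(PCollection<V> coll,
                                            MapFn<V, String> keyFn) {
    return coll.by(keyFn, Avros.strings());
  }

  // If the collection already holds Pair<K, V>, PTables.asPTable
  // performs the conversion directly.
  public static <K, V> PTable<K, V> toTable(PCollection<Pair<K, V>> pairs) {
    return PTables.asPTable(pairs);
  }
}
```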
3
votes
1 answer

How to split ORC file based on size?

I have a requirement where I want to split a 5 GB ORC file into 5 files of 1 GB each. ORC files are splittable. Does that mean we can only split the file stripe by stripe? But I have a requirement to split the ORC file based on size. for…
Sham Desale
  • 51
  • 1
  • 3
3
votes
1 answer

How does Apache Crunch pipeline generate map reduce jobs?

I'm new to Hadoop pipeline frameworks like Crunch/Cascading. I was wondering: under the hood, do these frameworks generate actual Mapper and Reducer classes, like an ordinary MapReduce program? From the Crunch source code, I didn't find the code…
qingpan
  • 406
  • 1
  • 4
  • 14
3
votes
1 answer

WordCount with Apache Crunch into HBase Standalone

Currently I'm evaluating Apache Crunch. I followed a simple WordCount MapReduce job example. Afterwards I try to save the results into a standalone HBase instance. HBase is running (checked with jps and the HBase shell) as described here:…
Pa Rö
  • 449
  • 1
  • 6
  • 18
2
votes
0 answers

Pass a map (or concurrent hashmap) in a DoFn(apache crunch)

Since there's a limit on Hadoop counter size (and we don't want to increase it for just one job), I am creating a map (Map) which will increment a key if some conditions are met (same as counters). There is already a DoFn (returning custom made…
2
votes
1 answer

What does reading data in a "streaming fashion" mean?

I was reading the Apache Crunch documentation and I found the following sentence: Data is read in from the filesystem in a streaming fashion, so there is no requirement for the contents of the PCollection to fit in memory for it to be read…
dbustosp
  • 4,208
  • 25
  • 46
2
votes
2 answers

Configuring number of reducers for a particular DoFn in Apache Crunch

I understand that there are properties like CRUNCH_BYTES_PER_REDUCE_TASK or mapred.reduce.tasks to set the number of reducers. Can anyone suggest how to configure / override the default number of reducers for a particular DoFn which is taking more time to…
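One common approach (a sketch, not a complete job): per-stage reducer counts in Crunch are set on the groupByKey that feeds the slow DoFn, via GroupingOptions, rather than globally:

```java
import org.apache.crunch.GroupingOptions;
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;

public class ReducerConfig {
  // Explicitly request 40 reducers for this shuffle, overriding the
  // size-based estimate from CRUNCH_BYTES_PER_REDUCE_TASK.
  static <K, V> PGroupedTable<K, V> groupWithReducers(PTable<K, V> table) {
    return table.groupByKey(GroupingOptions.builder()
        .numReducers(40)   // illustrative value
        .build());
  }
}
```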
2
votes
1 answer

Which jobs can MapReduce do that Apache Crunch can't?

I'm studying Apache Crunch. As far as I know, Crunch is an abstraction framework built on top of the MapReduce framework. I intend to use Crunch instead of the MapReduce framework. My question is: which jobs can MapReduce do that Crunch can't?
SieuCau
  • 195
  • 1
  • 2
  • 15
2
votes
1 answer

Not able to set mapred.job.queue.name in Oozie java action

I have an application which runs crunch jobs. I am trying to configure Oozie to run this job using a java action. My action is as given below,
Tanveer Dayan
  • 496
  • 1
  • 7
  • 18
2
votes
2 answers

Missing dependencies in Apache Crunch Scala build

I'm trying to build the Apache Crunch source code on my CentOS 7 machine, but am getting the following error in the crunch-spark project when I execute mvn package: [ERROR]…
Ben Watson
  • 5,357
  • 4
  • 42
  • 65
2
votes
1 answer

Hadoop InputFormat set Key to Input File Path

My Hadoop job needs to be aware of the input path that each record is derived from. For example, assume I am running a job over a collection of S3 objects: s3://bucket/file1 s3://bucket/file2 s3://bucket/file3 I would like to reduce key-value pairs…
qwwqwwq
  • 6,999
  • 2
  • 26
  • 49
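In plain MapReduce, the standard trick is to read the path off the mapper's InputSplit; a minimal sketch (the class name is illustrative, and it assumes the default file-based input formats, where the split is a FileSplit):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (source file path, line) so downstream reducers can see
// which file each record came from.
public class PathTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // getInputSplit() is a FileSplit for file-based input formats;
    // getPath() returns the originating file (s3:// URIs included).
    String path = ((FileSplit) context.getInputSplit()).getPath().toString();
    context.write(new Text(path), value);
  }
}
```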
2
votes
2 answers

Create hive table for schema less avro files

I have multiple Avro files, and each file has a STRING in it. Each Avro file is a single row. How can I write a Hive table to consume all the Avro files located in a single directory? Each file has a big number in it, and hence I do not have any json…
AkD
  • 427
  • 10
  • 19
2
votes
3 answers

Convert Avro in to Parquet format

I want to export data from a database and convert it into Avro + Parquet format. Sqoop supports Avro export but not Parquet. I tried to convert the Avro objects to Parquet using Apache Pig, Apache Crunch, etc., but nothing worked out. Apache Pig gives me…
Ananth Duari
  • 2,859
  • 11
  • 35
  • 42
2
votes
0 answers

Single Serialization Type (SST) of Pig/Cascading versus Multiple Serialization Type (MST) of Apache Crunch

In their FAQ here, the Crunch team highlights the main difference to be the MST of Crunch versus the SST of Cascading. I am not sure how these are different. Can someone explain with an example?
Aravind Yarram
  • 78,777
  • 46
  • 231
  • 327
2
votes
3 answers

How to trace the origin of "()V" failures in Avro?

I am using apache crunch and have got a cryptic error message from Avro: java.lang.NoSuchMethodError: org.apache.avro.mapred.AvroKey: method ()V not found at…
jayunit100
  • 17,388
  • 22
  • 92
  • 167