Questions tagged [apache-crunch]

Simple and Efficient MapReduce Pipelines

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

http://crunch.apache.org/
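For orientation, the classic Crunch word count (roughly as in the project's getting-started guide) shows the flavor of the API; this is a sketch that assumes the Crunch and Hadoop jars are on the classpath, and input/output paths come from the command line:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) {
    // Pipeline backed by Hadoop MapReduce; MemPipeline (testing) and
    // SparkPipeline are drop-in alternatives.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into words.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() groups and counts in one step, yielding a PTable<String, Long>.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```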

52 questions
5
votes
2 answers

Is there a generic way of converting PCollection to PTable in Apache Crunch?

I have these methods in a util class which convert a specific PCollection to a specific PTable. public static PTable getPTableForCASegments(PCollection
Vivek Rai
  • 73
  • 5
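The two usual generic routes here are Crunch's own PCollection#by (key any collection with a function) and PTables.asPTable (for collections that already hold pairs). A sketch, with illustrative class and method names:

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.PTables;
import org.apache.crunch.types.avro.Avros;

public class TableConversions {
  // Generic: key any PCollection<V> by a function, avoiding one
  // hand-written conversion method per record type.
  public static <V> PTable<String, V> keyBy(PCollection<V> coll,
                                            MapFn<V, String> keyFn) {
    return coll.by(keyFn, Avros.strings());
  }

  // If the collection already holds Pair<K, V>, PTables.asPTable
  // performs the conversion directly.
  public static <K, V> PTable<K, V> toTable(PCollection<Pair<K, V>> pairs) {
    return PTables.asPTable(pairs);
  }
}
```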
3
votes
1 answer

How to split ORC file based on size?

I have a requirement where I want to split a 5 GB ORC file into 5 files of 1 GB each. ORC files are splittable. Does that mean we can only split the file stripe by stripe? But I have a requirement to split the ORC file based on size. for…
Sham Desale
  • 51
  • 1
  • 3
3
votes
1 answer

How does Apache Crunch pipeline generate map reduce jobs?

I'm new to Hadoop pipeline frameworks like Crunch/Cascading. I was wondering: under the hood, do these frameworks generate actual Mapper and Reducer classes, like an ordinary MapReduce program? From the Crunch source code, I didn't find the code…
qingpan
  • 406
  • 1
  • 4
  • 14
3
votes
1 answer

WordCount with Apache Crunch into HBase Standalone

Currently I'm evaluating Apache Crunch. I followed a simple WordCount MapReduce job example. Afterwards I try to save the results into a standalone HBase instance. HBase is running (checked with jps and the HBase shell) as described here:…
Pa Rö
  • 449
  • 1
  • 6
  • 18
2
votes
0 answers

Pass a map (or concurrent hashmap) in a DoFn(apache crunch)

Since there's a limit on Hadoop counter size (and we don't want to increase it for just one job), I am creating a map (Map) which will increment a key if some conditions are met (same as counters). There is already a DoFn (returning custom made…
2
votes
1 answer

What does reading data in a "streaming fashion" mean?

I was reading the Apache Crunch documentation and I found the following sentence: Data is read in from the filesystem in a streaming fashion, so there is no requirement for the contents of the PCollection to fit in memory for it to be read…
dbustosp
  • 4,208
  • 25
  • 46
2
votes
2 answers

Configuring number of reducers for a particular DoFn in Apache Crunch

I understand that there are properties like CRUNCH_BYTES_PER_REDUCE_TASK or mapred.reduce.tasks to set the number of reducers. Can anyone suggest how to configure / override the default number of reducers for a particular DoFn which is taking more time to…
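One common approach (a sketch, not a complete job): per-stage reducer counts in Crunch are set on the groupByKey that feeds the slow DoFn, via GroupingOptions, rather than globally:

```java
import org.apache.crunch.GroupingOptions;
import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;

public class ReducerConfig {
  // Explicitly request 40 reducers for this shuffle, overriding the
  // size-based estimate from CRUNCH_BYTES_PER_REDUCE_TASK.
  static <K, V> PGroupedTable<K, V> groupWithReducers(PTable<K, V> table) {
    return table.groupByKey(GroupingOptions.builder()
        .numReducers(40)   // illustrative value
        .build());
  }
}
```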
2
votes
1 answer

Which jobs can MapReduce do that Apache Crunch can't?

I'm studying Apache Crunch. As far as I know, Crunch is an abstraction framework built on top of the MapReduce framework. I intend to use Crunch instead of the MapReduce framework. My question is: which jobs can MapReduce do that Crunch can't?
SieuCau
  • 195
  • 1
  • 2
  • 15
2
votes
1 answer

Not able to set mapred.job.queue.name in Oozie java action

I have an application which runs crunch jobs. I am trying to configure Oozie to run this job using a java action. My action is as given below,
Tanveer Dayan
  • 496
  • 1
  • 7
  • 18
2
votes
2 answers

Missing dependencies in Apache Crunch Scala build

I'm trying to build the Apache Crunch source code on my CentOS 7 machine, but am getting the following error in the crunch-spark project when I execute mvn package: [ERROR]…
Ben Watson
  • 5,357
  • 4
  • 42
  • 65
2
votes
1 answer

Hadoop InputFormat set Key to Input File Path

My Hadoop job needs to be aware of the input path that each record is derived from. For example, assume I am running a job over a collection of S3 objects: s3://bucket/file1 s3://bucket/file2 s3://bucket/file3 I would like to reduce key-value pairs…
qwwqwwq
  • 6,999
  • 2
  • 26
  • 49
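In plain MapReduce, the standard trick is to read the path off the mapper's InputSplit; a minimal sketch (the class name is illustrative, and it assumes the default file-based input formats, where the split is a FileSplit):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (source file path, line) so downstream reducers can see
// which file each record came from.
public class PathTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // getInputSplit() is a FileSplit for file-based input formats;
    // getPath() returns the originating file (s3:// URIs included).
    String path = ((FileSplit) context.getInputSplit()).getPath().toString();
    context.write(new Text(path), value);
  }
}
```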
2
votes
2 answers

Create hive table for schema less avro files

I have multiple Avro files, and each file has a STRING in it. Each Avro file is a single row. How can I write a Hive table to consume all the Avro files located in a single directory? Each file has a big number in it, and hence I do not have any json…
AkD
  • 427
  • 10
  • 19
2
votes
3 answers

Convert Avro in to Parquet format

I want to export data from a database and convert it into Avro + Parquet format. Sqoop supports Avro export but not Parquet. I tried to convert the Avro objects to Parquet using Apache Pig, Apache Crunch, etc., but nothing worked out. Apache Pig gives me…
Ananth Duari
  • 2,859
  • 11
  • 35
  • 42
2
votes
0 answers

Single Serialization Type (SST) of Pig/Cascading versus Multiple Serialization Type (MST) of Apache Crunch

In their FAQ here, the Crunch team highlights the main difference to be the MST of Crunch versus the SST of Cascading. I am not sure how these are different. Can someone explain with an example?
Aravind Yarram
  • 78,777
  • 46
  • 231
  • 327
2
votes
3 answers

How to trace the origin of "()V" failures in Avro?

I am using apache crunch and have got a cryptic error message from Avro: java.lang.NoSuchMethodError: org.apache.avro.mapred.AvroKey: method ()V not found at…
jayunit100
  • 17,388
  • 22
  • 92
  • 167