Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization which enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactively (through the Grunt shell) or in batch mode (from a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, which makes them easy to write and understand.
  • Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
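
As a rough illustration of what "data flow sequences" means in the list above, the sketch below mirrors a three-step Pig Latin flow (load, group, count) in plain Python. The Pig Latin statements appear as comments; the file name `logs.txt`, the field names, and the sample records are hypothetical, and the Python only mimics what the flow computes, not how Pig executes it on a cluster.

```python
from collections import defaultdict

# Pig Latin: logs = LOAD 'logs.txt' AS (user:chararray, url:chararray);
# Hypothetical in-memory records standing in for the loaded file.
logs = [("alice", "/home"), ("bob", "/home"), ("alice", "/cart")]

# Pig Latin: grouped = GROUP logs BY user;
# Each key maps to the bag of records that share that user.
grouped = defaultdict(list)
for user, url in logs:
    grouped[user].append((user, url))

# Pig Latin: counts = FOREACH grouped GENERATE group, COUNT(logs);
# Emit one (user, row count) pair per group, sorted for stable output.
counts = sorted((user, len(rows)) for user, rows in grouped.items())

print(counts)
```

Each statement names an intermediate relation that the next statement consumes, which is the step-by-step style Pig Latin encourages.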

5199 questions
258
votes
19 answers

Difference between Pig and Hive? Why have both?

My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS (PDF link). I understand that- Pig's language Pig Latin is a shift from(suits the way…
Arnkrishn
  • 29,828
  • 40
  • 114
  • 128
202
votes
17 answers

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive? From my understanding, HBase avoids using map-reduce and has column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase. I would also like to know how…
Khalefa
  • 2,294
  • 3
  • 14
  • 12
56
votes
7 answers

PIG how to count a number of rows in alias

I did something like this to count the number of rows in an alias in PIG: logs = LOAD 'log'; logs_w_one = foreach logs generate 1 as one; logs_group = group logs_w_one all; logs_count = foreach logs_group generate SUM(logs_w_one.one); dump…
kee
  • 10,969
  • 24
  • 107
  • 168
52
votes
8 answers

How do I get schema / column names from parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet but it's compressed, so I ran gunzip part-m-00000.gz.parquet but it doesn't uncompress the file since it doesn't recognise the…
Super_John
  • 1,767
  • 2
  • 14
  • 27
35
votes
4 answers

Apache Pig: FLATTEN and parallel execution of reducers

I have implemented an Apache Pig script. When I execute the script it results in many mappers for a specific step, but has only one reducer for that step. Because of this condition (many mappers, one reducer) the Hadoop cluster is almost idle while…
user2964640
  • 351
  • 3
  • 5
32
votes
8 answers

Merging multiple files into one within Hadoop

I get multiple small files in my input directory which I want to merge into a single file without using the local file system or writing mapreds. Is there a way I could do it using hadoop fs commands or Pig? Thanks!
uHadoop
  • 447
  • 1
  • 5
  • 7
29
votes
11 answers

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario- Pig version used 0.70 Sample HDFS directory structure: /user/training/test/20100810/ /user/training/test/20100811/ /user/training/test/20100812/ /user/training/test/20100813/
Arnkrishn
  • 29,828
  • 40
  • 114
  • 128
28
votes
2 answers

How to get array/bag of elements from Hive group by operator?

I want to group by a given field and get the output with grouped fields. Below is an example of what I am trying to achieve:- Imagine a table named 'sample_table' with two columns as below:- F1 F2 001 111 001 222 001 123 002 222 002 333 003 555 I…
Anuroop
  • 993
  • 3
  • 13
  • 25
22
votes
4 answers

Hadoop Pig: Passing Command Line Arguments

Is there a way to do this? eg, pass the name of the file to be processed, etc?
downer
  • 954
  • 2
  • 13
  • 24
21
votes
1 answer

Reference columns in a FOREACH after a JOIN?

A = load 'a.txt' as (id, a1); B = load 'b.txt' as (id, b1); C = join A by id, B by id; D = foreach C generate id,a1,b1; dump D; 4th line fails on: Invalid field projection. Projected field [id] does not exist in schema I tried to change to A.id but…
ihadanny
  • 4,377
  • 7
  • 45
  • 76
21
votes
2 answers

How to do outer join on two columns in Pig Latin

I do outer joins on single columns in Pig like this: result = JOIN A by id LEFT OUTER, B by id; How do I join on two columns, something like - WHERE A.id=B.id AND A.name=B.name What is the Pig equivalent? I couldn't find any example in the pig…
hese
  • 3,397
  • 8
  • 25
  • 34
21
votes
6 answers

How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant…
1frustratedpiggy
  • 211
  • 1
  • 2
  • 4
20
votes
6 answers

Pig vs Hive vs Native Map Reduce

I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of the scenarios that require Hive, Pig or native map reduce. I went through a few articles which basically point out that Hive is for structured processing…
Maverick
  • 484
  • 2
  • 9
  • 20
19
votes
7 answers

How do I parse JSON in Pig?

I have a lot of gzip'd log files in S3 that have 3 types of log lines: b,c,i. i and c are both single-level JSON: {"this":"that","test":"4"} Type b is deeply nested JSON. I came across this gist talking about compiling a jar to make this work. …
Eric Lubow
  • 763
  • 2
  • 12
  • 30
18
votes
5 answers

What is the difference between Apache Pig and Apache Hive?

What is the exact difference between Pig and Hive? I found that both have the same functional purpose because they are used for doing the same work. The only thing that differs is the implementation. So when should each be used, and which technology? Is there…
Ananda
  • 1,572
  • 7
  • 27
  • 54