Questions tagged [apache-pig]

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig runs in two execution modes: local mode and MapReduce mode. Pig scripts can be run in two ways: interactively (through the Grunt shell) or in batch mode (from a script file).

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs for which large-scale parallel implementations already exist (e.g. the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, easy to write and understand.
Optimization opportunities. The declarative way in which tasks are encoded permits the system to optimize their execution plan automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
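
As a minimal sketch of the data-flow style described above, a short Pig Latin script might look like the following. The input file, field names, and output path are hypothetical and chosen only for illustration.

```pig
-- Load a tab-separated log file (hypothetical path and schema).
logs = LOAD 'access_log.tsv' USING PigStorage('\t')
       AS (user:chararray, url:chararray, bytes:long);

-- Group the records by user; each group carries a bag named "logs".
by_user = GROUP logs BY user;

-- Compute one summary row per user from the grouped bag.
totals = FOREACH by_user GENERATE
             group           AS user,
             COUNT(logs)     AS hits,
             SUM(logs.bytes) AS total_bytes;

-- Sort by traffic, largest first, and write the result out.
ordered = ORDER totals BY total_bytes DESC;
STORE ordered INTO 'user_totals';
```

Each statement names an intermediate relation, so the script reads as a sequence of transformations rather than a single nested query. Saved to a file (say `user_totals.pig`, again a hypothetical name), it could be run in local mode with `pig -x local user_totals.pig` or on a cluster with `pig -x mapreduce user_totals.pig`.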

5199 questions

258

votes

19 answers

Difference between Pig and Hive? Why have both?

My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS (PDF link). I understand that- Pig's language Pig Latin is a shift from(suits the way…

hadoop hive apache-pig

asked Jul 28 '10 at 18:42

Arnkrishn

29,828
40
114
128

202

votes

17 answers

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive? From my understanding, HBase avoids using map-reduce and has a column oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase. I would also like to know how…

hadoop hbase hive apache-pig

asked Dec 17 '12 at 09:33

Khalefa

2,294
3
14
12

votes

7 answers

PIG how to count a number of rows in alias

I did something like this to count the number of rows in an alias in PIG: logs = LOAD 'log' logs_w_one = foreach logs generate 1 as one; logs_group = group logs_w_one all; logs_count = foreach logs_group generate SUM(logs_w_one.one); dump…

hadoop apache-pig

asked Mar 28 '12 at 03:29

kee

10,969
24
107
168

votes

8 answers

How do I get schema / column names from parquet file?

I have a file stored in HDFS as part-m-00000.gz.parquet I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet but it's compressed, so I ran gunzip part-m-00000.gz.parquet but it doesn't uncompress the file since it doesn't recognise the…

hadoop apache-pig hdfs parquet

asked Nov 24 '15 at 00:57

Super_John

1,767
2
14
27

votes

4 answers

Apache Pig: FLATTEN and parallel execution of reducers

I have implemented an Apache Pig script. When I execute the script it results in many mappers for a specific step, but has only one reducer for that step. Because of this condition (many mappers, one reducer) the Hadoop cluster is almost idle while…

hadoop apache-pig

asked Nov 07 '13 at 12:00

user2964640

votes

8 answers

Merging multiple files into one within Hadoop

I get multiple small files into my input directory which I want to merge into a single file without using the local file system or writing MapReduce jobs. Is there a way I could do it using hadoop fs commands or Pig? Thanks!

hadoop apache-pig

asked Aug 23 '10 at 13:59

uHadoop

votes

11 answers

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario- Pig version used 0.70 Sample HDFS directory structure: /user/training/test/20100810/ /user/training/test/20100811/ /user/training/test/20100812/ /user/training/test/20100813/

hadoop apache-pig

asked Aug 18 '10 at 18:39

Arnkrishn

29,828
40
114
128

votes

2 answers

How to get array/bag of elements from Hive group by operator?

I want to group by a given field and get the output with grouped fields. Below is an example of what I am trying to achieve:- Imagine a table named 'sample_table' with two columns as below:- F1 F2 001 111 001 222 001 123 002 222 002 333 003 555 I…

sql hadoop hive apache-pig bigdata

asked May 08 '13 at 15:03

Anuroop

votes

4 answers

Hadoop Pig: Passing Command Line Arguments

Is there a way to do this? eg, pass the name of the file to be processed, etc?

hadoop apache-pig

asked Nov 12 '10 at 15:29

downer

votes

1 answer

Reference columns in a FOREACH after a JOIN?

A = load 'a.txt' as (id, a1); B = load 'b.txt' as (id, b1); C = join A by id, B by id; D = foreach C generate id,a1,b1; dump D; 4th line fails on: Invalid field projection. Projected field [id] does not exist in schema I tried to change to A.id but…

apache-pig

asked Nov 08 '11 at 13:32

ihadanny

4,377
7
45
76

votes

2 answers

How to do outer join on two columns in Pig Latin

I do outer joins on single columns in Pig like this result = JOIN A by id LEFT OUTER, B by id; How do I join on two columns, something like - WHERE A.id=B.id AND A.name=B.name What is the pig equivalent? I couldn't find any example in the pig…

hadoop apache-pig

asked Nov 07 '11 at 15:45

hese

3,397
8
25
34

votes

6 answers

How can I use the map datatype in Apache Pig?

I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant…

syntax dictionary hadoop apache-pig

asked Nov 01 '10 at 14:07

1frustratedpiggy

votes

6 answers

Pig vs Hive vs Native Map Reduce

I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of the scenarios that require Hive, Pig or native MapReduce. I went through a few articles which basically point out that Hive is for structured processing…

hadoop mapreduce hive apache-pig

asked Jul 30 '13 at 14:47

Maverick

votes

7 answers

How do I parse JSON in Pig?

I have a lot of gzip'd log files in S3 that have 3 types of log lines: b, c, i. i and c are both single-level JSON: {"this":"that","test":"4"} Type b is deeply nested JSON. I came across this gist talking about compiling a jar to make this work. …

json apache-pig

asked Feb 16 '11 at 05:59

Eric Lubow

votes

5 answers

What is the difference between Apache Pig and Apache Hive?

What is the exact difference between Pig and Hive? I found that both have the same functional meaning because they are used for the same work. The only difference is the implementation. So when should each technology be used? Is there…

hadoop hive apache-pig

asked Apr 23 '12 at 11:47

Ananda

1,572
7
27
54
