I have a large RDD and I want to create 4 different RDDs from it, based on the lists of headers provided, and save each one to an Impala table by creating 4 Parquet files.

The data looks like this:

a    b   c   d   e   f   g   h
-------------------------------
abc  1   3   4   5   7   9   11
xyz  2   5   7   4   9   4   12

I have a list of columns for each table on the Impala side:

table 1 impala side: a, b, c

table 2 impala side: d, e, f
...

I also need to add a new column to each table for a user-defined primary key, like:

table 1 impala side: id, a, b, c

I tried the rdd.map function, but how do I apply it to a specific list of columns?

rdd_1 = rdd.map(lambda x: (x['a'], x['b'], x['c']))

Also, how do I add a new column with a different primary key for each table?

Ani
    As it is stated right now, it is hard to understand what you really need. Could you please [edit] your question to include example input and expected output? You can use [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/8371915) as an inspiration. – Alper t. Turker Jul 20 '18 at 23:11

1 Answer

You can use operator.itemgetter to select a specific list of columns from each row of the RDD.

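import operator

# Column lists for each target table
list1 = ['a', 'b', 'c']
list2 = ['d', 'e', 'f']

# itemgetter(*keys) builds a callable that returns a tuple of row[key] values
rddGetter1 = operator.itemgetter(*list1)
rddGetter2 = operator.itemgetter(*list2)

# Apply each getter to every row of the source RDD
rdd1 = rdd.map(rddGetter1)
rdd2 = rdd.map(rddGetter2)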
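
To see what the getters return, and one possible way to handle the id column the question asks about, here is a minimal sketch that runs on plain Python lists instead of a Spark RDD. It assumes each row is a dict keyed by column name, as the question's x['a'] indexing suggests. The names rows, table_columns and project_with_id are hypothetical, and the sequential id is just one choice of user-defined key.

```python
from operator import itemgetter

# Hypothetical sample rows, shaped like the dict records indexed as x['a'] in the question.
rows = [
    {"a": "abc", "b": 1, "c": 3, "d": 4, "e": 5, "f": 7, "g": 9, "h": 11},
    {"a": "xyz", "b": 2, "c": 5, "d": 7, "e": 4, "f": 9, "g": 4, "h": 12},
]

# Hypothetical mapping from target table name to its column list.
table_columns = {
    "table_1": ["a", "b", "c"],
    "table_2": ["d", "e", "f"],
}


def project_with_id(rows, columns, start=1):
    """Select `columns` from each row and prepend a sequential id.

    Assumes at least two columns: itemgetter returns a bare value,
    not a tuple, when it is given a single key.
    """
    getter = itemgetter(*columns)
    return [(i,) + getter(row) for i, row in enumerate(rows, start=start)]


# One projected list of (id, col1, col2, ...) tuples per target table.
tables = {name: project_with_id(rows, cols) for name, cols in table_columns.items()}

for name, projected in tables.items():
    print(name, projected)
```

On an actual RDD, a comparable approach might be rdd.zipWithIndex(), which pairs each element with a 0-based index, followed by a map that prepends the index to the selected values. Whether a sequential index is an acceptable primary key depends on the table design.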
hamza tuna