UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 261: ordinal not in range(128)

Above is the error I got after trying to join two tables into a new one. Below are my code and the actual output for each RDD.

The code to create the first RDD

import csv

table1 = sc.textFile('inventory') \
    .map(lambda line: next(csv.reader([line]))) \
    .map(lambda fields: ((fields[0], fields[8], fields[10]), 1))

Actual output of the first RDD

[(('BibNum', 'ItemCollection', 'ItemLocation'), 1),
 (('3011076', 'ncrdr', 'qna'), 1),
 (('2248846', 'nycomic', 'lcy'), 1)]

The code to create the second RDD

table2 = sc.textFile('checkouts') \
    .map(lambda line: next(csv.reader([line]))) \
    .map(lambda fields: ((fields[0], fields[3], fields[5]), 1))

Actual output of the second RDD

[(('BibNum', 'ItemCollection', 'CheckoutDateTime'), 1),
 (('1842225', 'namys', '05/23/2005 03:20:00 PM'), 1),
 (('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1),
 (('1982511', 'ncvidnf', '08/11/2005 01:52:00 PM'), 1),
 (('2026467', 'nacd', '10/19/2005 07:47:00 PM'), 1)]

Lastly, I tried the following code to join table1 and table2: `table3 = u' '.join((table1, table2)).encode('utf-8').strip()`, but it did not work and raised the error above. Please enlighten me if you have any idea about this error.

Gideok Seong
  • It looks like an encoding problem in your text files... – ma3oun Apr 14 '18 at 15:22
  • Even though I am able to get the values from each RDD, can there still be a problem with the text files? Is there any way to fix it? I want to join the two tables. – Gideok Seong Apr 14 '18 at 15:27
  • Possible duplicate of [How do you perform basic joins of two RDD tables in Spark using Python?](https://stackoverflow.com/questions/31257077/how-do-you-perform-basic-joins-of-two-rdd-tables-in-spark-using-python) – Alper t. Turker Apr 14 '18 at 15:27
  • @user8371915 I used the normal join syntax, but the same error happened, so that's why I tried the syntax above. – Gideok Seong Apr 14 '18 at 15:28
  • `table3 = table1.join(table2)` is what I used before. – Gideok Seong Apr 14 '18 at 15:29
  • `u' '.join((table1, table2)).encode('utf-8').strip()` is not `join` syntax; it is a random piece of code with no relation to Spark `joins` other than the name. And if your code fails no matter what, then clearly the error is not here. My best guess is that it is in `csv.reader([line])`, as the `csv` reader doesn't support unicode. Finally, if you tried to join both `RDD`s as they are, you didn't read what the expectations for RDD joins are. If you want someone to be able to answer this question, start with [mcve]. – Alper t. Turker Apr 14 '18 at 15:37
  • @user8371915 Thank you for pointing that out. The reason I used csv.reader is that a few columns in my data contain delimiters like commas inside a single column, so splitting on commas did not work properly; that's why I used csv.reader as a last resort. I didn't even know that csv.reader does not support unicode (see the workaround sketch after these comments). – Gideok Seong Apr 14 '18 at 15:49
  • What key are you trying to join on? Can you provide an example of a successful join? It seems like you don't have anything interesting in the values; why do you need to join them? Maybe you can use "union" and then reduceByKey or groupByKey? – user3689574 Apr 15 '18 at 06:37
  • @user3689574 Sorry for the late response. I was trying to use join because I wanted to make a new table (RDD) containing the common rows. For example, if the first RDD has BibNum 123 and ItemCollection 'abc', it would match a row in the second RDD with the same BibNum 123 and ItemCollection 'abc'. That's why I tried to use join. Am I going in the correct direction? – Gideok Seong Apr 16 '18 at 00:22
  • @user3689574 Hence, in short, I am joining on two keys: BibNum and ItemCollection. – Gideok Seong Apr 16 '18 at 00:23
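As the comments note, Python 2's csv module only reads byte strings, which fits the UnicodeEncodeError above. A minimal workaround sketch, not the asker's actual code: it assumes the files are UTF-8 encoded, `sc` is the existing SparkContext, and `parse_line` is a hypothetical helper name.

import csv

def parse_line(line):
    # Python 2's csv module cannot handle unicode input, so encode the
    # line to UTF-8 bytes, parse it, then decode each field back.
    fields = next(csv.reader([line.encode('utf-8')]))
    return [f.decode('utf-8') for f in fields]

table1 = sc.textFile('inventory') \
    .map(parse_line) \
    .map(lambda fields: ((fields[0], fields[8], fields[10]), 1))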

1 Answer


Let's see if I understand what you need.
You have two RDDs and you want to join them based on two values.
First, you can remove the header (the first row) of each RDD.
Then define the key for the join, and then do the join.
I will use a slightly less efficient but easy-to-understand cleaning method
(you can find a more efficient one here: Remove first element in RDD without using filter function).

rdd1 = sc.parallelize([(('BibNum', 'ItemCollection', 'ItemLocation'), 1),
                       (('3011076', 'ncrdr', 'qna'), 1),
                       (('2248846', 'nycomic', 'lcy'), 1),
                       (('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1)])

rdd2 = sc.parallelize([(('BibNum', 'ItemCollection', 'CheckoutDateTime'), 1),
                       (('1842225', 'namys', '05/23/2005 03:20:00 PM'), 1),
                       (('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1),
                       (('1982511', 'ncvidnf', '08/11/2005 01:52:00 PM'), 1),
                       (('2026467', 'nacd', '10/19/2005 07:47:00 PM'), 1)])

# Drop the header row (index 0), then key each record by its first two
# fields, (BibNum, ItemCollection), keeping the full record as the value.
rdd1_for_join = rdd1.zipWithIndex().filter(lambda x: x[1] != 0) \
    .map(lambda x: ((x[0][0][0], x[0][0][1]), x[0]))

rdd2_for_join = rdd2.zipWithIndex().filter(lambda x: x[1] != 0) \
    .map(lambda x: ((x[0][0][0], x[0][0][1]), x[0]))

print(rdd1_for_join.join(rdd2_for_join).collect())

[(('1928264', 'ncpic'), ((('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1), (('1928264', 'ncpic', '12/14/2005 05:56:00 PM'), 1)))]
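To apply the same pattern to the full CSV-loaded RDDs rather than hand-built samples, here is a rough sketch; it assumes `table1` and `table2` are built exactly as in the question (elements of the form `((col_a, col_b, col_c), 1)`), and `drop_header` and `t1`/`t2` are hypothetical names.

def drop_header(rdd):
    # Remove the first row (the CSV header) by its index.
    return rdd.zipWithIndex().filter(lambda x: x[1] != 0).map(lambda x: x[0])

# Key each record by (BibNum, ItemCollection), keeping the full record as the value.
t1 = drop_header(table1).map(lambda x: ((x[0][0], x[0][1]), x))
t2 = drop_header(table2).map(lambda x: ((x[0][0], x[0][1]), x))

table3 = t1.join(t2)
print(table3.take(5))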

user3689574
  • Sir, the outputs above are only part of the data, so I need a general way that can be applied to the whole csv file. But thank you for your response. – Gideok Seong Apr 16 '18 at 19:32
  • What do you mean by "general way"? I showed you how to join two RDDs using two keys. This will work regardless of the size of the RDDs. You can use the same approach with more or fewer keys. What else do you need? – user3689574 Apr 17 '18 at 08:02
  • Sorry for making you confused. I mean that I have many rows in the csv file, but when you used sc.parallelize you typed the values in by hand, so is it okay to put only a small amount of data in it? – Gideok Seong Apr 17 '18 at 10:50
  • I used sc.parallelize to make an RDD from your example data; you don't need to use it because you already have the RDDs loaded, right? I thought your problem was with the join. – user3689574 Apr 17 '18 at 11:32
  • Yeah, that's right. I already made the two RDDs, so my problem is just joining them, but it is not working. – Gideok Seong Apr 18 '18 at 00:16
  • Sir, I know your code works well, but when I use my two datasets and join them, a "list index out of range" error keeps happening. – Gideok Seong Apr 18 '18 at 19:13
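A likely cause of that "list index out of range", hinted at by the last comment: some parsed rows have fewer columns than the highest index used (fields[10] for 'inventory'). A defensive sketch, reusing the hypothetical `parse_line` helper from the earlier sketch:

# Keep only rows long enough for every column index we access;
# the header row still needs to be dropped separately, as in the answer.
table1 = sc.textFile('inventory') \
    .map(parse_line) \
    .filter(lambda fields: len(fields) >= 11) \
    .map(lambda fields: ((fields[0], fields[8], fields[10]), 1))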