Why does this Python code work in pyspark but not in spark-submit?

Question

I'm fairly inexperienced with Python, and I'm having trouble getting some code to run.

counts = {key:len(list(group)) for key, group in it.groupby(sorted(topics))}

That line runs in pyspark (interactive mode), but if I try to spark-submit it I get a SyntaxError exception. The following code is equivalent and runs in both cases:

counts = {}
for key, group in it.groupby(sorted(topics)):
    counts[key] = len(list(group))

Can anyone tell me why the first version doesn't work in spark-submit? If it makes a difference, the code is executed inside a function, indented one tab.

The exception I get using a dictionary comprehension:

Traceback (most recent call last):
  File "./sessions.py", line 24, in <module>
    execfile("./sessionSearch.py")
  File "./sessionSearch.py", line 50
    counts = {poop:len(list(group)) for poop, group in it.groupby(sorted(topics))}
                                      ^
SyntaxError: invalid syntax

Please specify the specific syntax error that you get. – sabbahillel, Jan 28 '16 at 18:49

score 3 · Accepted Answer · edited May 23 '17 at 11:52

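Your cluster runs Python 2.6, which doesn't support dictionary comprehension syntax; that syntax was only added in Python 2.7. The interactive pyspark shell is presumably being started with a newer interpreter, which would explain why the same line works there.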
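One way to confirm the diagnosis is to print the interpreter details from inside the submitted script and compare them with what the interactive shell reports. The following is a minimal sketch; the helper name supports_dict_comprehensions is made up for illustration and is not part of Spark or the standard library.

```python
import sys


def supports_dict_comprehensions(version_info=None):
    """Return True if the interpreter version understands {k: v for ...}.

    Hypothetical helper for illustration. Dict comprehensions exist in
    Python 2.7 and in every Python 3 release, but not in 2.6 or earlier.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= (2, 7)


# Print which interpreter is actually executing this script.
print(sys.executable)
print(sys.version)
print(supports_dict_comprehensions())
```

Running this once in the pyspark shell and once through spark-submit should show whether the two are picking up different Python installations.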

Either use a generator expression plus the dict() function (see Alternative to dict comprehension prior to Python 2.7), or configure your cluster to deploy Python 2.7.

Using dict(), your line would be:

counts = dict((key, len(list(group))) for key, group in it.groupby(sorted(topics)))
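For reference, here is a self-contained sketch of that dict() form next to the loop from the question. It assumes that `it` is an alias for itertools (the question does not show the import) and uses made-up sample data for topics.

```python
import itertools as it  # assumption: `it` in the question refers to itertools

# Hypothetical sample data; the question does not show what `topics` contains.
topics = ["spark", "python", "spark", "hadoop", "python", "spark"]

# Generator expression passed to dict(): valid on Python 2.6 as well as 2.7+.
counts = dict((key, len(list(group))) for key, group in it.groupby(sorted(topics)))

# The explicit loop from the question, for comparison.
loop_counts = {}
for key, group in it.groupby(sorted(topics)):
    loop_counts[key] = len(list(group))

print(counts == loop_counts)
```

Both forms build the same mapping from each distinct topic to its number of occurrences; sorting first matters because groupby only groups adjacent equal items.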

answered Jan 28 '16 at 18:55

Martijn Pieters
