1

I'm working on a language model and want to count the number of pairs of two consecutive words. I found an example of such a problem in Scala with the sliding function, though I didn't manage to find the analogue in PySpark:

data.sliding(2).map(lambda pair: (tuple(pair), 1)).reduceByKey(lambda x, y: x + y)

I guess it should be something like that. A workaround might be writing a function that finds the next word in the array, but I suspect there should be a built-in solution.
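For reference, overlapping word pairs (bigrams) can be built in plain Python, without any sliding helper, by zipping the word list against itself shifted by one. This is a minimal local sketch using `collections.Counter`; the same pair list could just as well be handed to `sc.parallelize(...)` and counted with `reduceByKey` on a cluster:

```python
from collections import Counter

text = "to be or not to be"
words = text.split()

# Overlapping bigrams: pair each word with its immediate successor.
# zip stops at the shorter sequence, so the last word is not paired forward.
bigrams = list(zip(words, words[1:]))

counts = Counter(bigrams)
# ("to", "be") occurs twice in this example text.
```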

Daniel Chepenko
  • 2,229
  • 7
  • 30
  • 56

1 Answer

0

Maybe this will help. You can find other splitting methods here: Is there a way to split a string by every nth separator in Python?

text = ("I'm working on language model and want to count the number pairs of two consequent words."
        "I found an examples of such problem on language model and want to count the number pairs")

# One shared iterator: zip(i, i) draws two words at a time from it.
i = iter(text.split())

rdd = sc.parallelize([" ".join(x) for x in zip(i, i)])

print(rdd.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).collect())

[('found an', 1), ('count the', 2), ('want to', 2), ('examples of', 1), ('model and', 2), ('on language', 2), ('number pairs', 2), ("I'm working", 1), ('consequent words.I', 1), ('such problem', 1), ('of two', 1)]
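Note that `zip(i, i)` over a single shared iterator consumes two words per pair, so the pairs above do not overlap. If overlapping consecutive pairs are wanted instead, a minimal variant is to zip the word list against itself shifted by one (the resulting list can be passed to `sc.parallelize` exactly as above):

```python
words = "count the number pairs of two consequent words".split()

# Overlapping pairs: each word joined with its successor,
# e.g. "count the", "the number", "number pairs", ...
pairs = [" ".join(p) for p in zip(words, words[1:])]
```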

user3689574
  • 1,596
  • 1
  • 11
  • 20