
I am using DataFrame in pyspark.sql. Why is the output different in Ubuntu vs Mac?

I have only 10 documents, so N = 10. The formula I used is tf-idf = (1 + log(tf)) * log(N/df). As you can see, the Mac gives the correct output, but Ubuntu (inside a VM) gives the wrong output.

My tf-idf column is of type FloatType(). I calculated it using a UDF.

Ubuntu output:

[Screenshot: Ubuntu output]

Mac output:

[Screenshot: Mac output]

halfer
yeeen
  • Maybe one uses Python 2 where `/` performs integer division? – timgeb Sep 16 '18 at 06:56
  • Oh, there is a difference between Python 2 and Python 3 in division? @timgeb, can you explain more? Yes, my Ubuntu has Python 2.7.12 installed (provided by my lecturer). My Mac has Python 3.6.4. – yeeen Sep 16 '18 at 07:16

1 Answer


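As you mentioned in the comments, you are using Python 2.7 on Ubuntu and Python 3.6 on the Mac. Your PySpark code performs division (`/`), which behaves differently in the two versions: with two integer operands, Python 3 performs true division, while Python 2 performs integer (floor) division.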

In Python 3:

>>> 3/2
1.5

In Python 2:

>>> 3/2
1

Check out this answer for details on python2 vs python3 division and how to possibly adjust the behavior of your interpreter.
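
As a rough sketch of how the formula from the question could be made to behave the same on both versions: the function names below are hypothetical, and the natural logarithm (`math.log`) is an assumption, since the question does not state the log base. The second function uses `//` only to mimic what Python 2 does with `/` on two integers, so the difference is visible on either interpreter.

```python
from __future__ import division  # makes `/` true division on Python 2 as well

import math


def tf_idf(tf, df, n_docs):
    """(1 + log(tf)) * log(N / df), with the natural log (an assumption)."""
    # float() forces true division even if the __future__ import is missing.
    return (1 + math.log(tf)) * math.log(float(n_docs) / df)


def tf_idf_floor_div(tf, df, n_docs):
    """Same formula with floor division, mimicking Python 2's `/` on ints."""
    return (1 + math.log(tf)) * math.log(n_docs // df)


# Hypothetical values: tf = 2, df = 3, N = 10 documents.
print(tf_idf(2, 3, 10))            # uses 10 / 3 = 3.333...
print(tf_idf_floor_div(2, 3, 10))  # uses 10 // 3 = 3, so the result is smaller
```

If the calculation lives in a UDF, the same idea applies inside the function you pass to `pyspark.sql.functions.udf`: either add the `__future__` import at the top of the module or convert one operand to `float` before dividing.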

timgeb
moriarty007