
I am taking a MOOC.

It has one assignment where a column needs to be converted to lower case. `sentence = lower(column)` does the trick, but initially I thought the syntax should be `sentence = column.lower()`. I looked at the documentation and I couldn't figure out the problem with my syntax. Could you explain how I could have figured out, from the online documentation and the function definitions, that my syntax was wrong?

I am especially confused, as this link shows that `string.lower()` does the trick for regular Python string objects.

from pyspark.sql.functions import regexp_replace, trim, col, lower
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained.  Other characters should be
        eliminated (e.g. it's becomes its).  Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """

    sentence = lower(column)

    return sentence

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' *      Remove punctuation then spaces  * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
user2543622

3 Answers


You are correct: when you are working with a plain Python string and want to convert it to lowercase, you should use `str.lower()`.

And if you check the string methods page in the Python documentation, you will see that `str` has a `lower` method that works as you expect:

a_string = "StringToConvert"
a_string.lower()                     # "stringtoconvert"

However, in the Spark example you provided, in your function `removePunctuation` you are NOT working with a single string; you are working with a `Column`. A `Column` is a different object from a string, which is why you should use a function that works with a `Column`.

Specifically, you are working with this pyspark.sql function. The next time you are in doubt about which method you need, double-check the datatype of your objects. Also, if you check the list of imports, you will see that `lower` comes from `pyspark.sql.functions`.
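To make the "double-check the datatype" advice concrete, here is a minimal sketch in plain Python (no Spark needed) of how you can introspect an object to see which methods it actually has:

```python
# Introspecting a plain str shows why "string".lower() works:
s = "StringToConvert"
print(type(s).__name__)    # str
print('lower' in dir(s))   # True: str defines a lower() method
print(s.lower())           # stringtoconvert
```

Running the same `dir(...)` check on a `pyspark.sql.Column` (not imported here) would report no `lower` entry, which is the clue that `column.lower()` cannot work.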

  • I agree that a column is a different type of object than a string. But then how would I know that `.lower()` doesn't work with it and that it should be `lower(column)`? The documentation doesn't show a working example, and therefore I would like to know how to figure out the proper syntax! – user2543622 Jul 14 '16 at 00:20
  • Well, you can check the "signature" of the called method. If you check the call to **removePunctuation** in this line: `.select(removePunctuation(col('sentence')))` you will see that before calling `removePunctuation`, it is calling `col('sentence')`. You can check in the documentation of [col](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.col) that it returns a **Column**. Also, in the comments of the method it says that removePunctuation receives a Column as an argument, instead of a string. – Juan Carlos Jul 14 '16 at 18:56
  • I think I was not clear here. I understand the input is a `Column`. But how do I know that for a `Column` I have to use `lower(col)` and cannot use `col.lower()`? – user2543622 Jul 15 '16 at 00:25
  • 1
    `col` is an object of type `Column`. `col.lower()` means "I want to execute a method named `lower` to `col`". It also means that **Column** needs to have a method named `lower` in its class definition. However, if you check the definition of [Column](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column) you will not find any method named `lower`. That means you can't call `lower` on `col`. – Juan Carlos Jul 15 '16 at 17:32
  • However, you are also importing another function (`lower`) from a module called `pyspark.sql.functions`. This function is available globally. Think of `lower` not as a method of a string object, but as a globally available function, just like `print`. You can read more on modules and functions in [this page of The Python Tutorial](https://docs.python.org/3/tutorial/modules.html) – Juan Carlos Jul 15 '16 at 17:41
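The distinction the comments describe can be sketched with a minimal stand-in class; `Column` and `lower` below are illustrative stand-ins, not the real pyspark.sql classes:

```python
class Column:
    """Minimal stand-in for pyspark.sql.Column; note it defines NO lower() method."""
    def __init__(self, expr):
        self.expr = expr

def lower(col):
    """Stand-in for pyspark.sql.functions.lower: a module-level function
    that takes a Column and returns a new Column expression."""
    return Column("lower({})".format(col.expr))

c = Column("sentence")
print(hasattr(c, "lower"))   # False: c.lower() would raise AttributeError
print(lower(c).expr)         # lower(sentence)
```

Because the class has no `lower` attribute, `c.lower()` fails, while the free function `lower(c)` works; the real PySpark API is shaped the same way.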

This is how I managed to do it (these lines form the body of `removePunctuation`):

lowered = lower(column)
np_lowered = regexp_replace(lowered, r'[^\w\s]', '')
trimmed_np_lowered = trim(np_lowered)

return trimmed_np_lowered
Leonel Atencio
   return trim(lower(regexp_replace(column, r"\p{Punct}", ""))).alias('sentence')
Abdalrahman