
I have a DataFrame like this:

+-----+--------------------+
|LABEL|                TERM|
+-----+--------------------+
|    4|  inhibitori_effect|
|    4|    novel_therapeut|
|    4| antiinflammator...|
|    4|    promis_approach|
|    4|      cell_function|
|    4|          cell_line|
|    4|        cancer_cell|

I want to create a new DataFrame that collects all terms for each label into a sequence, so that I can use them with Word2Vec. That is:

+-----+--------------------+
|LABEL|                TERM|
+-----+--------------------+
|    4|  inhibitori_effect, novel_therapeut,..., cell_line |

Ultimately I want to apply the Word2Vec sample code given here: https://spark.apache.org/docs/latest/ml-features.html#word2vec

So far I have tried converting the DataFrame to an RDD and mapping over it, but then I could not manage to convert the result back to a DataFrame.
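For reference, the RDD round trip I attempted looks roughly like this (a sketch; it assumes a `sqlContext` and a `df` with the (LABEL, TERM) schema shown above, and that LABEL is an integer column):

```scala
import org.apache.spark.sql.Row
import sqlContext.implicits._ // for .toDF

// Group the terms per label on the RDD, then convert back to a DataFrame.
val grouped = df.rdd
  .map { case Row(label: Int, term: String) => (label, term) }
  .groupByKey()
  .map { case (label, terms) => (label, terms.toSeq) }
  .toDF("LABEL", "TERM")
```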

Thanks in advance.

EDIT:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.collect_list

val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlContext: SQLContext = new HiveContext(sc)
import sqlContext.implicits._ // needed for the $"..." column syntax

val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:oracle:thin:...",
  "dbtable" -> "table"))

df.show(20)

df.groupBy($"label").agg(collect_list($"term").alias("term"))

1 Answer


You can use collect_list or collect_set functions:

import org.apache.spark.sql.functions.{collect_list, collect_set}

df.groupBy($"label").agg(collect_list($"term").alias("term"))

In Spark < 2.0 this requires a HiveContext, and in Spark 2.0+ you have to enable Hive support on the SparkSession builder (SparkSession.builder.enableHiveSupport()). See Use collect_list and collect_set in Spark SQL
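To connect this back to the Word2Vec use case in the question, a minimal end-to-end sketch (the column names are assumed to match the question's label/term schema; the Word2Vec parameter values are illustrative):

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.functions.collect_list

// Collect the terms for each label into an array column...
val sequences = df.groupBy($"label")
  .agg(collect_list($"term").alias("term"))

// ...and feed the array column to Word2Vec, which expects
// sequences of strings as its input column.
val word2Vec = new Word2Vec()
  .setInputCol("term")
  .setOutputCol("result")
  .setVectorSize(100)
  .setMinCount(0)

val model = word2Vec.fit(sequences)
```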

Community
  • 1
  • 1
zero323
  • 322,348
  • 103
  • 959
  • 935
  • I am using Spark 1.4.1-hadoop2.6.0.jar. I have tried it as can be seen through the edited post above. Still can't use those functions. What am I missing? – mlee_jordan May 12 '16 at 15:01
  • As far as I remember these are not available in 1.4 (you should really update though. There has been a huge boost in performance and functionality since then, not to mention the upcoming 2.0 introduces a few breaking changes). In 1.4 you should be able to use a raw SQL query, for example like [here](http://stackoverflow.com/a/34296928/1560062). – zero323 May 12 '16 at 15:10
  • Ok now when I update it to 1.6 I could compile. However, this time I got the following error: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found – mlee_jordan May 12 '16 at 15:24
  • Yes finally it works. It was my bad not to put all necessary jar files found in spark/lib folder. When using all my issue was solved. Thanks @zero323! – mlee_jordan May 12 '16 at 16:14
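For anyone stuck on 1.4, the raw-SQL workaround mentioned in the comments would look roughly like this (a sketch; collect_list here is the Hive UDAF, so a HiveContext with the Hive jars on the classpath is still required):

```scala
// Expose the DataFrame to SQL and call the Hive collect_list UDAF directly.
df.registerTempTable("df")

val grouped = sqlContext.sql(
  "SELECT label, collect_list(term) AS term FROM df GROUP BY label")
```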