Questions tagged [udf]

A user-defined function (UDF) is a function provided by the user of a program or environment, in a context where the usual assumption is that functions are built into the program or environment. Although the term is widely known in Hadoop components such as Hive and Pig, it is also used in other contexts, such as programming languages and some DBMSs.
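As a small illustration of the DBMS case, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. SQLite is only one example of a DBMS that accepts UDFs, and the function name `reverse_text` is hypothetical.

```python
import sqlite3


def reverse_text(value):
    """Plain Python function that will be exposed to SQL as a UDF."""
    # SQL NULL arrives as None; pass it through unchanged.
    return None if value is None else value[::-1]


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Register the function under a SQL-visible name; 1 is the argument count.
conn.create_function("reverse_text", 1, reverse_text)

# The UDF can now be called like a built-in SQL function.
result = conn.execute("SELECT reverse_text('udf')").fetchone()[0]
print(result)
```

The database engine has no built-in `reverse_text`; the user supplies it, which is the defining property of a UDF.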

From the docs:

Introduction

Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in three languages: Java, Python, and JavaScript.

The most extensive support is provided for Java functions. You can customize all parts of the processing including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig and because additional interfaces are supported such as the Algebraic Interface and the Accumulator Interface.

Limited support is provided for Python and JavaScript functions. These functions are new, still evolving, additions to the system. Currently only the basic interface is supported; load/store functions are not supported. Furthermore, JavaScript is provided as an experimental feature because it did not go through the same amount of testing as Java or Python. At runtime note that Pig will automatically detect the usage of a scripting UDF in the Pig script and will automatically ship the corresponding scripting jar, either Jython or Rhino, to the backend.
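A minimal sketch of what a Python scripting UDF using the basic interface might look like. The file name `udfs.py`, the function `normalize`, and the schema field name are hypothetical; the `try`/`except` fallback is an assumption added only so the file can be imported and tested outside Pig.

```python
# udfs.py (hypothetical file name)
try:
    # Pig's Python support provides an outputSchema decorator; depending on
    # how the script is registered it may already be available without import.
    from pig_util import outputSchema
except ImportError:
    # Stand-in for running outside Pig (assumption, for local testing only).
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("normalized:chararray")
def normalize(value):
    """Trim whitespace and lower-case a string field; nulls pass through."""
    if value is None:
        return None
    return value.strip().lower()
```

In a Pig script, such a file would typically be registered with `REGISTER 'udfs.py' USING jython AS myfuncs;` and then called as `myfuncs.normalize(name)`, where the alias `myfuncs` and the field `name` are placeholders.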

537 questions
25 votes, 3 answers

Spark UDF for StructType / Row

I have a "StructType" column in a Spark DataFrame that has an array and a string as sub-fields. I'd like to modify the array and return the new column of the same type. Can I process it with a UDF? Or what are the alternatives? import…
Danil Kirsanov
25 votes, 2 answers

Unable to use an existing Hive permanent UDF from Spark SQL

I have previously registered a UDF with hive. It is permanent not TEMPORARY. It works in beeline. CREATE FUNCTION normaliseURL AS 'com.example.hive.udfs.NormaliseURL' USING JAR 'hdfs://udfs/hive-udfs.jar'; I have spark configured to use the hive…
Rob Cowie
19 votes, 2 answers

Spark SQL nested withColumn

I have a DataFrame that has multiple columns of which some of them are structs. Something like this root |-- foo: struct (nullable = true) | |-- bar: string (nullable = true) | |-- baz: string (nullable = true) |-- abc: array (nullable =…
Jon
18 votes, 1 answer

How to allow sklearn K Nearest Neighbors to take custom distance metric?

I have a custom distance metric that I need to use for KNN, K Nearest Neighbors. I tried following this, but I cannot get it to work for some reason. I would assume that the distance metric is supposed to take two vectors/arrays of the same…
makansij
18 votes, 1 answer

Spark UDF with varargs

Is the only option to list all the arguments up to 22, as shown in the documentation? https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration Has anyone figured out how to do something similar to this?…
devopslife
13 votes, 8 answers

Is there a way to measure string similarity in Google BigQuery

I'm wondering if anyone knows of a way to measure string similarity in BigQuery. It seems like it would be a neat function to have. In my case I need to compare the similarity of two URLs, as I want to be fairly sure they refer to the same article. I can…
andrewm4894
12 votes, 1 answer

BigQuery User Defined Aggregation Function?

I know I can define a User Defined Function in order to perform some custom calculation. I also know I can use the 'out-of-the-box' aggregation functions to reduce a collection of values to a single value when using a GROUP BY clause. Is it possible…
Stewart_R
11 votes, 1 answer

Hive UDF for selecting all except some columns

The common query building pattern in HiveQL (and SQL in general) is to either select all columns (SELECT *) or an explicitly-specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.…
Sim
11 votes, 3 answers

Spark SQL grouping: Add to group by or wrap in first() if you don't care which value you get.;

I have a query in Spark SQL like select count(ts), truncToHour(ts) from myTable group by truncToHour(ts). Where ts is of timestamp type, truncToHour is a UDF that truncates the timestamp to hour. This query does not work. If I try, select…
Paul Z Wu
10 votes, 1 answer

Difference between a map and udf

When I work with DataFrames in Spark, I sometimes have to edit only the values of a particular column in that DataFrame. For example, if I have a count field in my DataFrame and would like to add 1 to every value of count, then I could either write…
void
10 votes, 3 answers

Making API call as part of UDF in BigQuery - possible?

I'm wondering if it would be possible to make an API call to the Google Maps Geocoding API within a UDF in BigQuery? I have Google Analytics geo fields such as { "geoNetwork_continent": "Europe", "geoNetwork_subContinent": "Eastern…
8 votes, 3 answers

Select all columns of a Hive Struct

I have a requirement to select * from all columns of a Hive struct. The Hive create table script is below: Create Table script. Select * from the table displays each struct as a column: select * from table. The requirement I have is to display all…
Abhijit Nayak
8 votes, 3 answers

How to deal with Spark UDF input/output of primitive nullable type

The issues: 1) Spark doesn't call UDF if input is column of primitive type that contains null: inputDF.show() +-----+ | x | +-----+ | null| | 1.0| +-----+ inputDF .withColumn("y", udf { (x: Double) => 2.0 }.apply($"x") // will not be…
Artur Rashitov
8 votes, 2 answers

Installing MySQL libmysqlclient-dev and UDF files on Mac OSX

I am trying to install the following package on my Mac in order to test my API in my local environment, but thus far I have not succeeded. https://github.com/spachev/mysql_udf_bundle I have tried various things such as: brew install…
Ben Carey
8 votes, 2 answers

User Sub with Optional parameters - not visible in Macro window

I have a macro that goes through column(s) and removes numbers from all cells in the range. I would like to add an optional parameter, so I can call the sub while telling it which columns to run on. Here's what I have: Sub…
BruceWayne