0

Some of my data in snowflake is in json str format, but in actual it is list of floats. I used an udf to convert json str to list of floats but seems like snowflake is internally auto converting list of floats to string format again. Just want to know if this is how it works with snowflake or is there any better method to store list of floats in their actual format. I don't want to process the data everytime to convert it from json str to list of floats.

Use below code to demostrate the problem

connection_parameters = {
        "account": "MY_ACCOUNT"
        "user": "USER",
        "password": "PASSWORD",
        "role": "MY_ROLE",
        "warehouse": "MY_WH",
        "database": "MY_DB",
        "schema": "MY_SCHEMA"
    }
table = "MY_TABLE"
sf_session = Session.builder.configs(connection_parameters).create()
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import ArrayType, DoubleType, StringType
import json
from typing import List

def parse_embedding_from_string(x: str) -> List[float]:
    res = json.loads(x)
    return res

retrieve_embedding = udf(parse_embedding_from_string)


df = sf_session.createDataFrame(data=[['[0.4, 2.57, 3.47]'], ['[34.50, 16.34, 12.9]'], ['[413.0, 1.211, 8.41]'],  ['[0.4, 8.1, 10.11]'], ['[-6.89, 7.1, -12.1]'], ['[14.0, -21.0, 3.12]'], ['[11.0, 44.1, 26.2]'], ['[-4.4, 5.8, -0.10]']], schema=["embedding"])

df = df.withColumn("embedding_new", retrieve_embedding(col("embedding")))

# Output - 
df.toPandas().iloc[0]["EMBEDDING_NEW"]

Below is the output

'[\n 0.4,\n 2.57,\n 3.47\n]'

NikhilKV
  • 48
  • 6

2 Answers2

0

The Snowflake connectors do not support passing arrays in either direction. Passing an array will convert it to a string formatted as JSON. On the Python side, you can parse the string to convert it back to an array.

When sending data to Snowflake, esp. using a bind variable, you can convert arrays to a JSON-formatted string and use Snowflake's parse_json function to convert it back to an array. There's a good example showing that here:

https://community.snowflake.com/s/article/HowTo-Programmatically-insert-the-array-data-using-the-bing-variable-via-python-connector

On the Python side, you can do something like this after retrieving the array as a string:

import ast
my_array = ast.literal_eval(input_string)

There is a full explanation of that here:

How to convert string representation of list to a list

Greg Pavlik
  • 10,089
  • 2
  • 12
  • 29
  • Thanks for confirming that. I just want to bring up that using json.loads(input_string) is significantly faster compared to literal_eval. – NikhilKV Dec 08 '22 at 08:03
  • Interesting because I had that first, but then I decided to look it up in the SO Python section to see if there was a more canonical way to handle arrays specifically. Thanks for the information on speed comparisons between the two. – Greg Pavlik Dec 08 '22 at 14:38
0

The simplest way is to do:

import json
json.load(input_string)
orellabac
  • 2,077
  • 2
  • 26
  • 34