Questions tagged [spark-koalas]

Koalas is an implementation of the pandas API on top of Apache Spark.

To learn more about Koalas, see the project documentation; as of Spark 3.2 the project has been merged into PySpark as pyspark.pandas (the pandas API on Spark).

120 questions
9 votes, 1 answer

What does this mean? WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set

I am working in Python in a Jupyter notebook, and I got this warning: WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. I tried to remove it, but I couldn't. I tried to set PYARROW_IGNORE_TIMEZONE to 1, as I saw on some…
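
A commonly suggested way to address this warning, sketched below as an assumption rather than the accepted answer: the check runs when pandas-on-Spark (or Koalas) is imported, so the environment variable has to be set before that import.

import os

# Set the variable before importing pandas-on-Spark / Koalas, since the
# warning is emitted at import time.
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

import pyspark.pandas as ps  # on older setups: import databricks.koalas as ks

psdf = ps.DataFrame({"a": [1, 2, 3]})
print(psdf)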
6 votes, 0 answers

Why is Pandas-API-on-Spark's apply on groups way slower than the pyspark API?

I'm getting strange performance results when comparing the two APIs in pyspark 3.2.1 that provide the ability to run a pandas UDF on grouped results of a Spark DataFrame: df.groupBy().applyInPandas() and ps_df.groupby().apply() - a new way of apply introduced…
Mariusz • 13,481
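
As a rough illustration of the two code paths being compared (the column names below are invented, not taken from the question), here is the same per-group transformation written against both APIs. One practical difference: pandas-on-Spark's groupby().apply() infers the output schema from a sampled run when no return-type hint is given, which adds overhead that applyInPandas avoids via its explicit schema.

import pandas as pd
import pyspark.pandas as ps

# pandas-on-Spark frame with a grouping key (made-up data).
psdf = ps.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
sdf = psdf.to_spark()

def center(pdf: pd.DataFrame):
    # Each group arrives here as a plain pandas DataFrame.
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

# Plain PySpark: explicit output schema for the pandas UDF on each group.
out_spark = sdf.groupBy("key").applyInPandas(center, schema="key string, value double")

# pandas-on-Spark: same logic; schema is inferred from a sampled run.
out_ps = psdf.groupby("key").apply(center)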
5 votes, 1 answer

Koalas / pyspark Failed to find data source: delta

When I try to write a Koalas DataFrame directly to a Delta table using koalas.DataFrame.to_delta() locally, I get the following PySpark exception: java.lang.ClassNotFoundException: Failed to find data source: delta EDIT: ignore below, the problem…
zyd • 833
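
One frequent cause when running locally, sketched below as an assumption (the package coordinates, version, and path are illustrative, not from the question): the Delta Lake data source is simply not on the Spark classpath, so the session has to be configured with the delta-core package and the Delta SQL extension before to_delta() is called.

import databricks.koalas as ks
from pyspark.sql import SparkSession

# Assumed fix: make Delta Lake available to the local session.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

kdf = ks.DataFrame({"id": [1, 2, 3]})
kdf.to_delta("/tmp/example_delta_table")  # hypothetical local path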
4 votes, 1 answer

PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array

I am trying to create a new column in a Koalas DataFrame df. The DataFrame has 2 columns: col1 and col2. I need to create a new column newcol as the median of the col1 and col2 values. import numpy as np import databricks.koalas as ks # df is a Koalas dataframe df…
Fluxy • 2,838
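
A possible Koalas-native workaround for the question above (an assumption, not the accepted answer): the error appears when a Koalas Series is handed to something like np.median(), which tries to iterate it on the driver. Staying in column arithmetic avoids that, and for exactly two values the median equals their mean.

import databricks.koalas as ks

# Mirrors the question's two columns; the values are made up.
df = ks.DataFrame({"col1": [1.0, 4.0, 7.0], "col2": [3.0, 2.0, 9.0]})

# Median of two values per row == their mean; plain column arithmetic stays distributed.
df["newcol"] = (df["col1"] + df["col2"]) / 2
print(df.head())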
4 votes, 1 answer

databricks.koalas has no attribute 'qcut' for decile

I am using Koalas in Databricks and trying to decile the data, so I used df['Decile'] = ks.qcut(df['Id'], q=10, labels=False) and I am getting AttributeError: module 'databricks.koalas' has no attribute 'qcut'. Is there a workaround?
Santoo • 355
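
One workaround sketch (my assumption, not necessarily the accepted answer): since databricks.koalas has no qcut, drop to the underlying Spark DataFrame, assign deciles with the ntile window function, and convert back.

import databricks.koalas as ks
from pyspark.sql import functions as F, Window

kdf = ks.DataFrame({"Id": list(range(1, 101))})  # stand-in data

sdf = kdf.to_spark()
# ntile(10) numbers the ordered rows 1..10; subtract 1 to mimic labels=False (0-based).
sdf = sdf.withColumn("Decile", F.ntile(10).over(Window.orderBy(F.col("Id"))) - 1)

kdf = sdf.to_koalas()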
3 votes, 1 answer

Using koalas instead of pandas for the numpy where function

I am new to Koalas. I have been told to use Koalas instead of pandas in my work. Previously, when we had a DataFrame we converted it to pandas and used np.where with a condition check inside. For example, in pandas we used to do…
Joe1988 • 131
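
A hedged sketch of one Koalas-side alternative to the pandas np.where pattern (the column name and threshold are invented for illustration): apply a per-value lambda on the Series, which keeps the data distributed instead of converting to pandas first.

import databricks.koalas as ks

kdf = ks.DataFrame({"score": [10, 55, 80, 30]})

# Roughly np.where(kdf["score"] > 50, "high", "low"), expressed per value.
kdf["label"] = kdf["score"].apply(lambda v: "high" if v > 50 else "low")
print(kdf)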
3 votes, 1 answer

Split a koalas column of lists into multiple columns

How do I go from df to df1 where df and df1 are shown below? df = koalas.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)],'teams1':[np.random.randint(0,10) for _ in range(7)]}) df output: teams teams1 0 [SF, NYG] 0 1 [SF, NYG] 5 2…
figs_and_nuts • 4,870
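
One possible approach, sketched as an assumption rather than the accepted answer: do the splitting on the Spark side with getItem() on the array column, then return to Koalas. The frame below mirrors the question's example; the output column names are invented.

import numpy as np
import databricks.koalas as ks
from pyspark.sql import functions as F

df = ks.DataFrame({
    "teams": [["SF", "NYG"] for _ in range(7)],
    "teams1": [np.random.randint(0, 10) for _ in range(7)],
})

sdf = df.to_spark()
sdf = (sdf
       .withColumn("team_a", F.col("teams").getItem(0))   # first list element
       .withColumn("team_b", F.col("teams").getItem(1)))  # second list element
df1 = sdf.to_koalas()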
3 votes, 2 answers

cannot assign a koalas series as a new column in koalas

I am not able to assign a series as a new column to a koalas dataframe. Below is the codebase that I am using: from databricks import…
figs_and_nuts • 4,870
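
The fix that usually comes up for this situation, shown below as a sketch (the frame contents are invented): Koalas refuses to combine a Series and a DataFrame that come from different source frames unless compute.ops_on_diff_frames is enabled.

import databricks.koalas as ks

# Allow operations that combine Series/DataFrames built from different sources.
ks.set_option("compute.ops_on_diff_frames", True)

kdf = ks.DataFrame({"a": [1, 2, 3]})
new_col = ks.Series([10, 20, 30])
kdf["b"] = new_col  # works once the option above is set
print(kdf)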
3 votes, 0 answers

Distributed index in pandas on Spark (Koalas) does not work as expected

There are 3 different kinds of default indexes in pandas on Spark, and I am not able to replicate their documented behavior. Setting up to test: import pyspark.pandas as ps import pandas as pd import numpy as np import pyspark from pyspark.sql import…
figs_and_nuts • 4,870
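
For reference, a minimal sketch of how the default index type is selected; this only restates the documented option names and does not diagnose the behaviour the question reports.

import pyspark.pandas as ps

# The documented choices are "sequence", "distributed-sequence", and "distributed".
ps.set_option("compute.default_index_type", "distributed")

psdf = ps.DataFrame({"x": list(range(5))})  # the attached index follows the option above
print(psdf.index)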
3 votes, 1 answer

How to speed up head function execution time in Koalas?

For large datasets, the koalas.head(n) function takes a really long time. I understand that it tries to bring all the data back to the driver node and then present the top n rows. Is there any quick way to analyse the top n rows in Koalas such that…
Mohit Jain • 733
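
A hedged workaround sketch (an assumption, not the accepted answer): take the first n rows on the Spark side with limit(), which avoids pulling the whole dataset back to the driver, and only then materialise the small result.

import databricks.koalas as ks

kdf = ks.range(1_000_000)  # stand-in for a large dataset

# limit(10) restricts the rows before anything is brought back.
top10 = kdf.to_spark().limit(10).to_koalas()
print(top10)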
3 votes, 1 answer

Koalas GroupBy > Apply > Lambda > Series

I am trying to port some code from Pandas to Koalas to take advantage of Spark's distributed processing. I am taking a dataframe and grouping it on A and B and then applying a series of functions to populate the columns of the new dataframe. Here is…
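
As a general sketch of the pattern being ported (the columns and the aggregation below are invented, not taken from the question): Koalas groupby().apply() hands each (A, B) group to the function as a plain pandas DataFrame, and the returned pandas object becomes part of the result.

import pandas as pd
import databricks.koalas as ks

kdf = ks.DataFrame({
    "A": ["x", "x", "y"],
    "B": [1, 1, 2],
    "value": [10.0, 20.0, 5.0],
})

def summarise(pdf: pd.DataFrame):
    # One (A, B) group arrives here as a pandas DataFrame.
    return pd.DataFrame({"total": [pdf["value"].sum()], "rows": [len(pdf)]})

out = kdf.groupby(["A", "B"]).apply(summarise)
print(out)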
3 votes, 2 answers

Databricks Koalas Column Assignment Based on Another Column Value Lambda Function

Given a Koalas DataFrame: df = ks.DataFrame({"high_risk": [0, 1, 0, 1, 1], "medium_risk": [1, 0, 0, 0, 0] }) Running a lambda function to get a new column based on the existing column values: df =…
ratchet • 195
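
A hedged sketch of one way to derive a new column from an existing flag with a lambda (the label values are invented; this is not the accepted answer): Series.apply evaluates the lambda per value, and the result can be assigned straight back because it is derived from the same frame.

import databricks.koalas as ks

df = ks.DataFrame({"high_risk": [0, 1, 0, 1, 1],
                   "medium_risk": [1, 0, 0, 0, 0]})

# Lambda per value on one column; no ops_on_diff_frames option is needed
# because the new Series comes from the same DataFrame.
df["high_risk_label"] = df["high_risk"].apply(lambda v: "yes" if v == 1 else "no")
print(df)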
2 votes, 1 answer

AttributeError: 'DataFrame' object has no attribute 'randomSplit'

I am trying to split my data into train and test sets. The data is a Koalas dataframe. However, when I run the below code I am getting the error: AttributeError: 'DataFrame' object has no attribute 'randomSplit' Please find below the code I am…
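
A commonly suggested workaround, sketched here as an assumption rather than the accepted answer: randomSplit() is a method of the underlying Spark DataFrame, not of the Koalas DataFrame, so convert, split, and convert back.

import databricks.koalas as ks

kdf = ks.DataFrame({"feature": list(range(100)),
                    "label": [i % 2 for i in range(100)]})

# Split on the Spark side, then return both halves to Koalas.
train_sdf, test_sdf = kdf.to_spark().randomSplit([0.8, 0.2], seed=42)
train_kdf, test_kdf = train_sdf.to_koalas(), test_sdf.to_koalas()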