Questions tagged [spark-koalas]

Koalas is an implementation of the pandas API on top of Apache Spark.

To learn more about Koalas, see the project documentation; as of Spark 3.2 the project has been merged into PySpark as pyspark.pandas (the pandas API on Spark).

120 questions
9 votes, 1 answer

What does this mean? WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set

I am working in Python in a Jupyter notebook, and I got this warning: WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. I tried to remove it, but I couldn't. I tried to set PYARROW_IGNORE_TIMEZONE to 1, as I saw on some…
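
A commonly suggested way to address this warning, sketched below as an assumption rather than the accepted answer: the check runs when pandas-on-Spark (or Koalas) is imported, so the environment variable has to be set before that import.

import os

# Set the variable before importing pandas-on-Spark / Koalas, since the
# warning is emitted at import time.
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

import pyspark.pandas as ps  # on older setups: import databricks.koalas as ks

psdf = ps.DataFrame({"a": [1, 2, 3]})
print(psdf)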
6 votes, 0 answers

Why is Pandas-API-on-Spark's apply on groups way slower than the pyspark API?

I'm getting strange performance results when comparing the two APIs in pyspark 3.2.1 that provide the ability to run a pandas UDF on grouped results of a Spark DataFrame: df.groupBy().applyInPandas() and ps_df.groupby().apply() - a new way of apply introduced…
Mariusz • 13,481
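
As a rough illustration of the two code paths being compared (the column names below are invented, not taken from the question), here is the same per-group transformation written against both APIs. One practical difference: pandas-on-Spark's groupby().apply() infers the output schema from a sampled run when no return-type hint is given, which adds overhead that applyInPandas avoids via its explicit schema.

import pandas as pd
import pyspark.pandas as ps

# pandas-on-Spark frame with a grouping key (made-up data).
psdf = ps.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 2.0, 3.0]})
sdf = psdf.to_spark()

def center(pdf: pd.DataFrame):
    # Each group arrives here as a plain pandas DataFrame.
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

# Plain PySpark: explicit output schema for the pandas UDF on each group.
out_spark = sdf.groupBy("key").applyInPandas(center, schema="key string, value double")

# pandas-on-Spark: same logic; schema is inferred from a sampled run.
out_ps = psdf.groupby("key").apply(center)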
5 votes, 1 answer

Koalas / pyspark Failed to find data source: delta

When I try to write a Koalas DataFrame directly to a Delta table using koalas.DataFrame.to_delta() locally, I get the following PySpark exception: java.lang.ClassNotFoundException: Failed to find data source: delta EDIT: ignore below, the problem…
zyd • 833
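
One frequent cause when running locally, sketched below as an assumption (the package coordinates, version, and path are illustrative, not from the question): the Delta Lake data source is simply not on the Spark classpath, so the session has to be configured with the delta-core package and the Delta SQL extension before to_delta() is called.

import databricks.koalas as ks
from pyspark.sql import SparkSession

# Assumed fix: make Delta Lake available to the local session.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

kdf = ks.DataFrame({"id": [1, 2, 3]})
kdf.to_delta("/tmp/example_delta_table")  # hypothetical local path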
4 votes, 1 answer

PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array

I am trying to create a new column in a Koalas DataFrame df. The DataFrame has 2 columns: col1 and col2. I need to create a new column newcol as the median of the col1 and col2 values. import numpy as np import databricks.koalas as ks # df is a Koalas dataframe df…
Fluxy • 2,838
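
A possible Koalas-native workaround for the question above (an assumption, not the accepted answer): the error appears when a Koalas Series is handed to something like np.median(), which tries to iterate it on the driver. Staying in column arithmetic avoids that, and for exactly two values the median equals their mean.

import databricks.koalas as ks

# Mirrors the question's two columns; the values are made up.
df = ks.DataFrame({"col1": [1.0, 4.0, 7.0], "col2": [3.0, 2.0, 9.0]})

# Median of two values per row == their mean; plain column arithmetic stays distributed.
df["newcol"] = (df["col1"] + df["col2"]) / 2
print(df.head())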
4 votes, 1 answer

databricks.koalas has no attribute 'qcut' for decile

I am using Koalas in Databricks and trying to decile the data, so I used df['Decile'] = ks.qcut(df['Id'], q=10, labels=False) and I am getting AttributeError: module 'databricks.koalas' has no attribute 'qcut'. Is there a workaround?
Santoo • 355
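
One workaround sketch (my assumption, not necessarily the accepted answer): since databricks.koalas has no qcut, drop to the underlying Spark DataFrame, assign deciles with the ntile window function, and convert back.

import databricks.koalas as ks
from pyspark.sql import functions as F, Window

kdf = ks.DataFrame({"Id": list(range(1, 101))})  # stand-in data

sdf = kdf.to_spark()
# ntile(10) numbers the ordered rows 1..10; subtract 1 to mimic labels=False (0-based).
sdf = sdf.withColumn("Decile", F.ntile(10).over(Window.orderBy(F.col("Id"))) - 1)

kdf = sdf.to_koalas()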
3 votes, 1 answer

Using koalas instead of pandas for the numpy where function

I am new to Koalas. I have been told to use Koalas instead of pandas in my work. Previously, when we had a DataFrame we converted it to pandas and used np.where with a condition check inside. For example, in pandas we used to do…
Joe1988 • 131
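
A hedged sketch of one Koalas-side alternative to the pandas np.where pattern (the column name and threshold are invented for illustration): apply a per-value lambda on the Series, which keeps the data distributed instead of converting to pandas first.

import databricks.koalas as ks

kdf = ks.DataFrame({"score": [10, 55, 80, 30]})

# Roughly np.where(kdf["score"] > 50, "high", "low"), expressed per value.
kdf["label"] = kdf["score"].apply(lambda v: "high" if v > 50 else "low")
print(kdf)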
3 votes, 1 answer

Split a koalas column of lists into multiple columns

How do I go from df to df1 where df and df1 are shown below? df = koalas.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)],'teams1':[np.random.randint(0,10) for _ in range(7)]}) df output: teams teams1 0 [SF, NYG] 0 1 [SF, NYG] 5 2…
figs_and_nuts • 4,870
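
One possible approach, sketched as an assumption rather than the accepted answer: do the splitting on the Spark side with getItem() on the array column, then return to Koalas. The frame below mirrors the question's example; the output column names are invented.

import numpy as np
import databricks.koalas as ks
from pyspark.sql import functions as F

df = ks.DataFrame({
    "teams": [["SF", "NYG"] for _ in range(7)],
    "teams1": [np.random.randint(0, 10) for _ in range(7)],
})

sdf = df.to_spark()
sdf = (sdf
       .withColumn("team_a", F.col("teams").getItem(0))   # first list element
       .withColumn("team_b", F.col("teams").getItem(1)))  # second list element
df1 = sdf.to_koalas()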
3 votes, 2 answers

cannot assign a koalas series as a new column in koalas

I am not able to assign a series as a new column to a koalas dataframe. Below is the codebase that I am using: from databricks import…
figs_and_nuts • 4,870
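
The fix that usually comes up for this situation, shown below as a sketch (the frame contents are invented): Koalas refuses to combine a Series and a DataFrame that come from different source frames unless compute.ops_on_diff_frames is enabled.

import databricks.koalas as ks

# Allow operations that combine Series/DataFrames built from different sources.
ks.set_option("compute.ops_on_diff_frames", True)

kdf = ks.DataFrame({"a": [1, 2, 3]})
new_col = ks.Series([10, 20, 30])
kdf["b"] = new_col  # works once the option above is set
print(kdf)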
3 votes, 0 answers

Distributed index in pandas on Spark (Koalas) does not work as expected

There are 3 different kinds of default indexes in pandas on Spark, and I am not able to replicate their documented behavior. Setting up to test: import pyspark.pandas as ps import pandas as pd import numpy as np import pyspark from pyspark.sql import…
figs_and_nuts • 4,870
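
For reference, a minimal sketch of how the default index type is selected; this only restates the documented option names and does not diagnose the behaviour the question reports.

import pyspark.pandas as ps

# The documented choices are "sequence", "distributed-sequence", and "distributed".
ps.set_option("compute.default_index_type", "distributed")

psdf = ps.DataFrame({"x": list(range(5))})  # the attached index follows the option above
print(psdf.index)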
3 votes, 1 answer

How to speed up head function execution time in Koalas?

For large datasets, the koalas.head(n) function takes a really long time. I understand that it tries to bring all the data back to the driver node and then present the top n rows. Is there any quick way to analyse the top n rows in Koalas such that…
Mohit Jain • 733
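
A hedged workaround sketch (an assumption, not the accepted answer): take the first n rows on the Spark side with limit(), which avoids pulling the whole dataset back to the driver, and only then materialise the small result.

import databricks.koalas as ks

kdf = ks.range(1_000_000)  # stand-in for a large dataset

# limit(10) restricts the rows before anything is brought back.
top10 = kdf.to_spark().limit(10).to_koalas()
print(top10)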
3 votes, 1 answer

Koalas GroupBy > Apply > Lambda > Series

I am trying to port some code from Pandas to Koalas to take advantage of Spark's distributed processing. I am taking a dataframe and grouping it on A and B and then applying a series of functions to populate the columns of the new dataframe. Here is…
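
As a general sketch of the pattern being ported (the columns and the aggregation below are invented, not taken from the question): Koalas groupby().apply() hands each (A, B) group to the function as a plain pandas DataFrame, and the returned pandas object becomes part of the result.

import pandas as pd
import databricks.koalas as ks

kdf = ks.DataFrame({
    "A": ["x", "x", "y"],
    "B": [1, 1, 2],
    "value": [10.0, 20.0, 5.0],
})

def summarise(pdf: pd.DataFrame):
    # One (A, B) group arrives here as a pandas DataFrame.
    return pd.DataFrame({"total": [pdf["value"].sum()], "rows": [len(pdf)]})

out = kdf.groupby(["A", "B"]).apply(summarise)
print(out)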
3 votes, 2 answers

Databricks Koalas Column Assignment Based on Another Column Value Lambda Function

Given a Koalas DataFrame: df = ks.DataFrame({"high_risk": [0, 1, 0, 1, 1], "medium_risk": [1, 0, 0, 0, 0] }) Running a lambda function to get a new column based on the existing column values: df =…
ratchet • 195
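
A hedged sketch of one way to derive a new column from an existing flag with a lambda (the label values are invented; this is not the accepted answer): Series.apply evaluates the lambda per value, and the result can be assigned straight back because it is derived from the same frame.

import databricks.koalas as ks

df = ks.DataFrame({"high_risk": [0, 1, 0, 1, 1],
                   "medium_risk": [1, 0, 0, 0, 0]})

# Lambda per value on one column; no ops_on_diff_frames option is needed
# because the new Series comes from the same DataFrame.
df["high_risk_label"] = df["high_risk"].apply(lambda v: "yes" if v == 1 else "no")
print(df)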
2 votes, 1 answer

AttributeError: 'DataFrame' object has no attribute 'randomSplit'

I am trying to split my data into train and test sets. The data is a Koalas dataframe. However, when I run the below code I am getting the error: AttributeError: 'DataFrame' object has no attribute 'randomSplit' Please find below the code I am…
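
A commonly suggested workaround, sketched here as an assumption rather than the accepted answer: randomSplit() is a method of the underlying Spark DataFrame, not of the Koalas DataFrame, so convert, split, and convert back.

import databricks.koalas as ks

kdf = ks.DataFrame({"feature": list(range(100)),
                    "label": [i % 2 for i in range(100)]})

# Split on the Spark side, then return both halves to Koalas.
train_sdf, test_sdf = kdf.to_spark().randomSplit([0.8, 0.2], seed=42)
train_kdf, test_kdf = train_sdf.to_koalas(), test_sdf.to_koalas()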