
I'd like to know the PySpark equivalent of the reset_index() method used in pandas. When I use the default call, as follows:

data.reset_index()

I get the following error:

'DataFrame' object has no attribute 'reset_index'

  • Can you add more detail to your question: what are you trying to achieve, and what is the expected outcome in tabular format? – dsk Nov 06 '20 at 05:57
  • You cannot use reset_index because Spark has no concept of an index. The DataFrame is distributed and fundamentally different from a pandas DataFrame. – mck Nov 06 '20 at 06:53
  • If you just want to attach a numerical id to the rows, you can use `monotonically_increasing_id` – user238607 Nov 06 '20 at 08:23
  • If your problem is as simple as mine, this can help: https://stackoverflow.com/questions/52318016/pyspark-add-sequential-and-deterministic-index-to-dataframe – CAV Jul 16 '21 at 22:30

1 Answer


As the other comments mentioned, if you do need to add an index to your DataFrame, you can use:

from pyspark.sql.functions import monotonically_increasing_id

# Adds a column of unique, monotonically increasing 64-bit ids
df = df.withColumn("index_column", monotonically_increasing_id())
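
Note that monotonically_increasing_id() guarantees unique and increasing ids, but not consecutive ones, since the generated values encode the partition id and leave gaps between partitions. If you want a consecutive 0-based index like pandas' default RangeIndex, a minimal sketch using row_number over a Window (the example data and column names here are made up for illustration) could look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "number"])

# Unique and increasing, but with gaps between partitions
df = df.withColumn("mono_id", monotonically_increasing_id())

# row_number() over the ordered ids yields a consecutive 1-based rank;
# subtract 1 for a 0-based index like pandas' RangeIndex
w = Window.orderBy("mono_id")
df = df.withColumn("index", row_number().over(w) - 1)

df.show()

Keep in mind that a Window with orderBy but no partitionBy moves all rows to a single partition, so this approach is only suitable for data that fits comfortably on one executor.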