
I am trying to read data from a CSV file and then transpose it. Spark doesn't seem to have a function for that, so (for now) I am reading the file into a pandas dataframe, calling transpose(), and then converting/ingesting the result into a Spark dataframe, roughly as sketched below.
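For reference, a minimal sketch of that pipeline (the file name, the SparkSession creation, and the read options are assumptions on my part):

import pandas as pd
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()

# Read the CSV with pandas, then transpose so rows become columns.
pdf = pd.read_csv("financials.csv")  # hypothetical file name
pdft = pdf.transpose()

# Convert the transposed pandas dataframe into a Spark dataframe.
df = session.createDataFrame(pdft)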

The data in pandas after transpose(), stored in pdft, looks like this:

                         0   ...                       76
name           TotalRevenue  ...  TaxEffectOfUnusualItems
ttm          94,950,000,000  ...                        0
12/31/2022   94,950,000,000  ...                        0
12/31/2021   89,113,000,000  ...                        0
12/31/2020   85,528,000,000  ...                        0
12/31/2019   91,244,000,000  ...                        0
12/31/2018   91,247,000,000  ...                        0
12/31/2017   87,352,000,000  ...                        0

Then, after df = session.createDataFrame(pdft) and df.show(5), we have this:

+--------------+-------------------+------------------+--------------------+
|             0|                  1|                 2|                   3|
+--------------+-------------------+------------------+--------------------+
|  TotalRevenue|\tNetInterestIncome|\t\tInterestIncome|\t\t\tInterestInc...|
|94,950,000,000|     52,462,000,000|    72,565,000,000|      37,919,000,000|
|94,950,000,000|     52,462,000,000|    52,462,000,000|      37,919,000,000|
|89,113,000,000|     42,934,000,000|    47,672,000,000|      29,282,000,000|
|85,528,000,000|     43,360,000,000|    51,585,000,000|      34,029,000,000|
+--------------+-------------------+------------------+--------------------+
only showing top 5 rows

The first column of the transposed data is not present in the Spark dataframe. How can I get it back? Thanks!

  • Does this answer your question? [Converting pandas dataframe to PySpark dataframe drops index](https://stackoverflow.com/questions/68606518/converting-pandas-dataframe-to-pyspark-dataframe-drops-index) – krezno Jul 02 '23 at 16:57
  • I suspect what you're treating as the first column is actually the pandas dataframe's index. Try `reset_index()` before passing to `createDataFrame` – samkart Jul 03 '23 at 06:29

2 Answers


The name column of your pandas dataframe is actually its index, and createDataFrame does not take the index into account. Try data = spark.createDataFrame(df.reset_index()).

Illustration:

df = pd.DataFrame([[1, 3], [2, 4]], columns=['a', 'b'])
spark.createDataFrame(df).show()
+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
+---+---+

df = df.set_index('a')
spark.createDataFrame(df).show()
+---+
|  b|
+---+
|  3|
|  4|
+---+

spark.createDataFrame(df.reset_index()).show()
+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
+---+---+
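
Applied to the pdft from the question, this would look like the sketch below (the "period" column name is just an illustrative choice):

# reset_index() turns the index (name, ttm, 12/31/2022, ...) back into a
# regular column, which is named "index" by default.
df = session.createDataFrame(pdft.reset_index())

# Optionally give the restored column a clearer name ("period" is arbitrary).
df = df.withColumnRenamed("index", "period")
df.show(5)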

If you would rather not go through pandas, here is a PySpark solution for transposing a dataframe.

data = spark.createDataFrame([[1, 2, 3], [4, 6, 7]])
data.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  4|  6|  7|
+---+---+---+

from pyspark.sql import functions as F

# The idea is to explode all the rows and create one record per column,
# containing the column value (c) and its column index (i). We also add an
# index to keep track of the original row each record was in.
# Then we group by column index, pivot by row index, and that's it.
data\
    .withColumn("index", F.monotonically_increasing_id())\
    .select("index", F.explode(
         F.array(*[
             F.struct(F.lit(i).alias("i"), F.col(data.columns[i]).alias("c"))
             for i in range(len(data.columns))
         ])
     ).alias("value"))\
     .groupBy("value.i").pivot("index").agg(F.first(F.col("value.c")))\
     .orderBy("i").drop("i").show()
+-----------+------------+
|60129542144|128849018880|
+-----------+------------+
|          1|           4|
|          2|           6|
|          3|           7|
+-----------+------------+
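
Note that the result's column names (60129542144 and 128849018880 above) are the raw values produced by monotonically_increasing_id, which are unique but not consecutive. If you want friendlier names, one option is sketched below; the row_{i} naming scheme is just an illustrative choice:

# Build the transposed dataframe as above, keeping it instead of calling .show().
transposed = data\
    .withColumn("index", F.monotonically_increasing_id())\
    .select("index", F.explode(
         F.array(*[
             F.struct(F.lit(i).alias("i"), F.col(data.columns[i]).alias("c"))
             for i in range(len(data.columns))
         ])
     ).alias("value"))\
    .groupBy("value.i").pivot("index").agg(F.first(F.col("value.c")))\
    .orderBy("i").drop("i")

# toDF renames all columns positionally; row_0, row_1, ... is an arbitrary scheme.
transposed.toDF(*[f"row_{i}" for i in range(len(transposed.columns))]).show()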