
Does anyone know how I can do these calculations in PySpark?

import pandas as pd

data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'Age': [20, 21, 19, 18],
    'CSP': [2, 6, 8, 7],
    'coef': [2, 2, 3, 3]
}

# Create DataFrame
df = pd.DataFrame(data)
colsToRecalculate = ['Age', 'CSP']

# Divide each listed column by the 'coef' column
for col in colsToRecalculate:
    df[col] = df[col] / df['coef']
thebluephantom
There are some good answers related to this [here](https://stackoverflow.com/q/33681487/8279585) – samkart Aug 22 '22 at 15:08

2 Answers


You can use select() on a Spark DataFrame and pass multiple columns (each with its own calculation) as arguments. In your case:

from pyspark.sql import functions as F

df2 = spark.createDataFrame(pd.DataFrame(data))
df2.select(*[(F.col(c) / F.col('coef')).alias(c) for c in colsToRecalculate], 'coef').show()
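
Note that this select keeps only the recalculated columns plus coef, so Name is dropped. If you are on Spark 3.3 or later, a sketch using withColumns (assuming the same df2 and colsToRecalculate as above) replaces the listed columns in place and keeps everything else:

# Spark 3.3+ only: withColumns replaces each listed column in place,
# so untouched columns such as 'Name' are kept automatically.
df2.withColumns({c: F.col(c) / F.col('coef') for c in colsToRecalculate}).show()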
bzu

A slight variation on bzu's answer that keeps the non-listed columns within the same select(). We can iterate over dataframe.columns and check each column against the colsToRecalculate list: if the column is in the list, apply the calculation; otherwise, select the column as is.

from pyspark.sql import functions as func

# divide the listed columns by 'coef'; pass all other columns through unchanged
data_sdf. \
    select(*[(func.col(k) / func.col('coef')).alias(k) if k in colsToRecalculate else k
             for k in data_sdf.columns])
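
To try it on the question's data: data_sdf is this answer's placeholder for your Spark DataFrame, built below from the same pandas frame (assuming an active spark session):

data_sdf = spark.createDataFrame(pd.DataFrame(data))

data_sdf. \
    select(*[(func.col(k) / func.col('coef')).alias(k) if k in colsToRecalculate else k
             for k in data_sdf.columns]). \
    show()

# 'Name' and 'coef' pass through unchanged, while
# Age becomes [10.0, 10.5, 6.33..., 6.0] and CSP becomes [1.0, 3.0, 2.67..., 2.33...]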
samkart