
I have been reading documentation for a few hours now and I feel I am approaching the problem with the wrong mindset.

I have two tables in Hive which I read with `spark.table(table_A)`. They have the same number and types of columns, but different origins, so their data is different. Both tables hold flags that show whether or not a condition is met. There are at least 20 columns, and the count could grow in the future.

If the first row of table_A is 0 0 1 1 0 and the corresponding row of table_B is 0 1 0 1 0, I would like the result to be the position-wise XNOR of the two: 1 0 0 1 1, since the values match in the first, fourth, and fifth positions.

That is why I thought of the XNOR operation, which returns 1 if both values match and 0 otherwise.
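
To pin down the logic, here is a minimal truth table for XNOR on 0/1 flags in plain Python (just an illustration, not Spark code):

    # XNOR for 0/1 flags: 1 when the values match, 0 otherwise.
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, 1 - (a ^ b))  # equivalently: int(a == b)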

I am facing a number of problems. One of them is the volume of my data (right now I am working with a one-week sample and it is already around the 300 MB mark), so I am working with PySpark and avoiding pandas, since the data usually does not fit in memory and/or slows the operation down a lot.

Summing up, I have two objects of type pyspark.sql.dataframe.DataFrame, each has one of the tables, and so far the best I've got is something like this:

    df_bitwise = df_flags_A.flag_column_A.bitwiseXOR(df_flags_B.flag_columns_B)

But sadly this returns a pyspark.sql.column.Column, and I do not know how to read that result or how to build a DataFrame from it (I would like the end result to be roughly 20 of the above operations, one per column, each forming a column of a DataFrame).
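
For context, the closest I can picture working is something like the sketch below (untested). It assumes both DataFrames share a join key (the id column is my invention; Spark rows have no intrinsic order, so some key is needed to align rows) and that the flag columns carry the same names in both tables:

    import pyspark.sql.functions as F

    # Assumed: both tables have an "id" key (made up for this sketch) and
    # identical flag column names.
    a = df_flags_A.alias("a")
    b = df_flags_B.alias("b")

    flag_cols = [c for c in df_flags_A.columns if c != "id"]

    # 1 - XOR gives XNOR for 0/1 integer flags; one expression per column.
    df_xnor = a.join(b, on="id").select(
        "id",
        *[(1 - F.col("a." + c).bitwiseXOR(F.col("b." + c))).alias(c)
          for c in flag_cols]
    )

If I understand correctly, a Column is only an expression; it produces data once it is used inside a select (or withColumn) on a DataFrame.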

What am I doing wrong? I feel like this is not the right approach.

monkey intern
    A [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples) would be helpful but IIUC, you should be able to follow the approach [here](https://stackoverflow.com/questions/32670958/spark-dataframe-computing-row-wise-mean-or-any-aggregate-operation) and [here](https://stackoverflow.com/questions/33882894/sparksql-apply-aggregate-functions-to-a-list-of-column). – pault Dec 10 '18 at 15:27
  • I will further read about spark reproducible examples tomorrow, that's a needed improvement in my question (I always fight between being overly verbose and not giving enough info/data). However the second link points to what I was suspecting, it is simply easier to go the SQL route to do this. That's very interesting to me, but a bit weird (why not simply write a query for this task?) – monkey intern Dec 10 '18 at 15:30
  • You don't need to go the SQL route (though I'm still not 100% clear on what you're trying to do)- you can write a list comprehension over the columns in your two DataFrames. – pault Dec 10 '18 at 15:32
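
For completeness, my understanding of the SQL route mentioned in the comments is a sketch like this (the temp-view names, the id key, and flag_cols, a list of the shared flag column names, are all assumptions; `^` is bitwise XOR in Spark SQL):

    # Register both DataFrames as temp views so they can be queried with SQL.
    df_flags_A.createOrReplaceTempView("flags_a")
    df_flags_B.createOrReplaceTempView("flags_b")

    # Build one "1 - (a.col ^ b.col)" expression per flag column.
    exprs = ", ".join(
        "1 - (a.{c} ^ b.{c}) AS {c}".format(c=c) for c in flag_cols
    )
    df_xnor_sql = spark.sql(
        "SELECT a.id, " + exprs + " FROM flags_a a JOIN flags_b b ON a.id = b.id"
    )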

0 Answers