Suppose I am using Apache Spark to read a dataset like this:
City | Region | Population
A | A1 | 150000
A | A2 | 50000
B | B1 | 250000
C | C1 | 350000
After creating a DataFrame on top of this, suppose I repartition it by city. If I then want to know which node of my Spark cluster holds the data for city A, is it possible to find out? If so, please explain how.
Another question: how can I find the total size of the data that Spark reads into a DataFrame?