0

I have several dataframes which contains single column in them. Let's say I have 4 such dataframe all with one column. How can I form a single dataframe by combining all of them?

val df = xmldf.select(col("UserData.UserValue._valueRef"))
val df2 = xmldf.select(col("UserData.UserValue._title"))
val df3 = xmldf.select(col("author"))
val df4 = xmldf.select(col("price"))

To combine, I am trying this, but it doesn't work:

var newdf = df
newdf = newdf.withColumn("col1",df1.col("UserData.UserValue._title"))
newdf.show()

It errors out saying that field of one column are not present in another. I am not sure how can I combine these 4 dataframes together. They don't have any common column.

df2 looks like this:

+---------------+
|         _title|
+---------------+
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
+---------------+

and df looks like this:

+-----------+
|_valuegiven|
+-----------+
|        qwe|
|     dfdfrt|
|       dfdf|
+-----------+

df3 and df4 are also in same format. I want like below dataframe:

+-----------+---------------+
|_valuegiven|         _title|
+-----------+---------------+
|        qwe|_CONFIG_CONTEXT|
|     dfdfrt|_CONFIG_CONTEXT|
|       dfdf|_CONFIG_CONTEXT|
+-----------+---------------+

I used this:

val newdf = xmldf.select(col("UserData.UserValue._valuegiven"),col("UserData.UserValue._title") )
newdf.show()

But I am getting column name on the go and as such, I would need to append on the go, due to which I don't know exactly how many columns I will get. Which is why I cannot use the above command.

CodeHunter
  • 2,017
  • 2
  • 21
  • 47
  • In this case, you could load all the data and then select the columns you want. If you need to add columns simply reselect the ones you want. Then you don't need to use multiple joins and there is no need to consider that the row order is not fixed. – Shaido Jul 13 '18 at 06:27

3 Answers3

0

It's a little unclear of your goal. If asking to join these dataframes, but perhaps you just want to select those 4 columns.

val newdf = xmldf.select($"UserData.UserValue._valueRef", $"UserData.UserValue._title", 'author,'price")
newdf.show

If you really want to join all these dataframes, you'll need to join them all and select the appropriate fields.

codeaperature
  • 1,089
  • 2
  • 10
  • 25
  • if you want specific col names, use `select($"UserData.UserValue._valueRef" as "newname", `... – codeaperature Jul 12 '18 at 21:10
  • I updated what I exactly need in my question. I can't use `val newdf = xmldf.select(col("UserData.UserValue._valuegiven"),col("UserData.UserValue._title") ) newdf.show()` due to the fact that I will be getting these column names which I need to append one by one and hence can't provide them all at once as in the above statement. – CodeHunter Jul 12 '18 at 21:16
  • The reason you error out is because you are selecting `val df = xmldf.select(col("UserData.UserValue._valueRef"))` and then using the following to select a column not in df which is the same as newdf. If you want to select all items and add items you can try `xmldf.select($"*", $"UserData.UserValue._valueRef" as "valref",` ...`)` var newdf = df newdf = newdf.withColumn("col1",df1.col("UserData.UserValue._title")) newdf.show() – codeaperature Jul 12 '18 at 21:24
  • yeah. I tried this as well but it do not work. I think there should be a very simple way to append them together. I just don't know how. – CodeHunter Jul 12 '18 at 21:26
0

If the goal is to get 4 columns from xmldf into a new dataframe you shouldn't be splitting it into 4 dataframes in the first place.

You can select multiple columns from a dataframe by providing additional column names in the select function.

val newdf = xmldf.select(
    col("UserData.UserValue._valueRef"), 
    col("UserData.UserValue._title"),
    col("author"), 
    col("price"))
newdf.show()
  • That is what I couldn't do. I am getting column names one by one and hence, I do not know how many columns I will get. So, I cannot write it as a single statement. I have mentioned it in my problem statement. – CodeHunter Jul 12 '18 at 21:19
0

So I looked at various ways and finally Ram Ghadiyaram's answer in Solution 2 does what I wanted to do. Using this approach, you can combine any number of columns on the go. Basically, you need to create indexes by which you can join the dataframes together and after joining, drop the index column altogether.

CodeHunter
  • 2,017
  • 2
  • 21
  • 47