2

I want to join these two dataframes:

df1 = pd.DataFrame({"text": ["example one", "example word one", "example two", "example sentance one"],
                    "label_1": ["O O", "O W O", "O O", "O S O"]})

df2 = pd.DataFrame({"text": ["example one", "example word one", "total example", "example sentance one"],
                    "label_2": ["O N", "O O N", "O O", "O O N"]})


                   text label_1
0           example one     O O
1      example word one   O W O
2           example two     O O
3  example sentance one   O S O

                   text label_2
0           example one     O N
1      example word one   O O N
2         total example     O O
3  example sentance one   O O N

My desired output is this:

                   text label_1 label_2
0           example one     O O     O N
1      example word one   O W O   O O N
2           example two     O O     NaN
3  example sentance one   O S O   O O N
5         total example     NaN     O O

So text column should have unique values, and final dataframe should have label_1 and label_2 columns with matching values. NaN can be in the result dataframe, but the best thing would be to replace NaN with O values that matches the number of words in text column (for that row). So if that row has 3 words, NaN should be replaced with O O O.

I have tried this, but I haven't got the desired output:

df = pd.concat([df1, df2])

                   text label_1 label_2
0           example one     O O     NaN
1      example word one   O W O     NaN
2           example two     O O     NaN
3  example sentance one   O S O     NaN
0           example one     NaN     O N
1      example word one     NaN   O O N
2         total example     NaN     O O
3  example sentance one     NaN   O O N
taga
  • 3,537
  • 13
  • 53
  • 119
  • Have you tried [merging](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) them together? Also see [Pandas merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Alex Oct 20 '21 at 12:52
  • `df = df1.merge(df2, how="outer", on="text")` The rules for filling `NaN` are little bit more complex – Alex Oct 20 '21 at 12:55
  • Ok, thanks for the merging solution, but how to fill nan – taga Oct 20 '21 at 12:59

1 Answers1

3

Using a combination of DataFrame.merge and DataFrame.fillna

df = df1.merge(df2, how="outer", on="text")

fill_series = df["text"].str.split().str.len().apply(lambda x: " ".join("0" * x))
fill_cols = ["label_1", "label_2"]
fill_dict = {k: fill_series for k in fill_cols}

df = df.fillna(fill_dict)

Outputs:

#                    text label_1 label_2
# 0           example one     O O     O N
# 1      example word one   O W O   O O N
# 2           example two     O O     0 0
# 3  example sentance one   O S O   O O N
# 4         total example     0 0     O O

fill_series creates a pd.Series of strings to use for filling.
fill_cols are the columns that need filling.
fill_dict is the dictionary that applies the strings from the Series to the DataFrame.

Alex
  • 6,610
  • 3
  • 20
  • 38