0

I have two pandas DataFrames in python. DF A contains a column, which is basically sentence-length strings.

|---------------------|------------------|
|        sentenceCol  |    other column  |
|---------------------|------------------|
|'this is from france'|         15       |
|---------------------|------------------|

DF B contains a column that is a list of countries

|---------------------|------------------|
|        country      |    other column  |
|---------------------|------------------|
|'france'             |         33       |
|---------------------|------------------|
|'spain'              |         34       |
|---------------------|------------------|

How can I loop through DF A and assign which country the string contains? Here's what I imagine DF A would look like after assignment...

|---------------------|------------------|-----------|
|        sentenceCol  |    other column  | country   |
|---------------------|------------------|-----------|
|'this is from france'|         15       |  'france' |
|---------------------|------------------|-----------|

One additional complication is that there can be more than one country per sentence, so ideally this could assign every applicable country to that sentence.

|-------------------------------|------------------|-----------|
|        sentenceCol            |    other column  | country   |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'france' |
|-------------------------------|------------------|-----------|
|'this is from france and spain'|         16       |  'spain'  |
|-------------------------------|------------------|-----------|
Justin
  • 153
  • 2
  • 10

2 Answers2

3

There's no need for a loop here. Looping over a dataframe is slow and we have optimized pandas or numpy methods for almost all of our problems.

In this case, for your first problem, you are looking for Series.str.extract:

dfa['country'] = dfa['sentenceCol'].str.extract(f"({'|'.join(dfb['country'])})")

           sentenceCol  other column country
0  this is from france            15  france

For your second problem, you need Series.str.extractall with Series.drop_duplicates & to_numpy:

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .drop_duplicates()
        .to_numpy()
)

                     sentenceCol  other column country
0  this is from france and spain            15  france
1  this is from france and spain            15   spain

Edit

Or if your sentenceCol is not duplicated, we have to get the extracted values to a single row. We use GroupBy.agg:

dfa['country'] = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .groupby(level=0)
        .agg(', '.join)
        .to_numpy()
)

                     sentenceCol  other column        country
0  this is from france and spain            15  france, spain

Edit2

To duplicate the original rows. We join the dataframe back to our extraction:

extraction = (
    dfa['sentenceCol'].str.extractall(f"({'|'.join(dfb['country'])})")
        .rename(columns={0: 'country'})
)

dfa = extraction.droplevel(1).join(dfa).reset_index(drop=True)

  country                    sentenceCol  other column
0  france  this is from france and spain            15
1   spain  this is from france and spain            15

Dataframes used:

dfa = pd.DataFrame({'sentenceCol':['this is from france and spain']*2,
                   'other column':[15]*2})

dfb = pd.DataFrame({'country':['france', 'spain']})
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • 1
    Thanks @NaturalFrequency appreciate it! – Erfan Dec 29 '19 at 19:15
  • Ah, I see one problem. So my dfa in this instance does not already have the string values duplicated, so dfa would just be: `dfa = pd.DataFrame({'sentenceCol':['this is from france and spain'], 'other column':[15]})` . That is why I'm getting a Value Error: `ValueError: Length of values does not match length of index` – Justin Dec 29 '19 at 19:45
  • Not sure if I understand you, does the solution work or do you get that error? – Erfan Dec 29 '19 at 19:57
  • I get that error, because my data is not already duplicated dependent on if there are multiple country names involved. So the sentence 'this is from france and spain' is just one row in my dataset. – Justin Dec 29 '19 at 20:05
  • Thanks for the response, apologies if I'm making this confusing. I see what you did there, concatenating the country tags with a comma. But instead of that, is it possible, to actually duplicate the sentence by creating a new row with each row containing one country a piece? – Justin Dec 29 '19 at 20:15
  • I see what you mean. See **edit2**, this should work as intended @Justin – Erfan Dec 29 '19 at 20:25
  • Thank you for this great answer. However, If you don't mind me adding on - what if the sentenceCol can be duplicated, but we still want the countries commpa separated in the country column for each occurence? – bjornasm Mar 11 '20 at 21:08
0

You can iterate through a dataframe with the method iterrows(). You can try this:

# Dataframes definition
df_1 = pd.DataFrame({"sentence": ["this is from france and spain", "this is from france", "this is from germany"], "other": [15, 12, 33]})
df_2 = pd.DataFrame({"country": ["spain", "france", "germany"], "other_column": [7, 7, 8]})


# Create the new dataframe
df_3 = pd.DataFrame(columns = ["sentence", "other_column", "country"])
count=0

# Iterate through the dataframes, first through the country dataframe and inside through the sentence one.
for index, row in df_2.iterrows():
    country = row.country

    for index_2, row_2 in df_1.iterrows():
        if country in row_2.sentence:
            df_3.loc[count] = (row_2.sentence, row_2.other, country)
            count+=1

So the output is:

sentence                            other_column    country
0   this is from france and spain   15              spain
1   this is from france and spain   15              france
2   this is from france             12              france
3   this is from germany            33              germany
sergiomahi
  • 964
  • 2
  • 8
  • 21
  • 1
    Thanks! I believe this did work when testing it out, but it also took ~5 mins to run (about 9000 sentence rows and 270 country rows) – Justin Dec 29 '19 at 19:29