1

I have a dataframe with two columns 'text' and 'lang' and I need to extract the groups (unique) of 'text' values that have the same number N of languages. For example:

For the following example dataframe:

text     lang
--------------
text_a   en
text_b   es
text_a   es
text_a   it
text_c   de
text_c   pt
text_d   no
...

I can extract the list of languages per unique text:

df.groupby('text').lang.apply(list)

and that gives me a result like this one:

text_a -> [es, en, it, fr]
text_b -> [es, it, de]
text_c -> [es, nl, it]
text_d -> [fr, no, de, pt]

Now, from this result, how can i filter all the texts that appear in the same N languages? For example, for spanish and french the desired result would be all the rows from the initial dataframe where all seleted text values also have 'es' and 'fr' on the lang column.

text     lang
--------------
text_a   fr
text_b   es
text_a   es
text_b   es
text_b   fr
text_c   fr
text_d   es
...

The output contains all texts that have a row with 'es' and a row with 'fr' and only those two appear in the output. The isin() function will not work here.

Thanks in advance.

Fran Casadome
  • 508
  • 4
  • 15

1 Answers1

1

Approach 1

Query the dataframe to filter the rows where the corresponding language is one of en, es, then group the filtered dataframe on text and transform lang column using nunique to get the counts of unique values, now compare the count to length of the defined set to create a boolean mask

lang = {'en', 'es'}
df_lang = df.query('lang in @lang')
mask = df_lang.groupby('text')['lang'].transform('nunique').eq(len(lang))

df_lang[mask]

Approach 2

Query to filter the rows then group the frame by text and aggregate lang using set, now map the aggregate column using set.issubset to test the membership of given languages

lang = {'en', 'es'}
df_lang = df.query('lang in @lang')
mask = df_lang.groupby('text')['lang'].agg(set).map(lang.issubset)

df_lang[df_lang['text'].isin(mask[mask].index)]

Result for {'en', 'es'}

     text lang
0  text_a   en
2  text_a   es
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • Thank you for your answer Shubham. it has the same problem that the one provided by Anurag below: i only want the specified languages in the output (both of them). Let's see if this rephrase works: give me all the texts for BOTH spanish and french from the input dataframe – Fran Casadome Jun 28 '21 at 14:23
  • @FranCasadome The proposed approach here is exactly doing that. – Shubham Sharma Jun 28 '21 at 14:26
  • No... I get extra languages other than the ones i specify. In your example for 'en' and 'es' you also get 'it' – Fran Casadome Jun 28 '21 at 14:29
  • 1
    @FranCasadome Added another approach which should be more efficient. – Shubham Sharma Jun 28 '21 at 14:54