I'm looking to make a list of toneless pinyin combinations/permutations.

import pandas as pd

data = pd.read_csv('chinese_tones.txt', sep=" ", header=None)
data.columns = ["pinyin", "character"]
# Strip the tone numbers; regex=True is required in pandas >= 2.0
data['pinyin'] = data['pinyin'].str.replace(r'\d+', '', regex=True)
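
The tone-stripping step can be checked without the file. This is only a sketch: the numbered-pinyin rows below (`cang1`, `cao3`, ...) are an assumed stand-in for what `chinese_tones.txt` contains, and the name `sample` is hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for chinese_tones.txt (assumed format: "cang1 仓").
sample = pd.DataFrame({
    "pinyin": ["cang1", "cang2", "cao1", "cao2", "cao3"],
    "character": ["仓", "藏", "操", "曹", "草"],
})

# Remove every run of digits, leaving toneless pinyin.
sample["pinyin"] = sample["pinyin"].str.replace(r"\d+", "", regex=True)
print(sample["pinyin"].tolist())
```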

The current format of the data is:

| pinyin | character |
|--------|-----------|
| cang   | 仓        |
| cang   | 藏        |
| cao    | 操        |
| cao    | 曹        |
| cao    | 草        |

The expected result would be a list like:

cangcang
cangcao
caocang
caocao

I can dedupe and clean the list myself. I just need every combination of two pinyin syllables, in every order.

Alex

1 Answer

You can drop_duplicates and then use an outer addition (np.add.outer) to get every ordered pair. Here `df` is the DataFrame from the question (named `data` there).

import numpy as np
import pandas as pd

s = df['pinyin'].drop_duplicates().to_numpy()
pd.Series(np.add.outer(s, s).ravel())

#0    cangcang
#1     cangcao
#2     caocang
#3      caocao
#dtype: object
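
For comparison, the same ordered pairs can be produced in plain Python with `itertools.product`. This is a minimal sketch rather than part of the original answer; the sample frame is a hypothetical copy of the table in the question.

```python
from itertools import product

import pandas as pd

# Hypothetical sample mirroring the table in the question.
df = pd.DataFrame({
    "pinyin": ["cang", "cang", "cao", "cao", "cao"],
    "character": ["仓", "藏", "操", "曹", "草"],
})

# Unique syllables, in order of first appearance.
unique = df["pinyin"].drop_duplicates().tolist()

# product(..., repeat=2) yields every ordered pair, including (x, x).
pairs = ["".join(p) for p in product(unique, repeat=2)]
print(pairs)
```

The outer addition does the same pairing in one vectorised call, which tends to matter only once the list of unique syllables gets large.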

If you also want the original words in the result, prepend `s` to the flattened outer addition.

pd.Series(s.tolist() + np.add.outer(s, s).ravel().tolist())
#0        cang
#1         cao
#2    cangcang
#3     cangcao
#4     caocang
#5      caocao
#dtype: object

If you want to keep the individual components next to each combination, a merge does the same job without dropping down to NumPy. Call drop_duplicates again and assign a temporary constant key, so that merging on it pairs every row with every row (a cross join), then add the strings.

s = df[['pinyin']].drop_duplicates().assign(key=1)
res = s.merge(s, on='key').drop(columns='key')
res['combined'] = res['pinyin_x'] + res['pinyin_y']

#  pinyin_x pinyin_y  combined
#0     cang     cang  cangcang
#1     cang      cao   cangcao
#2      cao     cang   caocang
#3      cao      cao    caocao
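
On pandas 1.2 or later, the temporary key can be replaced by `merge(..., how='cross')`. The sketch below assumes that version and a hypothetical sample frame; `all_forms` is an illustrative name for the single combined list requested in the comments.

```python
import pandas as pd

# Hypothetical sample mirroring the table in the question.
df = pd.DataFrame({
    "pinyin": ["cang", "cang", "cao", "cao", "cao"],
    "character": ["仓", "藏", "操", "曹", "草"],
})

s = df[["pinyin"]].drop_duplicates()

# how="cross" pairs every row of s with every row of s (needs pandas >= 1.2).
res = s.merge(s, how="cross")
res["combined"] = res["pinyin_x"] + res["pinyin_y"]

# One flat list: the individual syllables followed by every ordered pair.
all_forms = s["pinyin"].tolist() + res["combined"].tolist()
print(all_forms)
```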
ALollz
  • This works great, thank you. Is there any way to include the individual components as well (i.e. cang and cao) in the new series? – Alex Jan 12 '21 at 19:32
  • thanks for the update. I was thinking one series or list for all of them. So "cang, cao, cangcao, caocang, etc..." Does that make sense? – Alex Jan 12 '21 at 19:49
  • @Alex oh I see, does that new addition solve your problem? – ALollz Jan 12 '21 at 19:54