
I have a dataframe as:

Filtered_data

['defence possessed russia china','factors driving china modernise']
['force bolster pentagon','strike capabilities pentagon congress detailing china']
['missiles warheads', 'deterrent face continued advances']
......
......

I just want to split each list element into sub-elements (tokenized words). The output I am looking for is:

Filtered_data

[defence, possessed,russia,factors,driving,china,modernise]
[force,bolster,strike,capabilities,pentagon,congress,detailing,china]
[missiles,warheads, deterrent,face,continued,advances]

Here is the code I have tried:

for index, text in df['Filtered_data'].iteritems():
    for sublist in text:          # each row holds a list of strings
        for word in sublist.split():
            print(word)
James
  • Why are there downvotes? I'm new to Python. Sorry if it's a silly question to ask here – James Jun 27 '18 at 09:11
  • The downvotes are not because the question is silly (which it is not), but because [you do not provide sufficient information](https://stackoverflow.com/help/mcve). We have to guess your data structure, which makes the question ambiguous. – Mr. T Jun 27 '18 at 09:14
  • Another reason is that you need to add the code you tried to your question. – jezrael Jun 27 '18 at 09:18

2 Answers


Use a list comprehension with split and flattening:

df['Filtered_data'] = df['Filtered_data'].apply(lambda x: [z for y in x for z in y.split()])
print (df)
                                       Filtered_data
0  [defence, possessed, russia, china, factors, d...
1  [force, bolster, pentagon, strike, capabilitie...
2  [missiles, warheads, deterrent, face, continue...
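To make the flattening comprehension easier to follow, here is the same logic unrolled into an explicit nested loop (a sketch for clarity; the helper name `flatten_and_split` is my own):

```python
import pandas as pd

df = pd.DataFrame({'Filtered_data': [
    ['defence possessed russia china', 'factors driving china modernise'],
]})

def flatten_and_split(lists):
    # Equivalent to [z for y in lists for z in y.split()]
    words = []
    for sentence in lists:            # each element is a whitespace-joined string
        for word in sentence.split():
            words.append(word)
    return words

df['Filtered_data'] = df['Filtered_data'].apply(flatten_and_split)
print(df.iloc[0]['Filtered_data'])
# ['defence', 'possessed', 'russia', 'china', 'factors', 'driving', 'china', 'modernise']
```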

EDIT:

For unique values, the standard way is to use set:

df['Filtered_data'] = df['Filtered_data'].apply(lambda x: list(set([z for y in x for z in y.split()])))
print (df)
                                       Filtered_data
0  [russia, factors, defence, driving, china, mod...
1  [capabilities, detailing, china, force, pentag...
2  [deterrent, advances, face, warheads, missiles...

But if the ordering of values is important, use pandas.unique:

df['Filtered_data'] = df['Filtered_data'].apply(lambda x: pd.unique([z for y in x for z in y.split()]).tolist())
print (df)
                                       Filtered_data
0  [defence, possessed, russia, china, factors, d...
1  [force, bolster, pentagon, strike, capabilitie...
2  [missiles, warheads, deterrent, face, continue...
jezrael
    @James - only add `set` like `list(set([z for y in x for z in y.split()]))` – jezrael Jun 27 '18 at 10:05
  • need your help on this :- `https://stackoverflow.com/questions/51574485/match-keywords-in-pandas-column-with-another-list-of-elements`. I'm not getting the solution mentioned – James Jul 28 '18 at 20:47

You can use itertools.chain + toolz.unique. The benefit of toolz.unique over set is that it preserves ordering.

from itertools import chain
from toolz import unique

df = pd.DataFrame({'strings': [['defence possessed russia china','factors driving china modernise'],
                               ['force bolster pentagon','strike capabilities pentagon congress detailing china'],
                               ['missiles warheads', 'deterrent face continued advances']]})

df['words'] = df['strings'].apply(lambda x: list(unique(chain.from_iterable(i.split() for i in x))))

print(df.iloc[0]['words'])

['defence', 'possessed', 'russia', 'china', 'factors', 'driving', 'modernise']
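If you would rather avoid the toolz dependency, a stdlib-only sketch using dict.fromkeys (whose insertion-order guarantee on Python 3.7+ makes it a drop-in stand-in for toolz.unique here) gives the same result:

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({'strings': [
    ['defence possessed russia china', 'factors driving china modernise'],
]})

# dict.fromkeys deduplicates while keeping first-seen order (Python 3.7+)
df['words'] = df['strings'].apply(
    lambda x: list(dict.fromkeys(chain.from_iterable(i.split() for i in x)))
)

print(df.iloc[0]['words'])
# ['defence', 'possessed', 'russia', 'china', 'factors', 'driving', 'modernise']
```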
jpp