
I have a DataFrame (test_df) that looks like this:

Year    Month   TAGS
2019    5   A, B
2019    5   A, C
2019    5   A
2019    5   
2019    5   B, C, D
2019    5   C, E

I would like to split each TAGS string on commas and get a single list of tags, stacked vertically like this:

A
B
A
C
A
B
C
D
C
E

I used two for loops to build the list of tags:

check = []
for j in range(len(test_df)):
    for i in range(len(test_df['TAGS'][j].split(', '))):
        check.append(test_df['TAGS'][j].split(', ')[i])
Is there a better way to get the TAGS list without using two for loops?

harvpan
kb123
  • Are your Tags a single character or really anything separated by a comma? – ALollz Jun 24 '19 at 18:50
  • Did you try `test_df['TAGS'].tolist()` – arajshree Jun 24 '19 at 18:50
  • Still iterating, but try `itertools.chain.from_iterable(s.split(', ') for s in df.TAGS if s is not None)`. Should be faster than your current approach – user3483203 Jun 24 '19 at 18:50
  • Also, can you verify. Do you have a Series of Lists `['A', 'B']` or a Series of strings `'A, B'`? If they're lists it's just `pd.Series(chain.from_iterable(df.Tags))` – ALollz Jun 24 '19 at 18:52
  • @arajshree, if I do test_df['TAGS'].tolist(), I get A,B \n A,C \n A \n B,C,D \n C,E but I expect A \n B \n A \n C \n A \n B \n C \n D \n C \n E – kb123 Jun 24 '19 at 18:54
  • @ALollz, It's not a single character but anything separated by comma – kb123 Jun 24 '19 at 18:55
  • are you looking for `df["TAGS"].str.split(", ").apply(pd.Series).stack().reset_index(drop=True)`? – pault Jun 24 '19 at 18:57
  • @pault, this is it.. Your suggestion worked – kb123 Jun 24 '19 at 19:00
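
For reference, the `itertools.chain.from_iterable` suggestion from the comments can be written out as a runnable sketch. The sample frame below is rebuilt from the question, and the empty TAGS cell is assumed to be `None` (an assumption; if it were an empty string, the filter would need to change). `dropna()` stands in for the `if s is not None` check.

```python
from itertools import chain

import pandas as pd

# Sample data mirroring the question; the empty TAGS cell is assumed to be None.
test_df = pd.DataFrame({
    "Year": [2019] * 6,
    "Month": [5] * 6,
    "TAGS": ["A, B", "A, C", "A", None, "B, C, D", "C, E"],
})

# Skip missing cells, split each remaining string on ", ",
# and flatten the resulting lists into one list of tags.
tags = list(chain.from_iterable(s.split(", ") for s in test_df["TAGS"].dropna()))
print(tags)
```

This still iterates in Python, but each string is split only once instead of once per tag.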

1 Answer


IIUC, you can first split the TAGS column on ", ":

df["TAGS"].str.split(", ")
#0       [A, B]
#1       [A, C]
#2          [A]
#3         None
#4    [B, C, D]
#5       [C, E]

Then adapt the code from this answer to get your final output:

df["TAGS"].str.split(", ").apply(pd.Series).stack().reset_index(drop=True)
#0    A
#1    B
#2    A
#3    C
#4    A
#5    B
#6    C
#7    D
#8    C
#9    E
pault
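
As a possible alternative to the `apply(pd.Series).stack()` step, pandas 0.25 and later provide `Series.explode`, which turns each list element into its own row. This is a different technique from the one in the answer, sketched here under the assumption that the empty TAGS cell is a missing value (`None`/`NaN`) rather than an empty string.

```python
import pandas as pd

# Sample data mirroring the question; the empty TAGS cell is assumed to be None.
df = pd.DataFrame({
    "Year": [2019] * 6,
    "Month": [5] * 6,
    "TAGS": ["A, B", "A, C", "A", None, "B, C, D", "C, E"],
})

stacked = (
    df["TAGS"]
    .str.split(", ")         # each string becomes a list of tags
    .explode()               # one row per list element (requires pandas >= 0.25)
    .dropna()                # drop the row that came from the missing cell
    .reset_index(drop=True)  # renumber rows 0..n-1
)
print(stacked.tolist())
```

Calling `.tolist()` on the result gives a plain Python list if a Series is not needed.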