1

I have a pandas dataframe with a pipe delimited column with an arbitrary number of elements, called Parts. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements contained in all pipe-strings is not much smaller than the number of rows (which makes it impossible for me to manually specify all of them while creating new columns).

For each row, I want to create a new column that acts as an indicator variable for each element of the pipe delimited list. For instance, if the row

...'Parts'...

...'12|34|56'

should be transformed to

...'Part_12' 'Part_34' 'Part_56'...

...1 1 1...

Because they are a lot of unique parts, these columns are obviously going to be sparse - mostly zeros since each row only contains a small fraction of unique parts.

I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries). I've also looked at pandas' melt, but I don't think that's the appropriate tool.

The way I know how to solve it would be to pipe the raw CSV to another python script and deal with it on a char-by-char basis, but I need to work within my existing script since I will be processing hundreds of CSVs in this manner.

Here's a better illustration of the data

ID YEAR AMT PARTZ

1202 2007 99.34

9321 1988 1012.99 2031|8942

2342 2012 381.22 1939|8321|Amx3

3pitt
  • 899
  • 13
  • 21

1 Answers1

2

You can use get_dummies and add_prefix:

df.Parts.str.get_dummies().add_prefix('Part_')

Output:

   Part_12  Part_34  Part_56
0        1        1        1

Edit for comment and counting duplicates.

df = pd.DataFrame({'Parts':['12|34|56|12']}, index=[0])
pd.get_dummies(df.Parts.str.split('|',expand=True).stack()).sum(level=0).add_prefix('Part_')

Output:

   Part_12  Part_34  Part_56
0        2        1        1
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • this is really elegant. However, it seems to not count duplicates (eg 12|12|34) – 3pitt Dec 28 '17 at 19:40
  • @MikePalmice `pd.get_dummies(df.Parts.str.split('|',expand=True).stack()).sum(level=0).add_prefix('Part_')` – Scott Boston Dec 28 '17 at 19:45
  • the two solutions produce dataframes with different dimensions - specifically the second one has 25% as many rows. the indices are 0,3,6,7,9 etc – 3pitt Dec 28 '17 at 20:37
  • Would you mind starting a new question with more roboust data for testing? – Scott Boston Dec 28 '17 at 20:38
  • No problem, here's the link-https://stackoverflow.com/questions/48014555/make-new-pandas-columns-based-on-pipe-delimited-column-with-possible-repeats. I noticed that the excluded indices have empty values for `Parts`. so the only 'problem' with your revised solution is that it produces NaN instead of 0. However it should be easy to replace NaN with 0 in all columns that begin with 'Part_'. (it was my mistake to examine the result in Excel instead of from the shell!) – 3pitt Dec 28 '17 at 21:23