1

<style type="text/css">
 table.tableizer-table {
  font-size: 12px;
  border: 1px solid #CCC; 
  font-family: Arial, Helvetica, sans-serif;
 } 
 .tableizer-table td {
  padding: 4px;
  margin: 3px;
  border: 1px solid #CCC;
 }
 .tableizer-table th {
  background-color: #104E8B; 
  color: #FFF;
  font-weight: bold;
 }
</style>
<table class="tableizer-table">
<thead><tr class="tableizer-firstrow"><th>Name</th><th>Description</th><th>&nbsp;</th><th>Category</th><th>Sub Description</th><th>&nbsp;</th><th>Name</th><th>A</th><th>B</th><th>C</th><th>D</th></tr></thead><tbody>
 <tr><td>Alpha</td><td>B2 A1 C1 D3</td><td>&nbsp;</td><td>A</td><td>A1</td><td>&nbsp;</td><td>Alpha</td><td>A1</td><td>B2</td><td>C1</td><td>D3</td></tr>
 <tr><td>Bravo</td><td>C3 B2 A2 D2</td><td>&nbsp;</td><td>A</td><td>A2</td><td>&nbsp;</td><td>Bravo</td><td>A2</td><td>B2</td><td>C3</td><td>D2</td></tr>
 <tr><td>Charlie</td><td>A2 B1 D1 C2</td><td>&nbsp;</td><td>B</td><td>B1</td><td>&nbsp;</td><td>Charlie</td><td>A2</td><td>B1</td><td>C2</td><td>D1</td></tr>
 <tr><td>Delta</td><td>A1 B1 C3 D3</td><td>&nbsp;</td><td>B</td><td>B2</td><td>&nbsp;</td><td>Delta</td><td>A1</td><td>B1</td><td>C3</td><td>D3</td></tr>
 <tr><td>Echo</td><td>A1 C3 B2 D1</td><td>&nbsp;</td><td>C</td><td>C1</td><td>&nbsp;</td><td>Echo</td><td>A1</td><td>B2</td><td>C3</td><td>D1</td></tr>
 <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>C</td><td>C2</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
 <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>C</td><td>C3</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
 <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>D</td><td>D1</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
 <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>D</td><td>D2</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
 <tr><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>D</td><td>D3</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td></td></tr>
</tbody></table>

Using some HTML script to display my tables:

Hopefully that code runs so that you can see the input and desired output.. The left two tables are inputs and the right table is the desired output.

Essentially I am looking for a way to bucket text inside a string ("Description") into predefined buckets ("Category"). The Categories with then be the columns in the output with the relevant Sub-Categories in the cell below.

Does that make sense? I can perform this action in excel, but I am getting to the limits of my machine due to runs with 100k+ "Names" and 500+ "Sub Categories".

  • This is not a good example -- the left-most table (table 1 of input) and the right-most table (output table) are exactly the same, save for the fact that "Description" from the left is split into columns on the right (i.e. it's not clear what the purpose of the input table 2 is). Also, please take the time to read [how to make your pandas example reproducible](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – cmaher Mar 22 '18 at 19:02
  • How do you define the buckets? Are they based on a single white space, or are they solely based on the sub description in the second table? What would you want to happen if you get text that doesn't belong to any bucket, or if a description lacks text for a given category? – ALollz Mar 22 '18 at 19:09
  • 1
    @cmaher thank you for the feedback. In reality the Description string will not be organized as I originally had it. I have updated to better represent reality. I will look into the link you added. – Andrew Shoe Mar 22 '18 at 19:14
  • @ALollz the buckets are predefined and known in another sheet. An analogy could be a "category" being fruit, and the sub-descriptions being apple, cherry, melon. For now, if the text doesn't belong to any bucket (which is possible with my real data), that text does not need to be bucketed and can be discarded. – Andrew Shoe Mar 22 '18 at 19:18

1 Answers1

0

You can use a split, map and some dataframe reshaping:

df_out = df1.set_index('Name')['Description'].str.split(expand=True).stack().to_frame()

df_out['cols'] = df_out[0].map(df2.set_index('Sub')['Category'])

df_out = df_out.reset_index(level=1, drop=True).set_index('cols', append=True)[0].unstack()

Output:

cols      A   B   C   D
Name                   
Alpha    A1  B2  C1  D3
Bravo    A2  B2  C3  D2
Charlie  A2  B1  C2  D1
Delta    A1  B1  C3  D3
Echo     A1  B2  C3  D1
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Thank you Scott. This seems to be a step in the right direction. The third line of the code does not seem to do anything - I commented it out and got the same result. I am ending up with this: https://imgur.com/a/JBAdF The unstack function doesn't seem to be working. I am very new to python, so I apologize if I am missing something obvious. Thank you again. – Andrew Shoe Mar 23 '18 at 12:14
  • So I had my index=False on the output to excel command. Removing that, I get the following: https://imgur.com/a/3yhp9 Last line still appears to not impact the results. – Andrew Shoe Mar 23 '18 at 15:00
  • Oh... you need to add 'df_out =' infront of the third line. – Scott Boston Mar 23 '18 at 15:02
  • The third line is not an inplace operation hence you need to assign it back to the df_oit variable. – Scott Boston Mar 23 '18 at 15:03
  • @andrewshoe See updated solution... typically most understand this needs to be done. Sorry for the confusion. – Scott Boston Mar 23 '18 at 15:05
  • 1
    That was it!! Thanks so much for helping a newbie! – Andrew Shoe Mar 23 '18 at 15:44