2

I have two DataFrames:

  • df_components: list of unique components (ID, DESCRIPTION)
  • dataset: several rows and columns from a CSV (one of these columns contains the description of a component).

I need to create a new column in the dataset with the ID of the component according to the df_components.

I tried to do this way:

Creating the df_components and the ID column based on the index

components = dataset["COMPDESC"].unique()
df_components = pd.DataFrame(components, columns=['DESCRIPTION'])
df_components.sort_values(by='DESCRIPTION', ascending=True, inplace=True)
df_components.reset_index(drop=True, inplace=True)
df_components.index += 1
df_components['ID'] = df_components.index

Sample output:

                                           DESCRIPTION   ID
1                                             AIR BAGS    1
2                                     AIR BAGS:FRONTAL    2
3               AIR BAGS:FRONTAL:SENSOR/CONTROL MODULE    3
4                                 AIR BAGS:SIDE/WINDOW    4

Create the COMP_ID in the dataset:

def create_component_id_column(row):
    found = df_components[df_components['DESCRIPTION'] == row['COMPDESC']]
    return found.ID if len(found.index) > 0 else None

dataset['COMP_ID'] = dataset.apply(lambda row: create_component_id_column(row), axis=1)

However this gives me the error ValueError: Wrong number of items passed 248, placement implies 1. Being 248 the number of items on df_components.

How can I create this new column with the ID from the item found on df_components?

juliano.net
  • 7,982
  • 13
  • 70
  • 164

1 Answers1

2

Your logic seems overcomplicated. Since you are currently creating df_components from dataset, a better idea would be to use Categorical Data with dataset. This means you do not need to create df_components.

Step 1

Convert dataset['COMPDESC'] to categorical.

dataset['COMPDESC'] = dataset['COMPDESC'].astype('category')

Step 2

Create ID from categorical codes. Since categories are alphabetically sorted by default and indexing starts from 0, add 1 to the codes.

dataset['ID'] = dataset['COMPDESC'].cat.codes + 1

If you wish, you can extract the entire categorical mapping to a dictionary:

cat_map = dict(enumerate(dataset['COMPDESC'].cat.categories))

Remember that there always be a 1-offset if you want your IDs to begin at 1. In addition, you will need to update 'ID' explicitly every time 'DESCRIPTION' changes.

Advantages of using categorical data

  • Memory efficient: strings are only stored once.
  • Structure: you define the categories and have an automatic layer of data validation.
  • Consistent: since category to code mappings are always 1-to-1, they will always be consistent, even when new categories are added.
jpp
  • 159,742
  • 34
  • 281
  • 339
  • Thanks for your answer, @jpp. I understood the advantages of using categorial data, but for learning purposes can you point what I'm missing in my code that is causing the error? I'll change my code to the way you suggested, but I really wanted to understand the cause of this error. – juliano.net Apr 28 '18 at 23:51
  • @JulianoNunesSilvaOliveira, I'll be honest, I'm not smart enough to instantly visualize the effects of your code. If you can build a dataframe which demonstrates the problem (with 10 rows, not 248), and [edit](https://stackoverflow.com/posts/50081500/edit) this into your question, then I can probably debug for you. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) if you need help with this. – jpp Apr 29 '18 at 00:01
  • 1
    I wasn't able to make `pd.read_clipboard(sep='\s\s+')` work on http://notebooks.azure.com, so I'll leave the search for the cause of this error for another moment. Thanks for your help. – juliano.net Apr 30 '18 at 12:26