0

So I'm trying to take a dataframe like this (for example):

ID | reason_for_rejection
--------------------------
1  |   invalid insurance
2  |   behavior issues
3  |   not enough money
4  |   no space in hospital
5  |   anger issues
...

and, using a hand-written mapping (for example {financial: [invalid insurance, not enough money], patient problems: [behavior issues, anger issues]...} create a new column containing the mapped values and turn this into:

ID | reason_for_rejection    | reason_for_rejection_grouped
---------------------------------------------------------------
1  |   invalid insurance     | financial
2  |   behavior issues       | patient problems
3  |   not enough money      | financial
4  |   no space in hospital  | occupancy
5  |   anger issues          | patient problems
...

So while the 'reason_for_rejection' column will have a lot of unique values, I want to use some kind of a mapping that maps those unique values into 7 or 8 unique values in 'reason_for_rejection_grouped'.

I considered using a dictionary here, but the key would be a value in 'reason_for_rejection_grouped' and the values would be values in 'reason_for_rejection', so then I'd have to get the key based off the value which would be computationally expensive (and I have a really big dataset to look at).

Any guidance or suggestions would be super helpful!

Scorks
  • 5
  • 5
  • Why don't you build a `{reason_for_rejection: reason_for_rejection_grouped}` dictionary? Like `{'invalid insurance': 'financial', 'not enough money': 'financial', ...}` – mozway Sep 07 '21 at 19:49
  • @mozway I considered that, but I'm trying to keep the code as short and concise as possible - since there are so many unique values in 'reason_for_rejection', the dictionary would be super long – Scorks Sep 07 '21 at 19:53
  • This would probably require an nlp package and a lot of processing work in order to categorize the data. `spacy` is a good start. – Mohammad Sep 07 '21 at 20:01
  • @Mohammad thanks for the suggestion! I'm planning on writing the mapping myself - I've never used spacy before. What would that be able to help with? – Scorks Sep 07 '21 at 20:06
  • @Scorks that won't be much different in size than the other order but much more efficient. You can use an int as value to save a bit a space, then you'll map it to categorical – mozway Sep 07 '21 at 20:14
  • @Scorks - you can't write an exhaustive mapping for your problem if the input is vast or unknown. State of the art for this kind of problems: a pretrained [sentence embedding transformer](https://huggingface.co/transformers/model_doc/roberta.html) and train the last layer with a sample of your labels. – Michael Szczesny Sep 07 '21 at 20:15
  • @Scorks - if you have your proposed `reason_for_rejection_grouped : list(reason_for_rejection)` dictionary, you are done. No programming required. You can [invert](https://stackoverflow.com/questions/35491223/inverting-a-dictionary-with-list-values) the dictionary to get a mapping from `reason_for_rejection : reason_for_rejection_grouped` – Michael Szczesny Sep 07 '21 at 20:21
  • @MichaelSzczesny There are about 5000 rows, so not necessarily vast or unknown. I thought it may be better done by hand rather than through another course of action. – Scorks Sep 07 '21 at 21:42

0 Answers0