I have this fossil data and want to create a new column with a unique value for each unique occurrence in the following columns:

GENUS = (['Microtherium', 'Bachitherium', 'Coelodonta', ..., 'Murina',
   'Boopsis', None], dtype=object)
SPECIES = (['Microtherium', 'Bachitherium', 'Coelodonta', ..., 'Murina',
   'Boopsis', None], dtype=object)

#dropping the duplicates
dffossil[['GENUS', 'SPECIES']].drop_duplicates

Now I want a new column with a unique number for each unique GENUS and SPECIES combination.

CrazyChucky
EnForNP
  • Do you want a unique number (i.e. integer) for each combination, or just a unique identifier? If you want a unique identifier you could easily try `hash(Genus_String + SPECIES_str)` to create a hash value of each combination within the df. – itprorh66 Jul 23 '22 at 14:10
  • Is this pseudocode, just to show what your columns are? Displaying an actual (small) DataFrame would be more helpful, and make this a [mre]. (Also, don't forget the parentheses when calling `drop_duplicates()`...) – CrazyChucky Jul 23 '22 at 15:03
  • Also relevant: [Pandas-specific advice for minimal reproducible examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – CrazyChucky Jul 23 '22 at 15:06
  • Does this answer your question? [How to create a unique identifier based on multiple columns?](https://stackoverflow.com/questions/62918481/how-to-create-a-unique-identifier-based-on-multiple-columns) – JonasV Sep 01 '22 at 13:03

1 Answer

If you simply want a unique identifier for each combination of GENUS and SPECIES, you can do the following.

Note: I have assumed that either GENUS or SPECIES can contain a None value, which complicates the process slightly.

Given a DataFrame of the form:

    GENUS           SPECIES
0   Murina          Coelodonta
1   Murina          Microtherium
2   Microtherium    Murina
3   Bachitherium    Microtherium
4   Coelodonta      None
5   Coelodonta      Coelodonta
6   Microtherium    Coelodonta
7   Microtherium    Murina
8   Microtherium    Bachitherium
9   Murina          Microtherium

Add a column that uniquely identifies each combination of GENUS and SPECIES; we call this column 'ID'.

First, define a function that hashes a pair of entries, taking into account the possibility of a None entry.

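def hashValues(g, s):
    # Map missing entries to the string "None" so they can be concatenated.
    if g is None:
        g = "None"
    if s is None:
        s = "None"
    # Python salts str hashes per process, so IDs differ between runs
    # unless PYTHONHASHSEED is fixed.
    return hash(g + s)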
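
A side note on the function above: the built-in `hash()` gives different values for the same string in different Python processes, and plain concatenation cannot tell `"ab" + "c"` from `"a" + "bc"`. If the IDs need to survive a restart, something along these lines might work. This is only a sketch, assuming `hashlib` from the standard library is acceptable; `stable_hash_values` and the separator character are hypothetical choices, not part of the original answer.

```python
import hashlib


def stable_hash_values(g, s):
    # Hypothetical helper: same None handling as hashValues.
    g = "None" if g is None else g
    s = "None" if s is None else s
    # A separator keeps ("ab", "c") and ("a", "bc") from building the same key.
    key = f"{g}\x1f{s}".encode("utf-8")
    # Use the first 8 bytes of the SHA-256 digest as a signed 64-bit integer,
    # which fits in a pandas int64 column.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big", signed=True)
```

It takes the same two arguments as `hashValues`, so it can be swapped in wherever that function is called. The rest of this answer keeps using `hashValues`.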

To add the column use the following:

df['ID'] = [hashValues(g, s) for g, s in zip(df['GENUS'], df['SPECIES'])]

which yields:

    GENUS           SPECIES         ID
0   Murina          Coelodonta      -6583287505830614713
1   Murina          Microtherium    6019734726691011903
2   Microtherium    Murina          -2318069015748475190
3   Bachitherium    Microtherium    5795352218934423262
4   Coelodonta      None            4851538573581845777
5   Coelodonta      Coelodonta      -5115794138222494493
6   Microtherium    Coelodonta      2603682196287415014
7   Microtherium    Murina          -2318069015748475190
8   Microtherium    Bachitherium    -2746445536675711990
9   Murina          Microtherium    6019734726691011903
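
If you would rather have small integers than hash values (the question asks for a "unique number"), pandas can number the groups directly. The following is a minimal sketch, assuming the example DataFrame above and that replacing None with the placeholder string "None" is acceptable; the `keys` variable is just a name chosen for illustration.

```python
import pandas as pd

# Rebuild the example DataFrame from this answer.
df = pd.DataFrame({
    "GENUS": ["Murina", "Murina", "Microtherium", "Bachitherium", "Coelodonta",
              "Coelodonta", "Microtherium", "Microtherium", "Microtherium", "Murina"],
    "SPECIES": ["Coelodonta", "Microtherium", "Murina", "Microtherium", None,
                "Coelodonta", "Coelodonta", "Murina", "Bachitherium", "Microtherium"],
})

# Replace missing values with a placeholder so those rows are grouped too
# (by default groupby drops rows whose key is missing).
keys = df[["GENUS", "SPECIES"]].fillna("None")

# ngroup() assigns one integer per (GENUS, SPECIES) group, from 0 to n_groups - 1.
df["ID"] = keys.groupby(["GENUS", "SPECIES"], sort=False).ngroup()
```

Rows with the same GENUS and SPECIES get the same ID, and the IDs are reproducible between runs. With `sort=False` the numbering should follow the order in which each combination first appears.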
itprorh66