Can Pandas recognize biggest contributor from a string and replace the entry with the name of that contributor?

Question

I have some groundwater data, that I am trying to analyze. For every well, there is a description of the lithology surrounding the filter. My goal is to divide all the wells into main categories based on the dominant lithology type, so that I can make separate histograms for all the lithologies (based on concentrations of fluoride in the water).

The problem I have is that I want to convert the lithology column from example: bk: 86%, lk: 14% --> bk. So, I want to write a code that can recognize the biggest contributor from an example such as this - the problem is I have 9000 wells to do this for, and there are many different lithologies with different contributions.

Another problem I have, is that for some of the wells there are 2 lithology types with equal contributions - my idea here was to create one entry for each, with same Well ID.

However, I am really lost? If anyone can point me in the right direction it would be much appreciated.

My code looks like this:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#read data from excel
df1 = pd.read_excel(r"data1.xlsx", 'sheet2')
df2 = pd.read_excel(r"data2.xlsx", 'sheet2')
#merges dataframes
df = pd.concat([df1, df2])

df['INDTAG_LITHOLOGI'].replace('', np.nan, inplace=True) #replace empty fields with 'not a number'
df.dropna(subset=['INDTAG_LITHOLOGI'], inplace=True) #removes all rows with NaN

My idea was to use the max() function in some way if I can make python recognize each entry as a list of values (don't know if it is already a list when it is in a dataframe or simply a string). And in some way use the pandas replace function to reach my final goal, however there are a bunch of steps in between that I have no idea how to achieve. Every entry is in the form xx 57%, yy 32%,...

Any help is much appreciated, I am so lost on this atm.

EDIT: My dataframe looks something like this:

df = pd.DataFrame([['1.  172', 'hs: 88%, hi: 13%'], ['1.  174', 'hs:100%'], ['5.  129', 'ds: 96%, l :  4%']], columns=['WELL_ID', 'LITHOLOGY'])

Can you add an example of what the data looks like (see https://stackoverflow.com/a/20159305/20121320) — Florent Monin, Feb 23 '23 at 10:09
Ahh, knew I had forgotten something... I added an edit with an example now! — Aslak Holm Brunvand, Feb 23 '23 at 10:23
Looks like you will need to parse the the column `LITHOLOGY`, so for instance, to parse it into a dict, if `litho` is the string like `"ds: 96%, l : 4%"`, you can split it into the different components, and parse all the components: `d = {value.split(":")[0].strip() : int(value.split(":")[1].strip().removesuffix("%")) for value in litho.split(",")}` (this is a terribly ugly one-liner, but will produce a dictionary) — Florent Monin, Feb 23 '23 at 10:39

user19077881 · Answer 1 · 2023-02-23T17:25:07.603

0

You could use the fact that your string is in dictionary format as follows:

def func(s):
    try:
        d = dict((a.strip(), int(b.replace('%','').strip())) for a, b in (part.split(':') for part in s.split(', ')))
        return max(d, key=d.get)
    except:
        return 'error'

df['maxLith'] = df['LITHOLOGY'].map(func)

print(df)

which gives:

   WELL_ID         LITHOLOGY maxLith
0  1.  172  hs: 88%, hi: 13%      hs
1  1.  174           hs:100%      hs
2  5.  129  ds: 96%, l :  4%      ds

If there are rows with an error in the format which does not convert then these rows can be shown using:

print(df[df['maxLith'] == 'error'])

edited Feb 23 '23 at 17:25

answered Feb 23 '23 at 10:38

user19077881

3,643
2
3
14

That looks really solid! Thank you. I am a bit unsure of what: for a, b in ((part.split(': ') for part in s.split(', ')). I am also getting "ValueError: not enough values to unpack (expected 2, got 1)". Some of my Lithology entries have more than 2 lithologies, would this be a problem? – Aslak Holm Brunvand Feb 23 '23 at 11:22
It will deal with any number of entries in the format `name: number%`. The two splits firstly split the 'entries' separated by a comma and then then name-number separated by the colon. It looks like you have a data content problem (which is not unusual) - I guess that at least one of your entries is not in the same format so can't be split and converted in this way. You need to find out what that/those entries are so the code can reflect (or convert the entry to the format). You could write some code to search for non-compliant entries. – user19077881 Feb 23 '23 at 11:42
Code now modified to identify and deal with errors in format by putting 'error' in the new column. These rows can then be examined and corrected. – user19077881 Feb 23 '23 at 17:26

Can Pandas recognize biggest contributor from a string and replace the entry with the name of that contributor?

1 Answers1