I have some groundwater data, that I am trying to analyze. For every well, there is a description of the lithology surrounding the filter. My goal is to divide all the wells into main categories based on the dominant lithology type, so that I can make separate histograms for all the lithologies (based on concentrations of fluoride in the water).
The problem I have is that I want to convert the lithology column from example: bk: 86%, lk: 14% --> bk. So, I want to write a code that can recognize the biggest contributor from an example such as this - the problem is I have 9000 wells to do this for, and there are many different lithologies with different contributions.
Another problem I have, is that for some of the wells there are 2 lithology types with equal contributions - my idea here was to create one entry for each, with same Well ID.
However, I am really lost? If anyone can point me in the right direction it would be much appreciated.
My code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#read data from excel
df1 = pd.read_excel(r"data1.xlsx", 'sheet2')
df2 = pd.read_excel(r"data2.xlsx", 'sheet2')
#merges dataframes
df = pd.concat([df1, df2])
df['INDTAG_LITHOLOGI'].replace('', np.nan, inplace=True) #replace empty fields with 'not a number'
df.dropna(subset=['INDTAG_LITHOLOGI'], inplace=True) #removes all rows with NaN
My idea was to use the max() function in some way if I can make python recognize each entry as a list of values (don't know if it is already a list when it is in a dataframe or simply a string). And in some way use the pandas replace function to reach my final goal, however there are a bunch of steps in between that I have no idea how to achieve. Every entry is in the form xx 57%, yy 32%,...
Any help is much appreciated, I am so lost on this atm.
EDIT: My dataframe looks something like this:
df = pd.DataFrame([['1. 172', 'hs: 88%, hi: 13%'], ['1. 174', 'hs:100%'], ['5. 129', 'ds: 96%, l : 4%']], columns=['WELL_ID', 'LITHOLOGY'])