Assuming the contaminants are always separated by commas in your data, you can use pandas.Series.str.split()
to get them into lists. Then you can get them into distinct rows with pandas.DataFrame.explode()
, which finally allows using the value_counts()
method.
For example:
import pandas as pd
data = pd.DataFrame({'File Number': [1, 2, 3, 4],
'CONTAMINANTS': ['ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE',
'CHLORINATED SOLVENTS',
'DIESEL, GASOLINE, ACENAPHTENE',
'GASOLINE, ACENAPHTENE']})
data
File Number CONTAMINANTS
0 1 ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE
1 2 CHLORINATED SOLVENTS
2 3 DIESEL, GASOLINE, ACENAPHTENE
3 4 GASOLINE, ACENAPHTENE
data['CONTAMINANTS'] = data['CONTAMINANTS'].str.split(pat=', ')
data_long = data.explode('CONTAMINANTS')
data_long['CONTAMINANTS'].value_counts()
ACENAPHTENE 3
GASOLINE 2
DIESEL 1
ANTHRACENE 1
BENZ-A-ANTHRACENE 1
CHLORINATED SOLVENTS 1
Name: CONTAMINANTS, dtype: int64
To categorize the contaminants, you could define a dictionary that maps them to types. Then you can use that dictionary to add a column of types to the exploded dataframe:
types = {'ACENAPHTENE': 1,
'GASOLINE': 2,
'DIESEL': 2,
'ANTHRACENE': 1,
'BENZ-A-ANTHRACENE': 1,
'CHLORINATED SOLVENTS': 3}
data_long['contaminant type'] = data_long['CONTAMINANTS'].apply(lambda x: types[x])
data_long
File Number CONTAMINANTS contaminant type
0 1 ACENAPHTENE 1
0 1 ANTHRACENE 1
0 1 BENZ-A-ANTHRACENE 1
1 2 CHLORINATED SOLVENTS 3
2 3 DIESEL 2
2 3 GASOLINE 2
2 3 ACENAPHTENE 1
3 4 GASOLINE 2
3 4 ACENAPHTENE 1