I want to count the occurrences of contaminants, but some cases have more than one contaminant, so when I use value_counts it counts each combination as a single value. For example, "Gasoline, Diesel = 8". How would I count them separately without doing it manually?

And would it be possible to create a function that would make it easier to categorize them into, let's say, 4 types of contaminant? I just need a clue or a simple explanation of what I need to do.

import pandas as pd

data = pd.read_csv('Data gathered.csv')
data

data['CONTAMINANTS'].value_counts().plot(kind = 'barh').invert_yaxis()
  • Does this answer your question? [Split (explode) pandas dataframe string entry to separate rows](https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows) – Trenton McKinney Apr 26 '21 at 01:31
  • **[Don't Post Screenshots](https://meta.stackoverflow.com/questions/303812/)**. Always provide a [mre], with **code, data, errors, current output, and expected output, as [formatted text](https://stackoverflow.com/help/formatting)**. It's likely the question will be down-voted and closed. You're discouraging assistance because no one wants to retype your data or code, and screenshots are often illegible. [edit] the question and **add text**. Please see [How to provide a reproducible copy of your DataFrame using `df.head(15).to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246). – Trenton McKinney Apr 26 '21 at 02:18

1 Answer

Assuming the contaminants are always separated by commas in your data, you can use pandas.Series.str.split() to get them into lists. Then you can get them into distinct rows with pandas.DataFrame.explode(), which finally allows using the value_counts() method.

For example:

import pandas as pd

data = pd.DataFrame({'File Number': [1, 2, 3, 4],
                     'CONTAMINANTS': ['ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE', 
                                      'CHLORINATED SOLVENTS', 
                                      'DIESEL, GASOLINE, ACENAPHTENE', 
                                      'GASOLINE, ACENAPHTENE']})
data
    File Number     CONTAMINANTS
0   1               ACENAPHTENE, ANTHRACENE, BENZ-A-ANTHRACENE
1   2               CHLORINATED SOLVENTS
2   3               DIESEL, GASOLINE, ACENAPHTENE
3   4               GASOLINE, ACENAPHTENE
data['CONTAMINANTS'] = data['CONTAMINANTS'].str.split(pat=', ')
data_long = data.explode('CONTAMINANTS')
data_long['CONTAMINANTS'].value_counts()
ACENAPHTENE             3
GASOLINE                2
DIESEL                  1
ANTHRACENE              1
BENZ-A-ANTHRACENE       1
CHLORINATED SOLVENTS    1
Name: CONTAMINANTS, dtype: int64
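
If the spacing around the commas in the real file is not consistent (an assumption, since the actual data is not shown), splitting on ', ' would leave variants such as 'DIESEL' and ' DIESEL' counted separately. A minimal sketch of a more tolerant version, using made-up sample values, splits on the bare comma and strips whitespace afterwards:

```python
import pandas as pd

# Hypothetical sample with inconsistent spacing around the commas
data = pd.DataFrame({'File Number': [1, 2, 3],
                     'CONTAMINANTS': ['GASOLINE,DIESEL',
                                      'GASOLINE ,  DIESEL',
                                      ' DIESEL']})

# Split on the bare comma, put each contaminant on its own row,
# then trim leading/trailing whitespace before counting
contaminants = (data['CONTAMINANTS']
                .str.split(',')
                .explode()
                .str.strip())

counts = contaminants.value_counts()
print(counts)
```

The resulting counts can be plotted the same way as in the question, for example with counts.plot(kind='barh').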

To categorize the contaminants, you could define a dictionary that maps them to types. Then you can use that dictionary to add a column of types to the exploded dataframe:

types = {'ACENAPHTENE': 1, 
         'GASOLINE': 2,
         'DIESEL': 2, 
         'ANTHRACENE': 1,
         'BENZ-A-ANTHRACENE': 1,
         'CHLORINATED SOLVENTS': 3}

data_long['contaminant type'] = data_long['CONTAMINANTS'].apply(lambda x: types[x])
data_long
    File Number     CONTAMINANTS            contaminant type
0   1               ACENAPHTENE             1
0   1               ANTHRACENE              1
0   1               BENZ-A-ANTHRACENE       1
1   2               CHLORINATED SOLVENTS    3
2   3               DIESEL                  2
2   3               GASOLINE                2
2   3               ACENAPHTENE             1
3   4               GASOLINE                2
3   4               ACENAPHTENE             1
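
One possible variation, sketched here with a made-up contaminant ('BENZENE') that is deliberately missing from the dictionary: pandas.Series.map() accepts the dictionary directly and returns NaN for values it cannot find, instead of raising a KeyError. That makes it easy to list what still needs a category and then to count rows per type:

```python
import pandas as pd

# Hypothetical exploded data; 'BENZENE' is intentionally not in the dictionary
data_long = pd.DataFrame({'File Number': [1, 1, 2, 3],
                          'CONTAMINANTS': ['ACENAPHTENE', 'ANTHRACENE',
                                           'DIESEL', 'BENZENE']})

types = {'ACENAPHTENE': 1, 'ANTHRACENE': 1, 'DIESEL': 2}

# map() looks each value up in the dictionary; missing keys become NaN
data_long['contaminant type'] = data_long['CONTAMINANTS'].map(types)

# Contaminants that have no category yet
unmapped = data_long.loc[data_long['contaminant type'].isna(),
                         'CONTAMINANTS'].unique()
print(unmapped)

# Number of rows per contaminant type (NaN rows are ignored by default)
type_counts = data_long['contaminant type'].value_counts()
print(type_counts)
```

Because of the NaN, the type column is stored as floats in this sketch; once every contaminant has an entry in the dictionary, the column stays integer.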
Arne
  • Hello, why am I getting an error '6'? `ACT = {'0': 'No Activity', '1A' : 'CONTAMINATION CONFIRMED', '1B' : 'CONTAMINATION CONFIRMED', '2A' :'INVESTIGATION', '2B': 'INVESTIGATION', '3':'CORRECTIVE ACTION PLANNING', '4': 'IMPLEMENT ACTION', '5': 'MONITOR ACTION', '6A':'ACTION COMPLETED', '6B':'ACTION COMPLETED', '6C': 'INACTIVE', '6D': 'INACTIVE' }` `data['STATUS'] = data['ACT-STATUS'].apply(lambda x: ACT[x])` – geo_codernoob Apr 29 '21 at 01:22
  • Hi @geo_codernoob - maybe one of the values in `data['ACT-STATUS']` is `'6'`? Then the `apply()` method would try to look up `ACT['6']`, which you haven't defined. – Arne Apr 29 '21 at 13:29
  • Hi, can you explain what the lambda does? – geo_codernoob May 04 '21 at 14:56
  • It's a shortcut way to define a function, useful if you only need the function in one place and it does not need a name. In this case, `df.apply()` expects a function as its argument, but `types` is a dictionary, so we transform the dictionary into a function by using the `lambda` keyword. See the documentation here: https://docs.python.org/3/reference/expressions.html#lambda – Arne May 04 '21 at 18:24
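
To illustrate the two comment threads above, here is a small sketch with made-up data. It shows that the lambda is equivalent to a named function (lookup_type is a hypothetical name), and one way to find codes that are present in the data but missing from the dictionary, which is what would make the lookup fail. The status values, including the bare '6', are assumed for illustration only:

```python
import pandas as pd

types = {'GASOLINE': 2, 'DIESEL': 2, 'ACENAPHTENE': 1}
contaminants = pd.Series(['GASOLINE', 'ACENAPHTENE', 'DIESEL'])

# A named function that does the same thing as `lambda x: types[x]`
def lookup_type(name):
    return types[name]

with_lambda = contaminants.apply(lambda x: types[x])
with_named = contaminants.apply(lookup_type)
print(with_lambda.tolist() == with_named.tolist())

# Hypothetical status codes; '6' has no entry in the (shortened) dictionary
status = pd.Series(['1A', '6', '3'])
ACT = {'1A': 'CONTAMINATION CONFIRMED', '3': 'CORRECTIVE ACTION PLANNING'}

# Codes that occur in the data but are not keys of the dictionary
missing = sorted(set(status) - set(ACT))
print(missing)
```

Any code listed in missing needs either its own dictionary entry or a cleanup step in the data before the lookup is applied.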