-1

Code to drop rows based on a partial string is not working.

Very simple code, and it runs fine but doesn't drop the rows I want.

The original table in the pdf looks like this:

Chemical Value Unit Type
Fluoride 0.23 ug/L Lab
Mercury 0.15 ug/L Lab
Sum of Long Chained Polymers 0.33
Partialsum of Short Chained Polymers 0.40

What I did:

import csv 
import tabula

dfs = tabula.read _pdf("Test.pdf", pages= 'all')
file = "Test.pdf"
tables = tabula.read_pdf(file, pages=2, stream=True, multiple_tables=True)

table1 = tables[1]
table1.drop('Unit', axis=1, inplace=True) 
table1.drop('Type', axis=1, inplace=True)
discard = ['sum','Sum']
table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)
table1.to_csv('test.csv')

The results are that it drops the 2 columns I don't want, so that's fine. But it did not delete the rows with the words "sum" or "Sum" in them. Any insights?

Karl
  • 11
  • 1
  • Please post a working example - We don't have your pdf, but really, that part is irrelevant. You could initialize `table1` to the example values. And then fix the invalid reference to `.Parameter`. – tdelaney Nov 20 '22 at 18:13
  • I fixed the "Parameter" - was supposed to be "Chemical". I am not sure how to "post a working example", so maybe I am in over my head here. If it was working, I wouldn't be asking, so maybe I am misunderstanding. The above code runs, but doesn't delete the rows containing "sum" in the Chemical column. – Karl Nov 20 '22 at 18:24
  • Yeah, narrowing down examples can be challenging but helps the rest of us. In my answer I included a note on how I generated the script. Its a good skill to develop as a python developer. Whenever I hit a roadblock, I fire up the python shell or write one-off test scripts to focus on just problem area. – tdelaney Nov 20 '22 at 19:04
  • See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – BeRT2me Nov 21 '22 at 00:07
  • Also see: [In pandas, is inplace = True considered harmful, or not?](https://stackoverflow.com/a/60020384/11865956) – BeRT2me Nov 21 '22 at 00:24

2 Answers2

0

You are close. You did drop the rows, but you didn't save the result.

import pandas as pd

example = {'Chemical': ['Fluoride', 'Mercury', 'Sum of Long Chained Polymers',
                'Partialsum of Short Chained Polymers'], 
            'Value': [0.23, 0.15, 0.33, 0.4], 
            'Unit': ['ug/L', 'ug/L', '', ''], 
            'Type': ['Lab', 'Lab', '', '']}

table1 = pd.DataFrame(example)
table1.drop('Unit', axis=1, inplace=True)
table1.drop('Type', axis=1, inplace=True)
discard = ['sum','Sum']
table1 = table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)
tdelaney
  • 73,364
  • 6
  • 83
  • 116
  • To get this example, I copied your table and used `table1 = pd.read_clipboard()`. But that's still not something others can replicate, so I used `table1.to_dict()` to write a dictionary. Pasting that into a test.py file, I had the data without the pdf part. Its more focused and others can run it. – tdelaney Nov 20 '22 at 19:02
0

You can use pd.Series.str.contains with the argument case=False to ignore case:

Also, it's not law, but often considered poor practice to use inplace=True... because in part it leads to confusions like the one you're experiencing.

Given df:

                               Chemical  Value  Unit  Type
0                              Fluoride   0.23  ug/L   Lab
1                               Mercury   0.15  ug/L   Lab
2          Sum of Long Chained Polymers   0.33   NaN   NaN
3  Partialsum of Short Chained Polymers   0.40   NaN   NaN

Doing:

df = (df.drop(['Unit', 'Type'], axis=1)
        .loc[~df.Chemical.str.contains('sum', case=False)])

Output:

   Chemical  Value
0  Fluoride   0.23
1   Mercury   0.15
BeRT2me
  • 12,699
  • 2
  • 13
  • 31