Code to drop rows based on a partial string is not working.
Very simple code, and it runs fine but doesn't drop the rows I want.
The original table in the pdf looks like this:
Chemical | Value | Unit | Type |
---|---|---|---|
Fluoride | 0.23 | ug/L | Lab |
Mercury | 0.15 | ug/L | Lab |
Sum of Long Chained Polymers | 0.33 | | |
Partialsum of Short Chained Polymers | 0.40 | | |
What I did:
import csv
import tabula
dfs = tabula.read_pdf("Test.pdf", pages='all')
file = "Test.pdf"
tables = tabula.read_pdf(file, pages=2, stream=True, multiple_tables=True)
table1 = tables[1]
table1.drop('Unit', axis=1, inplace=True)
table1.drop('Type', axis=1, inplace=True)
discard = ['sum','Sum']
table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)
table1.to_csv('test.csv')
The result: it drops the two columns I don't want, so that part is fine. But it does not delete the rows containing "sum" or "Sum". Any insights?
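For anyone trying to reproduce this without the PDF, here is a minimal sketch that builds the frame by hand. It assumes the extracted table matches the one shown above; the hand-built `table1` and the names `pattern`, `mask`, and `filtered` are illustrative and not part of the original script. It suggests the likely cause: boolean indexing returns a new DataFrame rather than modifying `table1`, so the filtered result is discarded unless it is assigned to something.

```python
import pandas as pd

# Hypothetical stand-in for tables[1]; values copied from the table in the question.
table1 = pd.DataFrame({
    "Chemical": [
        "Fluoride",
        "Mercury",
        "Sum of Long Chained Polymers",
        "Partialsum of Short Chained Polymers",
    ],
    "Value": [0.23, 0.15, 0.33, 0.40],
})

discard = ["sum", "Sum"]
pattern = "|".join(discard)  # regex alternation: "sum|Sum"

# na=False is an added guard (assumption): treat missing names as "no match".
mask = table1["Chemical"].str.contains(pattern, na=False)

# Boolean indexing builds a NEW DataFrame; table1 itself is left untouched,
# so the result has to be kept in a variable (or assigned back to table1).
filtered = table1[~mask]

print(len(table1))                    # the original still has all 4 rows
print(filtered["Chemical"].tolist())  # only the non-matching rows remain
```

In the original script this would presumably mean writing `table1 = table1[~table1.Chemical.str.contains('|'.join(discard))]` before the `print` and `to_csv` calls. Passing `case=False` to `str.contains` with a single pattern would be an alternative to listing both spellings.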