I'm trying to clean a pdf to turn it into a file for geocoding. I've been using tabula-py to rip the pdf and have had pretty good results up until the point of removing rows that are empty entirely. I'm not even sure if this is an efficient way to do this.
I've tried the majority of solutions SO has recommended to me and I still can't quite figure it out. I've set inplace = True, axis = 0 and 1, how="all". Tried indexing out the NaN values and that didn't work either.
import pandas as pd
import tabula
pd.set_option('display.width', 500)
df = tabula.read_pdf("C:\\Users\\Jack\\Documents\\Schoolwork\\Schoolwork\\WICResearch\\RefDocs\\wicclinicdirectory.pdf", pages='all', guess = False, pandas_options={'header': None})
df.columns = ["County_Name", "Clinic_Number", "Clinic_Name", "Address", "City", "Zip_Code", "Phone_Number", "Hours_of_Operation"]
df.drop(["Phone_Number", "Hours_of_Operation"], axis = 1, inplace = True)
#mass of code here that removes unwanted repeated column headers as by product of tabula reading PDFs.
df.drop(["Clinic_Name"], axis = 1, inplace = True)
df[['ClinicNum','ClinicName']] = df.Clinic_Number.apply(lambda x: pd.Series(str(x).split(" ", maxsplit = 1)))
df.drop(["Clinic_Number"], axis = 1, inplace = True)
#df[~df.isin(['NaN', 'NaT']).any(axis=1)]
#df.dropna(axis= 0, how ='all', inplace = True)
NaNIndex = df.index[df.isnull().all(1)]
print(NaNIndex)
print(df)
The above code gives this output:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
0 NaN NaN Ohio WIC Clinic Locations NaN nan NaN
1 NaN NaN NaN NaN Clinic NaN
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
4 NaN NaN NaN NaN nan NaN
5 NaN NaN NaN NaN nan NaN
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
16 NaN NaN NaN NaN nan NaN
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
And what I'd like is:
Index: []
County_Name Address City Zip_Code ClinicNum ClinicName
3 Adams 9137 State Route 136 West Union 45693 100 Adams County WIC Program
6 Allen 940 North Cable Road, Suite 4 Lima 45805 200 Allen County WIC Program
7 Ashland 934 Center Street Ashland 44805 300 Ashland County WIC Program
8 NaN Suite E NaN NaN nan NaN
9 Ashtabula Geneva Community Center Geneva 44041 403 Geneva WIC Clinic
10 NaN 62 West Main Street NaN NaN nan NaN
11 Ashtabula Jefferson United Methodist Church Jefferson 44047 402 Jefferson WIC Clinic
12 NaN 125 East Jefferson Street NaN NaN nan NaN
13 Ashtabula Conneaut Human Resource Center Conneaut 44030 401 Conneaut WIC Clinic
14 NaN 327 Mill Street NaN NaN nan NaN
15 Ashtabula 3225 Lake Avenue Ashtabula 44004 400 Ashtabula County WIC Program
18 NaN NaN NaN NaN Clinic NaN
20 Ashtabula St. Mary's Catholic Church Orwell 44076 490 Orwell WIC Clinic
I am able to create the data frame I want with the correct headings but it still does not remove the NaN values. Or it removes the entire thing. I'd also like to be able to move the rows that are not all NaN values into the correlating ones so they are all one line.
I'm also not sure how reproducible I can get this as I have fiddled around with tabula quite a bit trying to get this pdf converted.