0

I'm trying to clean a pdf to turn it into a file for geocoding. I've been using tabula-py to rip the pdf and have had pretty good results up until the point of removing rows that are empty entirely. I'm not even sure if this is an efficient way to do this.

I've tried the majority of solutions SO has recommended to me and I still can't quite figure it out. I've set inplace = True, axis = 0 and 1, how="all". Tried indexing out the NaN values and that didn't work either.

import pandas as pd
import tabula
pd.set_option('display.width', 500)
df = tabula.read_pdf("C:\\Users\\Jack\\Documents\\Schoolwork\\Schoolwork\\WICResearch\\RefDocs\\wicclinicdirectory.pdf", pages='all', guess = False, pandas_options={'header': None})
df.columns = ["County_Name", "Clinic_Number", "Clinic_Name", "Address", "City", "Zip_Code", "Phone_Number", "Hours_of_Operation"]
df.drop(["Phone_Number", "Hours_of_Operation"], axis = 1, inplace = True)

#mass of code here that removes unwanted repeated column headers as by product of tabula reading PDFs.

df.drop(["Clinic_Name"], axis = 1, inplace = True)

    df[['ClinicNum','ClinicName']] = df.Clinic_Number.apply(lambda x: pd.Series(str(x).split(" ", maxsplit = 1)))

    df.drop(["Clinic_Number"], axis = 1, inplace = True)

    #df[~df.isin(['NaN', 'NaT']).any(axis=1)]

    #df.dropna(axis= 0, how ='all', inplace = True)

    NaNIndex = df.index[df.isnull().all(1)]
    print(NaNIndex)
    print(df)

The above code gives this output: 

Index: []
    County_Name                              Address                       City Zip_Code ClinicNum                                ClinicName
0           NaN                                  NaN  Ohio WIC Clinic Locations      NaN       nan                                       NaN
1           NaN                                  NaN                        NaN      NaN    Clinic                                       NaN
3         Adams                 9137 State Route 136                 West Union    45693       100                  Adams County WIC Program
4           NaN                                  NaN                        NaN      NaN       nan                                       NaN
5           NaN                                  NaN                        NaN      NaN       nan                                       NaN
6         Allen        940 North Cable Road, Suite 4                       Lima    45805       200                  Allen County WIC Program
7       Ashland                    934 Center Street                    Ashland    44805       300                Ashland County WIC Program
8           NaN                              Suite E                        NaN      NaN       nan                                       NaN
9     Ashtabula              Geneva Community Center                     Geneva    44041       403                         Geneva WIC Clinic
10          NaN                  62 West Main Street                        NaN      NaN       nan                                       NaN
11    Ashtabula    Jefferson United Methodist Church                  Jefferson    44047       402                      Jefferson WIC Clinic
12          NaN            125 East Jefferson Street                        NaN      NaN       nan                                       NaN
13    Ashtabula       Conneaut Human Resource Center                   Conneaut    44030       401                       Conneaut WIC Clinic
14          NaN                      327 Mill Street                        NaN      NaN       nan                                       NaN
15    Ashtabula                     3225 Lake Avenue                  Ashtabula    44004       400              Ashtabula County WIC Program
16          NaN                                  NaN                        NaN      NaN       nan                                       NaN
18          NaN                                  NaN                        NaN      NaN    Clinic                                       NaN
20    Ashtabula           St. Mary's Catholic Church                     Orwell    44076       490                         Orwell WIC Clinic

And what I'd like is:

Index: []
    County_Name                              Address                       City Zip_Code ClinicNum                                ClinicName

3         Adams                 9137 State Route 136                 West Union    45693       100                  Adams County WIC Program
6         Allen        940 North Cable Road, Suite 4                       Lima    45805       200                  Allen County WIC Program
7       Ashland                    934 Center Street                    Ashland    44805       300                Ashland County WIC Program
8           NaN                              Suite E                        NaN      NaN       nan                                       NaN
9     Ashtabula              Geneva Community Center                     Geneva    44041       403                         Geneva WIC Clinic
10          NaN                  62 West Main Street                        NaN      NaN       nan                                       NaN
11    Ashtabula    Jefferson United Methodist Church                  Jefferson    44047       402                      Jefferson WIC Clinic
12          NaN            125 East Jefferson Street                        NaN      NaN       nan                                       NaN
13    Ashtabula       Conneaut Human Resource Center                   Conneaut    44030       401                       Conneaut WIC Clinic
14          NaN                      327 Mill Street                        NaN      NaN       nan                                       NaN
15    Ashtabula                     3225 Lake Avenue                  Ashtabula    44004       400              Ashtabula County WIC Program
18          NaN                                  NaN                        NaN      NaN    Clinic                                       NaN
20    Ashtabula           St. Mary's Catholic Church                     Orwell    44076       490                         Orwell WIC Clinic

I am able to create the data frame I want with the correct headings but it still does not remove the NaN values. Or it removes the entire thing. I'd also like to be able to move the rows that are not all NaN values into the correlating ones so they are all one line.

I'm also not sure how reproducible I can get this as I have fiddled around with tabula quite a bit trying to get this pdf converted.

JDubs
  • 1
  • 1
  • There's a lot of code here that doesn;t seem to be relevant to the question at hand. Can you perhaps pare it down and provide a sample input and output, as in [How to create good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), and create a [mcve]? As it stands there's not way for us to know without seeing the data – G. Anderson Jul 16 '19 at 21:46
  • Sure thing! Let me clean it up and make it a little more bearable @G.Anderson – JDubs Jul 17 '19 at 03:03
  • Could you clarify exactly what rows you are trying to remove? In your desired output it keeps some rows with some `Null` values and removes some with `Null` values, but not just all `Null` values. For example why is row 0 removed but row 8 left in? Are you trying to remove: - Any row with a Null - Any row with all Nulls - Any row with a Null in a specified column - Something else – NickHilton Jul 17 '19 at 19:32
  • @hilton92 Row 8 has part of an address that I need to find a way to merge into the row above but that's probably a different question. But yes in the example row 16 should be removed as that is all null. And row 18 should've been better cleaned by me and then removed as well as a null row. – JDubs Jul 18 '19 at 21:52
  • So it looks like you want to remove any row with a NULL address? That is a seperate problem to the problem of merging the address fields for rows where the address appears to be split – NickHilton Jul 18 '19 at 22:20
  • @hilton92 I want to remove any row with Null across the entire row. But yes, separate problem to then merge the rows. – JDubs Jul 19 '19 at 20:52
  • Row 0 doesn't have a Null across the entire row, but you removed that in your example? – NickHilton Jul 19 '19 at 22:07

0 Answers0