1

I have a PDF document reporting COVID-19 numbers for counties in a state in a table in the document. I am reading the table into a pandas dataframe using camelot, and I am extracting the values in various rows based on the value, the county's name, in the first column. For that, I am using Boolean indexing as described here: How do I sum values in a column that match a given condition using pandas?

I am using the data extracted to report on the COVID-19 statistics in a subset of counties listed in the report that are of interest to my organization. I am also extracting the total numbers for the state, but the producers of the PDF cannot decide if they want to call that row of data "Gesamt" ("Total") or "Gesamtergebnis" ("Total result"). The dataframe I am working with, after camelot extracts the table from the PDF, looks like this:

                0        1       2        3
...
9        A County   13.789   (+22)  1.566,0
10      My County   16.581   (+45)  3.040,0
11   Their County    7.445   (+15)  2.821,6
... 
55         Gesamt  304.950  (+820)  2.747,2

The code below works, if they use "Gesamt." I would like to write it so that it will also work if they use "Gesamtergebnis." I cannot rely on the total ("Gesamt" or "Gesamtergebnis") always being in the same row.

# Open LGA reports for yesterday and the day before
# TO DO: Sometimes the LGA report is named COVID_Lagebericht_LGA_yymmdd.pdf or it ends in _01
#        Add in a try/else statement to compensate for this
rptyes = f'Reports_LGA/{yday_yymmdd}_COVID_Tagesbericht_LGA.pdf'
rptdbf = f'Reports_LGA/{daybef_yymmdd}_COVID_Tagesbericht_LGA.pdf'

# Read the LGA reports into dataframes.
dfyes = camelot.read_pdf(rptyes, pages='2', flavor='stream')
dfdbf = camelot.read_pdf(rptdbf, pages='2', flavor='stream')

# Extract the statewide 7-D-I
# TO DO: Sometimes the last line says "Gesamt", sometimes "Gesamtergebnis" or something else.
#        Add in some sort of error checking or try/else statement or regular expression to compensate
landindexyes = lambda land: dfyes[0].df.loc[dfyes[0].df[0] == land].index[0]
landindexdbf = lambda land: dfdbf[0].df.loc[dfdbf[0].df[0] == land].index[0]
land = 'Gesamt'
bwname = 'Baden-Württemberg'
bwcases = int(dfyes[0].df.loc[landindexyes(land), 1].replace('.',''))
bwcasesdiff = dfyes[0].df.loc[landindexyes(land), 2]
bwdeaths = int(dfyes[0].df.loc[landindexyes(land), 4].replace('.',''))
bwdeathsdiff = dfyes[0].df.loc[landindexyes(land), 5]
bw7diyes = float(dfyes[0].df.loc[landindexyes(land), 7].replace(',','.'))
bw7didbf = float(dfdbf[0].df.loc[landindexdbf(land), 7].replace(',','.'))
bw7didiff = bw7diyes - bw7didbf
rptrowsbw = [bwname, bwcases, bwcasesdiff, bwdeaths, bwdeathsdiff, bw7diyes, bw7didbf]

How can I use a regular expression to match either "Gesamt" or "Gesamtergebnis" in the variable passed to the lambda expressions 'landindexyes' and 'landindexdbf'?

If regular expressions are not the way to go, I am open to other suggestions. I am thinking that an if/else might work, but I don't think that would be as elegant.

  • I believe you can use the [extract](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) or [contains](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) methods – Paulo Marques Feb 12 '21 at 19:58
  • The contains() method may work, as @Ena has suggested, but I don't understand how to work it into the [landindexyes(land) portion of the code. – T Sean Schulze Feb 13 '21 at 13:06

1 Answers1

2

Unfortunately, I can't see your data frame, so I can't write 100% correct lines. I would like to refer you to the first answer here: Filtering DataFrame by finding exact word (not combined) in a column of strings.

So, in your case something like:

df[df["column_name"].str.contains(r'(?:\s|^)Gesamt(?:\s|$)')]]==True 

or

df[df["column_name"].str.contains(r'(?:\s|^)Gesamtergebnis(?:\s|$)')]]==True 

If you are not sure that the spelling is correct in your dataset, you can try matching algorithms, such as Fuzzy Wuzzy: https://www.datacamp.com/community/tutorials/fuzzy-string-python.

Edit (from comments): RegEx slows code a lot, so what about an idea to change all "Gesamtergebnis" values into "Gesamt" in the column? So, you can use something like this in your TODO part:

df_name['column_name'] = df_name['column_name'].str.replace('Gesamtergebnis','Gesamt')

And continue with your code.

Ena
  • 106
  • 7
  • I edited my original question to add an excerpt of the dataframe I am working with. Since the row I am interested in is the only row where the value in that column begins with "Gesamt". I tried: ``` bwcases = int(dfyes[0].df.loc[dfyes[0].df.loc[dfyes[0].df[0].str.contains(r'(^)Gesamt')], 1].replace('.','')) ``` But it gives the following error: ValueError: Cannot index with multidimensional key – T Sean Schulze Feb 13 '21 at 13:47
  • RegEx slows code a lot, so what about an idea to change all "Gesamtergebnis" values into "Gesamt" in the column? So, you can use something like this in your TODO part: df_name['column_name'] = df_name['column_name'].str.replace('Gesamtergebnis','Gesamt') and continue with your code. – Ena Feb 14 '21 at 10:38
  • Additionally, you can replace also 'gesamt' with 'Gesamt' and etc and do data cleaning this way. – Ena Feb 14 '21 at 10:41
  • 1
    I will try the str.replace() method. I finally got it working using a non-RegEx solution. I noticed that most often the Gesamtergebnis and Gesamt row was the second from last row, so I assigned (df_name[0].shape[0] - 2) to a variable and used the variable as the row coordinate. I think your suggestion of replacing the value with 'Gesamt' would be more robust, though. – T Sean Schulze Feb 14 '21 at 14:12
  • The hard brackets don't match in this answer. I see one more "]" than should have. Also doesn't work for me b/c it creates NA index value error. – pauljohn32 Jan 31 '23 at 20:46