This is my first time using reticulate. I have 20 multi-page pdf tables I'm pulling data from using camelot in python (they're not simple tables so I need the more powerful table reader). It creates a list of tables (one table for each page) and makes a TableList object. I'm able to loop over the list and convert the tables to pandas dataframes. Example of doing this with one of the pdfs:
tables2001 = camelot.read_pdf('2001.pdf', flavor='stream', pages='1-end')
df2001 = list()
for t in tables2001:
    df = t.df
    df2001.append(df)
I can then return to r, and rdf2001 <- py$df2001 gives me a list of r data.frames.
However, if I instead put the python list of dataframes into either a nested list or a dictionary containing lists, the r conversion no longer works, and the resulting nested list still contains pandas DataFrames. An attempt to manually convert one of the dfs understandably gives this:
Error in as.data.frame.default(rdf2001_nested[[1]]) :
cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame
If I pull a single list from a nested list into r, e.g. df2001_a <- py$df2001[1], that converts to a single list of r data.frames. I can't do the same for a dictionary, since the conversion keeps the key as a list so the nesting still exists.
The idea of using a dictionary was to get a named list in r identifying each year, since the tables themselves do not contain that information. I can work around it, but the dictionary to named list would be, to me, the clearest way to do this, assuming it would work. Trying nested lists was to figure out if the conversion issue only happened with dictionaries, which it doesn't; it's with any kind of nesting.
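The kind of workaround I have in mind is sketched below: flatten the dictionary into one single-level list of tables (the shape that did convert for me) plus a parallel list of year labels. The names flat_dfs and flat_years are made up for illustration, and plain strings stand in for the DataFrames so the snippet runs without camelot or the pdfs.

```python
# Stand-in for dfdict: strings take the place of the pandas DataFrames
# (an assumption made so this runs without camelot or the pdfs).
dfdict = {
    "2001": ["df_2001_p1", "df_2001_p2"],
    "2002": ["df_2002_p1", "df_2002_p2"],
}

# One flat, single-level list of tables.
flat_dfs = [df for dfs in dfdict.values() for df in dfs]

# Parallel list of year labels, one per table, in the same order.
flat_years = [year for year, dfs in dfdict.items() for _ in dfs]

print(flat_years)
```

On the r side I would then expect something like split(py$flat_dfs, unlist(py$flat_years)) to regroup the tables by year, though I have not confirmed that, and it does not answer why the nested version fails.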
I'm trying to understand why this is happening. Can reticulate only convert a single level of a list? Is there an underlying reason for this or is it just that that ability hasn't been added but in theory could be?
Update with full code:
Pdf tables are here. I extracted the pages covering criminal caseloads for each year which is why pages are listed as 1-end; each has 14 pages. Python code run with repl_python() - works and gives the outcome I intend for both the list and dictionary:
import camelot
import pandas
# Lists
tables2001 = camelot.read_pdf('2001.pdf', flavor='stream', pages='1-end')
tables2002 = camelot.read_pdf('2002.pdf', flavor='stream', pages='1-end')
tables2003 = camelot.read_pdf('2003.pdf', flavor='stream', pages='1-end')
dflist = list()
tablelist = [tables2001, tables2002, tables2003]
for tl in tablelist:
    # each TableList holds one table per page
    for t in tl:
        dflist.append(t.df)
# Dictionary - I got help with this from someone who knows python well
tables = { f'20{str(n).zfill(2)}': camelot.read_pdf(f'20{str(n).zfill(2)}.pdf',
    flavor='stream', pages='1-end', table_regions=['50,580,780,50']) for n in range(1,4)}
dfdict = { k: [df.df for df in v] for k, v in tables.items() }
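To check which years the comprehension covers, here is a minimal sketch of just the key-building part. year_keys is a helper name made up for illustration, and no pdfs are read. Since the upper bound of range is exclusive, range(1, 3) stops at 2002 and range(1, 4) is what reaches 2003.

```python
# Build the year keys the same way as the dictionary comprehension,
# without calling camelot (year_keys is a hypothetical helper).
def year_keys(start, stop):
    return [f"20{str(n).zfill(2)}" for n in range(start, stop)]

print(year_keys(1, 3))  # ['2001', '2002']
print(year_keys(1, 4))  # ['2001', '2002', '2003']
```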
R code:
library(reticulate)
# List
rdflist <- py$dflist
# Dictionary
rdfdict <- py$dfdict
rdflist is a list of data.frames. rdfdict is a named nested list, containing 3 lists (2001, 2002, 2003), each with 14 pandas dataframes, i.e. not usable in r.
class(rdflist[[1]])
[1] "data.frame"
class(rdfdict[[1]][[1]])
[1] "pandas.core.frame.DataFrame" "pandas.core.generic.NDFrame"
[3] "pandas.core.base.PandasObject" "pandas.core.base.StringMixin"
[5] "pandas.core.accessor.DirNamesMixin" "pandas.core.base.SelectionMixin"
[7] "python.builtin.object"
Attempt to coerce a single df to data.frame:
as.data.frame(rdfdict[[1]][[1]])
Error in as.data.frame.default(rdfdict[[1]][[1]]) :
cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame