This is my first time using reticulate. I have 20 multi-page PDF tables I'm pulling data from using camelot in Python (they're not simple tables, so I need the more powerful table reader). camelot reads each PDF into a TableList object holding one table per page. I'm able to loop over that list and convert the tables to pandas DataFrames. Example of doing this with one of the PDFs:

tables2001 = camelot.read_pdf('2001.pdf', flavor='stream', pages='1-end')
df2001 = list()
for t in tables2001:
  df = t.df
  df2001.append(df)

I can then return to r, and rdf2001 <- py$df2001 gives me a list of r data.frames.

However, if I instead put the python list of dataframes into either a nested list or a dictionary containing lists, the r conversion no longer works, and the resulting nested list still contains pandas data.frames. An attempt to manually convert one of the dfs understandably gives this:

Error in as.data.frame.default(rdf2001_nested[[1]]) : 
  cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame

If I pull a single list from a nested list into R, e.g. df2001_a <- py$df2001[1], that converts to a single list of R data.frames. I can't do the same for a dictionary, since the conversion keeps the key as a list, so the nesting still exists.

The idea of using a dictionary was to get a named list in R identifying each year, since the tables themselves do not contain that information. I can work around it, but converting the dictionary to a named list would be, to me, the clearest way to do this, assuming it worked. Trying nested lists was to figure out whether the conversion issue only happened with dictionaries, which it doesn't; it happens with any kind of nesting.

I'm trying to understand why this is happening. Can reticulate only convert a single level of a list? Is there an underlying reason for this or is it just that that ability hasn't been added but in theory could be?

Update with full code:

Pdf tables are here. I extracted the pages covering criminal caseloads for each year which is why pages are listed as 1-end; each has 14 pages. Python code run with repl_python() - works and gives the outcome I intend for both the list and dictionary:

import camelot
import pandas

# Lists
tables2001 = camelot.read_pdf('2001.pdf', flavor='stream', pages='1-end')
tables2002 = camelot.read_pdf('2002.pdf', flavor='stream', pages='1-end')
tables2003 = camelot.read_pdf('2003.pdf', flavor='stream', pages='1-end')

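dflist = list()
tablelist = [tables2001, tables2002, tables2003]
# Each element of tablelist is a TableList, so loop over its tables
# and collect every page's DataFrame into one flat list
for tables in tablelist:
  for t in tables:
    dflist.append(t.df)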
  
# Dictionary - I got help with this from someone who knows Python well
tables = { f'20{str(n).zfill(2)}': camelot.read_pdf(f'20{str(n).zfill(2)}.pdf',
           flavor='stream', pages='1-end', table_regions=['50,580,780,50']) for n in range(1,4)}

dfdict = { k: [df.df for df in v] for k, v in tables.items() }

R code:

library(reticulate)

# List
rdflist <- py$dflist

# Dictionary
rdfdict <- py$dfdict

rdflist is a list of data.frames. rdfdict is a named nested list, containing 3 lists (2001, 2002, 2003), each with 14 pandas dataframes, i.e. not usable in r.

class(rdflist[[1]])
[1] "data.frame"
class(rdfdict[[1]][[1]])
[1] "pandas.core.frame.DataFrame"        "pandas.core.generic.NDFrame"       
[3] "pandas.core.base.PandasObject"      "pandas.core.base.StringMixin"      
[5] "pandas.core.accessor.DirNamesMixin" "pandas.core.base.SelectionMixin"   
[7] "python.builtin.object"  

Attempt to coerce a single df to data.frame:

as.data.frame(rdfdict[[1]][[1]])
Error in as.data.frame.default(rdfdict[[1]][[1]]) : 
  cannot coerce class ‘c("pandas.core.frame.DataFrame", "pandas.core.generic.NDFrame", ’ to a data.frame
Abigail
  • Please show the code of how you *put the python list of dataframes into either a nested list or a dictionary containing lists*. Please also show the code that raises that error. In fact, provide the full code block with all `library` lines for an [mcve]. – Parfait May 30 '22 at 18:16
  • @Parfait I updated the post with the requested code. If more is needed I can provide that as well. – Abigail May 30 '22 at 21:53
  • The dict version uses an additional argument compared to list: `table_regions=['50,580,780,50']`. – Parfait May 30 '22 at 23:56

1 Answer

Comparing the two versions, the dictionary version differs in a couple of ways: it passes an additional argument, table_regions, and its comprehension contains an extra nested loop, [df.df for df in v] (which, interestingly, did not raise an error in Python).

Consider making the two versions consistent so that the returned values are comparable. By the way, Python supports list comprehensions as well as dict comprehensions.

Python

import camelot 
import pandas as pd

# LIST COMPREHENSION
pydf_list = [
    [tbl.df for tbl in camelot.read_pdf(f'{yr}.pdf', flavor='stream', pages='1-end')]
    for yr in range(2001, 2004)
]

# DICT COMPREHENSION
pydf_dict = {
    str(yr): [tbl.df for tbl in camelot.read_pdf(f'{yr}.pdf', flavor='stream', pages='1-end')]
    for yr in range(2001, 2004)
}
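
For readers without the PDFs at hand, the shape of these structures can be sketched with stand-in objects. This is only an illustration: `FakeTable` and `fake_read_pdf` are hypothetical placeholders for camelot's tables and `camelot.read_pdf`, three pages per file is an arbitrary choice, and strings stand in for pandas DataFrames.

```python
from dataclasses import dataclass


@dataclass
class FakeTable:
    """Hypothetical stand-in for one camelot table; only .df matters here."""
    df: str


def fake_read_pdf(path, pages=3):
    """Hypothetical stand-in for camelot.read_pdf: one table per page."""
    return [FakeTable(df=f"{path}:page{p}") for p in range(1, pages + 1)]


years = range(2001, 2004)

# Flat list: every page of every year sits at the same level.
flat_list = [tbl.df for yr in years for tbl in fake_read_pdf(f"{yr}.pdf")]

# Nested list: one inner list per year, one element per page.
nested_list = [[tbl.df for tbl in fake_read_pdf(f"{yr}.pdf")] for yr in years]

# Dict of lists: the same nesting, keyed by year as a string.
nested_dict = {
    str(yr): [tbl.df for tbl in fake_read_pdf(f"{yr}.pdf")] for yr in years
}

print(len(flat_list), [len(inner) for inner in nested_list], list(nested_dict))
```

The flat list is the shape the question reports as converting automatically; the other two are the nested shapes that arrive in R still holding pandas objects.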

R

library(reticulate)

reticulate::source_python("myscript.py")

# NESTED LIST 
rdf_list <- reticulate::py$pydf_list 

# NESTED NAMED LIST 
rdf_dict <- reticulate::py$pydf_dict

However, as you indicate, I can reproduce the problematic conversion of a dict to a named list with a reproducible example. After I reported the issue, one suggestion from the maintainer was to use py_to_r:

rdf_dict2 <- lapply(rdf_dict, function(lst) lapply(lst, py_to_r))
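
Since the question reports that a flat list of DataFrames converts automatically, another possible workaround is to flatten the dictionary on the Python side before crossing into R. The sketch below is a suggestion, not something the maintainers recommended: `flatten_by_year` and the `year_pN` label format are made-up names, strings stand in for DataFrames, and I have not confirmed that this avoids the issue on every reticulate version.

```python
def flatten_by_year(dfdict):
    """Flatten {year: [frame, ...]} into parallel lists of labels and frames.

    Hypothetical helper: labels look like '2001_p1', '2001_p2', ...
    """
    labels, frames = [], []
    for year, pages in dfdict.items():
        for page_no, frame in enumerate(pages, start=1):
            labels.append(f"{year}_p{page_no}")
            frames.append(frame)
    return labels, frames


# Strings stand in for pandas DataFrames in this demo.
demo = {"2001": ["a", "b"], "2002": ["c"]}
labels, frames = flatten_by_year(demo)
print(labels)
print(frames)
```

In R, `py$frames` would then be a single-level list, and `py$labels` could be assigned as its names to keep the year information.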
Parfait
  • In both cases, the python code gives me the error `AttributeError: 'TableList' object has no attribute 'df'`, since each page is read in as a separate table and the file ends up as a list. The `table_regions` argument was there because a few of the 20 pdfs needed it. None of the sample 2001-2003 did however, so I removed it from the dictionary code and the result in R is the same. – Abigail Jun 01 '22 at 15:04
  • Whooops! Yes, I see now it is nested. See update running nested comprehensions. – Parfait Jun 01 '22 at 16:19
  • Alright so running what you have there, I still end up with pandas dataframes in both of the R lists. I think this is probably a limitation of the py to r conversion that has to be worked around which I can do. I wish I understood why though. – Abigail Jun 01 '22 at 16:59
  • Sorry, I didn't add the converter. Try `reticulate::py$pydf_list` and `reticulate::py$pydf_dict`. – Parfait Jun 01 '22 at 17:23
  • Still the same result. – Abigail Jun 01 '22 at 17:31
  • How about when specifying individual data frame elements: `reticulate::py$pydf_list[[1]][[1]]` or `reticulate::py$pydf_dict["2001"][[1]]`? We can then run an `lapply` conversion. – Parfait Jun 01 '22 at 17:34
  • It's still recognized as a pandas dataframe even when only pulling that. When I clicked view it gave me an error: `Warning message: In py_to_r.default(object$to_string(max_rows = 150L, show_dimensions = FALSE)) : Object to convert is not a Python object` – Abigail Jun 01 '22 at 17:48
  • I do replicate your issue on dummy random data. I add a workaround but may be data dependent. But this can be a [good ticket](https://github.com/rstudio/reticulate/issues) for reticulate authors with [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Parfait Jun 01 '22 at 21:15
  • Inspired by your post, I reported the [issue](https://github.com/rstudio/reticulate/issues/1221) and maintainer suggested to convert with `py_to_r`. See my edit. – Parfait Jun 02 '22 at 13:15
  • Your new code still gives me an error that it can't convert: `> rdf_dict2 <- lapply(rdf_dict, py_to_r) Error in py_to_r.default(X[[i]], ...) : Object to convert is not a Python object`. I ran it using the sample data in your issue post and the lapply did work for that so it's something about my data specifically. Weird. – Abigail Jun 02 '22 at 23:35
  • Just remembered, you have a nested object. So run nested `lapply`. See edit – Parfait Jun 02 '22 at 23:47
  • Ah it's working now! Thank you for taking the time to figure this whole thing out. – Abigail Jun 03 '22 at 16:06