I need to identify (but not extract) non-selectable images in PDFs. I went through the thread 'Extract images from PDF without resampling, in python?' but didn't find a solution. If I'm correct, PyMuPDF can only identify selectable images (those that get shaded when you click them). Please refer to the line chart in 61899345.pdf in the link as an example of a non-selectable image. Because I have a large number of such files to process, I guess I have to find a rule that defines what counts as an image. Thank you.
-
Your link does not work. Have you perhaps forgotten to make it publicly accessible? – mkl Oct 04 '19 at 09:17
-
@mkl Sorry. I just fixed it. Could you please try it again? – DAQI XIN Oct 04 '19 at 13:19
-
The line chart in 61899345.pdf is not a (bitmap) image at all, it is a mixture of vector graphic elements (lines, diamond shapes) and text. Thus, it cannot be *extracted* as bitmap image, you can merely *render* it to a bitmap. You still have to somehow find the actual dimensions of what you consider a connected chart, though, as in a PDF there does not need to be a mechanism bundling those vector graphics and text elements... – mkl Oct 06 '19 at 14:12
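One way to approximate "the actual dimensions of a connected chart" is to merge the bounding boxes of vector elements that lie close together. The following is only a sketch under assumptions: the input is a plain list of `(x0, y0, x1, y1)` rectangles that you have already obtained from your PDF library (for instance, the `rect` entries that PyMuPDF's `Page.get_drawings()` reports, if that is the library in use), and `cluster_rects` and the `gap` threshold are hypothetical names and values chosen for illustration.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


def _near(a: Rect, b: Rect, gap: float) -> bool:
    """True if a and b overlap or lie within `gap` units of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def _union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster_rects(rects: List[Rect], gap: float = 5.0) -> List[Tuple[Rect, int]]:
    """Group nearby rectangles; return (bounding box, element count) per group."""
    clusters: List[Tuple[Rect, int]] = []
    for r in rects:
        bbox, count = r, 1
        merged = True
        # Keep absorbing clusters until the grown box touches none of them.
        while merged:
            merged = False
            for i, (cb, cc) in enumerate(clusters):
                if _near(bbox, cb, gap):
                    bbox = _union(bbox, cb)
                    count += cc
                    del clusters[i]
                    merged = True
                    break
        clusters.append((bbox, count))
    return clusters
```

Each resulting bounding box is a candidate region that could then be rendered to a bitmap. The right `gap` depends on the documents and would need tuning.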
-
@mkl Thank you for your explanation. I don't really need to extract it; identifying images is what I need. When you say render, do you mean cropping the area I consider an image and saving it as a bitmap? Is it possible to recognize the vector graphic elements? – DAQI XIN Oct 07 '19 at 00:18
-
*"When you say render, do you mean..."* - yes. *"Is it possible to recognize the vector graphic elements?"* - I don't know the features of the common Python PDF libraries, but conceptually it should be possible. The problem is, though, that vector graphics elements are also used to draw, e.g., borders and colored backgrounds of tables, text underlines, text box backgrounds, and some other artifacts. Thus, you have to somehow recognize their usage. – mkl Oct 07 '19 at 04:44
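As a rough illustration of "recognizing usage", a first filter might reject candidate regions that are very thin (underlines, rules) or that contain only a handful of vector elements (a box background, a simple border). This is a heuristic sketch, not a reliable classifier: `looks_like_chart` and both thresholds are hypothetical, and a table with many cell borders would still pass.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


def looks_like_chart(bbox: Rect, n_elements: int,
                     min_side: float = 50.0, min_elements: int = 10) -> bool:
    """Heuristic guess: is this region a chart rather than a decoration?

    bbox        bounding box of the candidate region
    n_elements  number of vector elements found inside it
    Both thresholds are assumptions and need tuning on real documents.
    """
    width = bbox[2] - bbox[0]
    height = bbox[3] - bbox[1]
    # Underlines and rules are long in one direction but very thin in the other.
    if min(width, height) < min_side:
        return False
    # A lone background rectangle or frame consists of very few elements.
    return n_elements >= min_elements
```

For example, `looks_like_chart((72, 700, 300, 701), 1)` rejects an underline-like region, while a 328 by 200 region holding 40 elements is accepted.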
-
@mkl Thank you again. It seems like a difficult task to distinguish the usage of those vector graphic elements. – DAQI XIN Oct 07 '19 at 18:46