
I am using Apache PDFBox to extract text page by page. While doing so, I also need to drop any strike-through text that appears within the page content.

I tried the approach from "Detect Bold, Italic and Strike Through text using PDFBox with VB.NET", but it fails for most of my scenarios.

I also tried the approach from "PDFBox delete comment maintain strikethrough".

Are there any libraries that do this?

Amedee Van Gasse
  • There aren't. A line that strikes through text is just that: a sequence of graphics state operators such as `moveTo`, `lineTo`, `stroke`. You are looking at a property in a font such as `font-weight` or `font-style`, but whether or not a line is drawn through text isn't a property of a font. You'll need to parse the content for lines and get the coordinates; parse the content for text and get the coordinates; then compare the coordinates of the lines and the text to discover which text matches your query. This can be done by iText, but the code to do that isn't something we can give for free. – Bruno Lowagie May 11 '18 at 08:22
  • It's also strange that your question is entirely about PdfBox, but that you tag the question as an iText question. I'm going to vote to close your question as off-topic, because Stack Overflow can't be used to ask for recommendations. See the Stack Overflow FAQ for more info. – Bruno Lowagie May 11 '18 at 08:25
  • If you had read the answer you *tried* carefully, you'd have noticed that its code specifically shows how to identify strike-through effects generated like in the sample document provided with the question it answers. If that code is *failing for most of* your *scenarios*, then the strike-through effect therein most likely is generated differently. So why don't you inspect your *scenarios* and find out how it is done there? Or, if you don't feel up to that task, share those PDFs to enable us to help you do so? – mkl May 11 '18 at 09:09
  • Here's an answer that shows how to get the lines / shapes with PDFBox: https://stackoverflow.com/questions/38931422/ – Tilman Hausherr May 11 '18 at 10:07
  • I removed the iText tag because the question is not about iText. – Amedee Van Gasse May 11 '18 at 10:54
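
The coordinate comparison described in the first comment can be sketched in plain Java. This is only a minimal sketch and not code from any of the linked answers: it assumes the text boxes and line segments have already been extracted from the page (for example with PDFBox), that y grows upward as in PDF user space, and that each text box starts at the baseline. `StrikeThroughFilter`, `TextBox`, `Segment` and the 20%/80%/50% thresholds are hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class StrikeThroughFilter {

    /** Hypothetical text fragment: baseline at y, glyphs reaching up to y + height. */
    static final class TextBox {
        final String text;
        final double x, y, width, height;

        TextBox(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Hypothetical stroked line segment taken from the page's path operators. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /** True if the segment looks like a strike-through of the given text box. */
    static boolean isStruck(TextBox t, Segment s) {
        // Ignore segments that are not (nearly) horizontal relative to the text height.
        if (Math.abs(s.y1 - s.y2) > 0.1 * t.height) {
            return false;
        }
        // Assumed heuristic: a strike-through runs through the middle band of the glyphs,
        // whereas an underline sits at or below the baseline.
        double lineY = (s.y1 + s.y2) / 2.0;
        boolean inMiddleBand = lineY > t.y + 0.2 * t.height && lineY < t.y + 0.8 * t.height;

        // Require the segment to cover at least half of the text box horizontally.
        double left = Math.min(s.x1, s.x2);
        double right = Math.max(s.x1, s.x2);
        double overlap = Math.min(right, t.x + t.width) - Math.max(left, t.x);
        return inMiddleBand && overlap >= 0.5 * t.width;
    }

    /** Returns the text of every box that no segment strikes through, in input order. */
    static List<String> keepUnstruck(List<TextBox> texts, List<Segment> lines) {
        List<String> kept = new ArrayList<>();
        for (TextBox t : texts) {
            boolean struck = false;
            for (Segment s : lines) {
                if (isStruck(t, s)) {
                    struck = true;
                    break;
                }
            }
            if (!struck) {
                kept.add(t.text);
            }
        }
        return kept;
    }
}
```

As mkl's comment points out, the strike-through effect can be generated in different ways in different documents (a thin filled rectangle instead of a stroked line, for instance), so the extraction step and the thresholds would have to be adapted to the PDFs at hand.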

0 Answers