I am working on PDF scanning, where I want to extract text from a PDF. I am using the PDF Multithreading.pdf for searching. I am able to extract the text, but I am not able to extract the spaces from it. I am only getting callbacks for the Tj operator and not for TJ. What can be the problem?

Thanks

Swaroop

1 Answer

I am able to extract the text, but I am not able to extract the spaces from it. I am only getting callbacks for the Tj operator and not for TJ.

The reasons are that in your sample document

  1. no spaces are used in the text drawing operations but instead the text drawing position is changed using Tm operations; and
  2. only Tj text drawing operations are used, no TJ ones.

E.g. the text drawing operations for the title on the title page are:

BT
/F0 50 Tf
1 0 0 1 60 669.225 Tm
(\0006)Tj                                    %  T
1 0 0 1 83.527 669.225 Tm
(\000J\000T)Tj                               %  hr
1 0 0 1 125.631 669.225 Tm
(\000G\000C\000F\000K\000P\000I)Tj           %  eading
1 0 0 1 273.395 669.225 Tm
(\0002)Tj                                    %  P
1 0 0 1 298.272 669.225 Tm
(\000T)Tj                                    %  r
1 0 0 1 313.599 669.225 Tm
(\000Q)Tj                                    %  o
1 0 0 1 340.076 669.225 Tm
(\000I\000T)Tj                               %  gr
1 0 0 1 382.43 669.225 Tm
(\000C\000O\000O\000K\000P\000I)Tj           %  amming
0 Tc
1 0 0 1 60 609.225 Tm
(\000\))Tj                                   %  G
1 0 0 1 91.7 609.225 Tm
(\000W\000K\000F\000G)Tj                     %  uide
ET  

There is no white space in any of the Tj text drawing operations; the gaps between words come only from shifts in the drawing position set by Tm.
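To make those shifts visible as numbers, here is a minimal Python sketch (not CGPDFScanner code) that reads the Tm offsets and the Tj string from a few of the operations above and computes how far each run starts from the previous one. The names `text_runs` and `x_advances` are made up for illustration, and the regex-based parsing is an assumption that only holds for a simple stream like this one, where every Tj directly follows a Tm; a real content stream needs a proper tokenizer.

```python
import re

# A few operations copied from the content stream above.
STREAM = r"""
1 0 0 1 60 669.225 Tm
(\0006)Tj
1 0 0 1 83.527 669.225 Tm
(\000J\000T)Tj
1 0 0 1 125.631 669.225 Tm
(\000G\000C\000F\000K\000P\000I)Tj
1 0 0 1 273.395 669.225 Tm
(\0002)Tj
"""

NUM = r"(-?\d+(?:\.\d+)?)"
# Six numeric operands and Tm, followed by one literal string and Tj.
PATTERN = re.compile(
    r"\s+".join([NUM] * 6) + r"\s+Tm\s*(\((?:\\.|[^\\()])*\))\s*Tj"
)


def text_runs(stream):
    """Return (x, y, raw_string) for each 'Tm ... Tj' pair (hypothetical helper)."""
    runs = []
    for m in PATTERN.finditer(stream):
        # Operands 5 and 6 of Tm are the translation; this assumes the
        # matrix is a pure translation (1 0 0 1 x y), as in the sample.
        x, y = float(m.group(5)), float(m.group(6))
        runs.append((x, y, m.group(7)))
    return runs


def x_advances(runs):
    """Horizontal distance from the start of each run to the start of the next."""
    return [round(b[0] - a[0], 3) for a, b in zip(runs, runs[1:])]


runs = text_runs(STREAM)
for run, advance in zip(runs, x_advances(runs) + [None]):
    print(run, advance)
```

The advance from the start of the "eading" run to the "P" run is much larger than the others, but that number alone does not prove there is a space: it also includes the width of the six glyphs drawn in between, which has to come from the font metrics.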

mkl
  • So how should I separate the words by spaces? Or how should I detect a space using Tm? – Swaroop Nov 20 '14 at 15:15
  • Basically I want the text data separated into words by spaces. What is the way to achieve this? – Swaroop Nov 20 '14 at 15:24
  • Unfortunately I don't know the CGPDFScanner well. Essentially you'll need the width of the string drawn by a **Tj** operation. Having that, you can calculate whether the following **Tm** operation moves just a little bit (kerning) or a lot (space). – mkl Nov 20 '14 at 16:58
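The idea in the last comment can be sketched roughly as follows, in Python rather than against CGPDFScanner. Everything here is an assumption for illustration: the `Run` type, the `join_runs` helper, the 0.2 em threshold, and above all the widths in the sample data, which are invented; real widths have to be computed from the font's width information and the font size. The x and y values are the Tm offsets from the answer's listing.

```python
from dataclasses import dataclass


@dataclass
class Run:
    x: float      # x offset set by the preceding Tm
    y: float      # y offset set by the preceding Tm
    width: float  # rendered width of the Tj string (from font metrics)
    text: str     # decoded text of the Tj string


def join_runs(runs, font_size, space_ratio=0.2):
    """Concatenate runs, inserting a space wherever the horizontal gap is large."""
    if not runs:
        return ""
    # Assumed cutoff between a kerning adjustment and a real word gap.
    threshold = space_ratio * font_size
    out = [runs[0].text]
    for prev, cur in zip(runs, runs[1:]):
        if abs(cur.y - prev.y) > 0.5 * font_size:
            out.append("\n")  # different baseline: treat as a line break
        elif cur.x - (prev.x + prev.width) > threshold:
            out.append(" ")   # large jump past the end of the previous run
        out.append(cur.text)
    return "".join(out)


# Positions from the sample stream; the widths are hypothetical.
runs = [
    Run(60.0, 669.225, 23.5, "T"),
    Run(83.527, 669.225, 42.1, "hr"),
    Run(125.631, 669.225, 135.0, "eading"),
    Run(273.395, 669.225, 24.8, "P"),
    Run(298.272, 669.225, 15.3, "r"),
]
print(join_runs(runs, font_size=50))
```

With these made-up widths, only the jump from "eading" to "P" exceeds the threshold, so a single space is inserted there. The threshold is a tuning knob; a value derived from the font's actual space glyph width would likely be more robust than a fixed fraction of the font size.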