Python to search within a PDF file

Question

Here is part of the PDF structure:

5 0 obj
<< /Length 56 >>
stream
BT /F1 12 Tf 100 700 Td 15 TL (JavaScript example) Tj ET
endstream
endobj
6 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /MacRomanEncoding
>>
endobj
7 0 obj
<<
/Type /Action
/S /JavaScript

I want to check whether "javascript" is present or not. The problem is that the name can be written with hex escapes, in whole or in part: "javascript", "Jav#61Script", "J#61v#61Script", and so on.

So how can I find out whether "javascript" exists, given all of these possibilities?

possible duplicate of [pyhton to search for javascript or hex representation](http://stackoverflow.com/questions/23021976/pyhton-to-search-for-javascript-or-hex-representation) — Jongware, Apr 11 '14 at 23:04

ooga · Accepted Answer · 2014-04-12T01:08:26.363


Read it in a character at a time and translate any hex you find to characters as you go, also translating to lowercase. Compare the result to "javascript".
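
As a rough sketch of that normalization for data that is already in memory (the helper names `normalize_pdf_bytes` and `contains_keyword` are made up for illustration and are not part of any library):

```python
import re

# Matches a PDF name escape: '#' followed by exactly two hex digits.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def normalize_pdf_bytes(data: bytes) -> bytes:
    """Decode every #xx escape to the byte it stands for, then lowercase."""
    decoded = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)
    return decoded.lower()


def contains_keyword(data: bytes, keyword: bytes = b"javascript") -> bool:
    """Return True if keyword appears in data after normalization."""
    return keyword.lower() in normalize_pdf_bytes(data)
```

For example, `contains_keyword(b"/S /J#61v#61Script")` is true because both `#61` escapes decode to `a` before the comparison. This sketch decodes escapes everywhere in the data, not only inside PDF names, so it can over-match.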

Here's an idea:

import os
import re

def pdf_find_str(pdfname, needle):
    """Return the (approximate) offset of needle in the file, or -1.

    needle is an ASCII string; #xx hex escapes in the file are decoded
    before a case-insensitive comparison.
    """
    # Read the file CHUNK_SIZE bytes at a time, keeping the last KEEP_SIZE bytes.
    CHUNK_SIZE = 2 * 1024 * 1024
    KEEP_SIZE = 3 * len(needle)  # each char might be in #ff form
    hexvals = b"0123456789abcdefABCDEF"  # hex digits may be upper or lower case
    target = needle.lower().encode("ascii")
    pattern = re.compile(re.escape(target), re.I)

    with open(pdfname, "rb") as f:
        ichunk = 0
        chunk = f.read(CHUNK_SIZE)
        while chunk:
            # Find every '#' and replace '#xx' with the byte it represents.
            hpos = chunk.find(b"#")
            while hpos != -1:
                digits = chunk[hpos + 1:hpos + 3]
                if len(digits) == 2 and all(d in hexvals for d in digits):
                    ch = bytes([int(digits, 16)]).lower()
                    if ch in target:  # only substitute bytes that can matter
                        chunk = chunk[:hpos] + ch + chunk[hpos + 3:]
                hpos = chunk.find(b"#", hpos + 1)

            m = pattern.search(chunk)
            if m:
                return ichunk * (CHUNK_SIZE - KEEP_SIZE) + m.start()

            # Carry the last KEEP_SIZE bytes over to the next round of
            # testing, since our string may span chunks.
            next_chunk = f.read(CHUNK_SIZE - KEEP_SIZE)
            if not next_chunk:
                break
            chunk = chunk[-KEEP_SIZE:] + next_chunk
            ichunk += 1

    return -1

# On one file:
# if pdf_find_str("Consciousness Explained.pdf", "javascript") != -1:
#     print('Contains "javascript"')

# Recursively on a directory:
for root, dirs, files in os.walk("Books"):
    for name in files:
        if name.endswith(".pdf"):
            position = pdf_find_str(os.path.join(root, name), "javascript")
            if position != -1:
                print(name, "(", position, ")")
# Note: position returned by pdf_find_str does not account for removed
# characters from #ff representations (if any).

edited Apr 12 '14 at 01:08

answered Apr 11 '14 at 22:26

ooga


You mean that when I search the whole file I should read every word, convert it to its character representation, and then compare it to "javascript"? Is that what you mean? – user3461464 Apr 11 '14 at 22:30
@user3461464 Hmmm, when you put it that way, it seems a little crazy. Let me think a bit. Is the word "javascript" guaranteed to be near the beginning? – ooga Apr 11 '14 at 22:33
Please, I would be very grateful; I really need it. – user3461464 Apr 11 '14 at 22:36
@user3461464 I have a feeling there's a much nicer solution (perhaps involving codecs?) but I've edited my answer with an idea. – ooga Apr 11 '14 at 22:54
@user3461464 I fixed a bug that I noticed in calculating the hex values. See revised code above. – ooga Apr 11 '14 at 23:02
Thank you very much for your time and your help. I think I need time to read and understand it, but why did you choose 1000 bytes to read at a time? – user3461464 Apr 11 '14 at 23:06
@user3461464 It's an arbitrary number. Should probably be higher. I've generalized it and set it to 25000 now (see above). – ooga Apr 11 '14 at 23:13
It's nice. I can use it when I want to determine the version of a PDF, since that is located within the first 1024 bytes of the file. But could I use it like this: chunk = f.read()? – user3461464 Apr 11 '14 at 23:21
@user3461464 Actually, the idea is to read that many characters at a time, but go through the whole file. However, even with the latest updates to the code above, it doesn't really seem to work! Or at least it's extremely slow. – ooga Apr 11 '14 at 23:27
Thank you again for your help and time; I appreciate it very much. – user3461464 Apr 11 '14 at 23:30
@user3461464, Check out the latest version above. It actually seems to work now! – ooga Apr 11 '14 at 23:41
I will run it and try it on many samples. Thank you very much for your help and kindness. I'm working on a project, so would it be possible to contact you, in whatever way you prefer, if I need help and it doesn't bother you? – user3461464 Apr 11 '14 at 23:49
@user3461464 I found and fixed another little bug running it recursively on my pdf collection. Only a handful have "javascript" in them. Unfortunately it's not possible to contact me, but post here anytime. – ooga Apr 12 '14 at 00:11
@user3461464 Updated code to work better (finds more files using regex for case-insensitive compare) and to show the (possibly approximate) position of the string. – ooga Apr 12 '14 at 01:09
