2

I am trying to write a program that will parse PDF files. From what I've read so far, PDF files are a combination of text data and binary data.

For example, within the PDF, there are "stream" objects which start with the word stream(newline), then contain the necessary binary data, and then end with the word endstream(newline)

Here is an example:

677 0 obj
<</Length 2821
/Filter /FlateDecode>>
stream
xœ…ZÛnG}÷WLô’, Oú~AÙ 6—AGbKœ5Éf†¦õ™ù£=M™dŸÖX›‡ÄÏô¥êTÕ©j¸yLœ“·¡5J»æfõæÇ_T#M+ðOssÿ懫÷ëë57ÿ}óï›7nÞH[c¼£o~Ø»©Oûñúª„FÝ-yùÆýöv“VP9Ù¯ª5çu*1!´R‡j±iجú48­}k…¶ŒûeìvwC?•71B´2hÁÈ~WBpYeUµØôØÏåÑŒµ­¦Zh¸/!Þ´ÿbȺßNis_Ê‹¯1²u'ÜÕɼmiÞ³óþp†ÞtŸRs×éúª™º~Õ¼»-.Þ\ͧ_¾[\E„VœWùã¾9¬»ù]³JÝØüÖ6«ïMãí¦{:-û¡ë§´MÝþËu³î6÷M¿›‡/ý]7§U»°¾‹²U—S’o3§œ”A›4ÏäZi=h"
Ãëþn]E¨V™(U‘IùÖŽ0wÃ~d2)£¡w;ö»‡©äœÒ­v±:ÿ<”ìϨÍk§ÒÎã'_üaøœÆÝPËhÑ*ªã3áLh«OÞ•q?Œó˜¦òvÙ@ÇX½ŸXë@ó"6·iî·ijº]3Œ«´L_¯^˜þ;ø¶yjºÍ¡{š–ñFtÿyhvÃü¼üÔôÓ´OÍý8l˜¶Ùö»~‡¦w‹!n.Œÿ;!Fö»Õ°¿ƒ7O?5·ûùŸÅï„/˜ü~Wf2©Tk£Ê¡D<m”Ùò%ˆè"½k~® ë4¥n…õ–Qç[‘m«•1Œ\•[ªèZc£bŒ@DW­4¦²âf—÷ëvUzU^ÅW(1pC1&/ÔÍû‘k2åK$QÞ
äØ(Íkžˆj£¸õÂy‘!Ö낚¦qÝ=–V…)àHïø4ìÉ“fµŠ1]NFßjªuÆþa=—©Ti‹_fØï ù·Ü j¯ÉW[ƒ¤Dèïg
\m\®–Õ’Xð©ùØJ!"ƒæy!Ny   †Â0ð°NÄPÛi_3š¶…áÌkF;»ÒºK>˜úy†7©¶À™ZhÆÝ&$ºTÅ$2‹aXGŽeœ°ŒxR2µQ8Ï ¹ƒà)}£ŒUÃ~üÅðšÓœÞvoóJgðÉ™úSÖl‡~.K¢ÖÁã·/§Q.¬s‘÷(‡»Õ?ï×ßQÌzœ]W†ÿXÑÀj©j“—nÁZtå°€ê\mºrÊc7¾]âꂸd¶‰VhíeÐIUýT"] 2F~$À˜9OtzÊT©dÈ2Ä1¬'ʸ51VÊÛÈ0
¹ìÛÇ<mQç«}kJíA·^Cé&íº¢––­UºÚŒ#‹’c=c¨¼ÁЭ
ªÚmÚ?>¦‘ôwÖ­Á™j)ª§·SVTÛÍ\OMÈB>ÊoÉ„LïXm5=¦îÓ“ 犲ÜtÍý>—£,žWý®Û,ê‹Ð?—”«ÿ¤¹JBÀ8í©–h4KÈ1yìHŠIga²œƒJPˆTMBîÐ"ƒ²Í(ÁÙB Z*Ë}jÌ4ì–£¡ÒŸRÎf¡8   ËîÌÔÐ*zÆŒ   Öå¶+÷"¾²)   $ƒZ(¤ª,Z¥(\Ř¯-cyt+Á²¶ ýNÜ0!^
Ò©«\äƒñªÐ©¥ÍæBxuì«Ð1MÍéëÉ*-ꃴwQ¨¿¦ãg«´ÝcZ-vqFÇB›â‹¾8ŒÃîá:+âÝò6Puæþ:²¢ ô    ‰®¤¬ÜÂA`ZgRW4¿R=>{™?áJvº-Q×å‡ó”ŠÐ,ê“0SJe²×(e‰šAµD²2<ƒnÓ]·'1n Y<Šã`tÖ±¨üèP-£hC“S¬AÕã£çnjÍÚæLJa/¤L_汎‰ŠTœD4(­MhwHŸÓdÞ¹„`W*Iˆ(BÁ m—›uR«?G(vÂ}¤ˆoTu$Ú-d%XŸº¨¹Öæø%Ðã0M=„å.êr9Â’qÕ©"!óP—r zHëcoC˜‰„œ#„÷ÕÁO‰aÁÓ…ìœT¸ÈK´?ZC2†CNÇPºÂäüP¢‚D`¢s 3á­2yVp”¢zÉæ1A«¾&‡ú°ŒADSÉ*V¨W a(QÇnt´(^½r=#sÛQ[ ê]uÓðliÀ_\>¾²Yî[…Õ•-™(gÿgœd¹ÍßÁÜ
Dʺß.–!0¼T„¿òñÞ5ó·é¹24ÏÍй½kr@4ȕ۩»œÁPxÆÕòòº˜g^ý   Ø@<‘‚7†q[j†¤ËÁ”‰Ë˜1mÊ W(Ðp‚a]¹›½ÁLÅõãa¿éÆ~~j)k {mý‰‰óë<ª!L)-“¬-1¯‡=ºwÊ*ÐJžÏÆYÄÌš†
ñDIEäddÈj?(g9ç+3°|Šh=¬¯ýÒ}Zê°µ*fÑy–w;¥ñs7÷î9N_Á§ØÒôÛÇaœ»Ý¼LQΩÿ ¦RÊÜ¡¢ÈèñåÄÅç–IjÆUè_´(‘1œ”mÔÎ2æ8q¡'Œ,€µ·Õá§¹Ûf~®êì¤cuƒ‹TnÉÑÐ!T·¨†âÆ"•! 1(Ϫ¾Ÿ*Oc
ÃŽóÐWæW”åN3{    Ý#¦î·¤þX«4qœÍ%dkŒh} Ã…¨Fá(.&r0äyÐK1•¼¯®V•tåp꜅ùD¨UŸy¶¢PÕÀB‚Uó«ã$_[kÇ“|ü…~ÅÞ:äÒªëq#ód\Ô>yÙHç¤cåê³€×¶òȘР—Úü"¦_x…my¦’燤€Ïýfqb®lñ,põ±ÚfÕ‚^.0¦
h¸XgyG˜uGƒw¨[pN0&¿T¥²$)ˆ˜‡-BJ1d»ç'&„¦³º:4±W9„xN°áW¨˜cIVg~Ù)Z…ÜWW‡>"þY§-Oç"®øâüTð‹‹²:}Õùçy»ñõùéliø¦tñˆóõU07»ý=„Éôü<±Âÿ|:—‰»´X'rn¾$E½ S##CèC*(\¡=C¨ÌJÔP‹ÁŠåÒç­¬b&hÆFl;Ö,yêcXÚå_°ÎU—‚¶Û·Äô°1ãhˆî:U[ð¶ŸÖÃ#  4˜6ÆÕ½úo¹…ö_˜\æYëð‡Ü±wÛªˆ…£Ô„ÔlUí0˜ˆ2\DÈ-qå²´[z;W¢xØK_îöSž¿/Ž>d(ßò~&&A¢÷7ŒaßIÅ
aò,8Ñù3W|Îä„[ž£)Èa¡Á<ÂÖ
=ŒTÊ2ˆt)Œàp6F¬;.xÇq‘¯ìph§½Çœ®ÄÇC˜¯£6Ö+(E&fy7t•Ã%ú¸P‡‰6ÿ­X;èã!ŽO7çE¶MwNÌÃ6­r4Ü&œÙ¨És¯§CžØ,&&iËçÀ¿‡}s@l=ÖëžÔî›™âÆ¡ì$/Gì‚“Ñ=1¨NTδ¨Õfµ"†F™¦´Û°ßÐÛ•Îi!û¹Dñ1ª,:gñb³±â¦·¼zîæÑ¤@2Š
UÎSZ™ÈçG©Ÿˆ|Ø0@Ìó©ÿmŸ<0(ŒâÁ•ÈIJrÊ&u¿6ä.-V^yñ x枢‡Ã¯QA³"ByÅ<&hÑ–˜çÄ
ã¨KbuþôÁÿ ŒÃï
endstream

(Note: There is also some header information before the "stream" word, telling you how long the binary data is, etc.)

What I have been doing is using a StreamReader to read the text data, and then I want to switch over and just read the binary data. Something like this:

using (FileStream fs = new FileStream(myFile, FileMode.Open))
{
   fs.Seek(250205, SeekOrigin.Begin);  // This jumps to the start of the data above
   using (StreamReader sr = new StreamReader(fs))
   {
      String line = "";
      while ((line = sr.ReadLine()) != "stream");   // This skips to the end of "stream"

      // I want this to read the 2821 bytes AFTER stream
      byte[] buffer = new byte[2821];
      fs.Read(buffer, 0, buffer.Length);
   }
}

When I run this, buffer does not contain the 2821 bytes after "stream". Instead, it contains data that is 1024 bytes AFTER the original Seek() position.

In other words: I Seeked to position 250205. Then, I read a few lines (a total of 57 bytes). But my buffer starts at position 251229 (inestead of 250262).

How can I get the FileStream to start reading where the StreamReader left off?

(My main reason for doing it this way is because FileStream does not have a ReadLine() method and StreamReader does not have a byte[] = Read() method.)

user807566
  • 2,828
  • 3
  • 20
  • 27
  • Please do yourself a favor and don't try to use a `StreamReader` with PDF files. Even if pdf contents partially look like a text format, the are not. Thus, you'll only shoot yourself in your foot this way. Study the pdf specification, existing PDF documents, and maybe also other pdf processing libraries. – mkl Nov 05 '13 at 22:33

2 Answers2

0

For performance issues, StreamReader doesn't read the exact amount of bytes you need into a Stream when you call a Read/ReadLine method.

It first reads a fixed amount of bytes, stores them into an internal buffer and then gets the bytes you need from that buffer.

When all data stored into the internal buffer have been read, a new Read from the same fixed amount is made and so on...

Of course this fixed amount matches the internal buffer size which could be according the msdn StreamReader constructor description:

This constructor initializes the encoding to UTF8Encoding, the BaseStream property using the stream parameter, and the internal buffer size to 1024 bytes.

So even if your ReadLine calls returned a sum of 57 bytes read, at least 1024 has been read from your FileStream by your StreamReader...

Among others, one solution could be to count the actual read bytes returned by your ReadLines calls (by summing line.Length into your loop). And then using this sum variable to Seek once again from the beginning of your Stream with fs.Seek(250205 + sum, SeekOrigin.Begin);.

That being said, i totally agree with what has been said in your question comments. Maybe you will get some surprises by using a StreamReader (which is designed to read text) to read a PDF file. For instance, will your line length property content always be what is expected, perhaps because of encoding issues?

AirL
  • 1,887
  • 2
  • 17
  • 17
0

If you need to get and set the real position from a StreamReader see this https://stackoverflow.com/a/22975649/718033

Community
  • 1
  • 1
Eamon
  • 1,829
  • 1
  • 20
  • 21