
A 200-page file was imported from PDF into a Word document. The text came out garbled, and I am trying to clean it up using VBA macros.

The issue is that the text looks like this:

CarrierCOM is a c a r rie r ’s c a r rie r into and ou t of Mexico. We provide a fu ll line of services including co-loca tion, private line, conversions, in te rconnections, c ro s s -b o rd e r services, in ternet, video conferencing and specialized services as required.

I need help removing the spaces that appear randomly inside words, so that the output looks like this:

CarrierCOM is a carrier’s carrier into and out of Mexico. We provide a full line of services including co-location, private line, conversions, interconnections, cross-border services, internet, video conferencing and specialized services as required.

Any help you could provide would be appreciated. Doesn't have to be VBA, could be any other programming language/technique/software.

  • Use better software for exporting from PDF to Docx? For example, see [this](http://stackoverflow.com/questions/8370014/how-to-convert-pdf-to-html) – Siddharth Rout Sep 12 '13 at 19:08
  • There's no easy programmatic way to do this. How would you distinguish between proper spaces between words and spaces between fragments of words? You'd have to accumulate fragments and match them against a dictionary; even that wouldn't guarantee an accurate result, as with e.g. "inter connections" vs. "interconnections" (both contain valid English words). Best bet is to find a better converter as Siddharth suggested. – Zebby Dee Sep 18 '13 at 15:49
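
The "accumulate fragments and match them against a dictionary" idea from the comment above could be sketched roughly like this in Python. It is only a sketch under assumptions: `VOCAB`, `merge_fragments`, `_known` and `max_run` are hypothetical names, the word list is a tiny hand-made one (a real run would need a full dictionary plus domain terms), and the greedy longest-match rule is one possible choice, not a guaranteed fix.

```python
# Hypothetical mini-vocabulary; a real run would load a full word list.
VOCAB = {
    "a", "and", "border", "carrier", "co", "cross", "full", "internet",
    "interconnections", "into", "is", "line", "location", "of", "out",
    "services",
}


def _known(candidate, vocab):
    """Return True if candidate is a known word.

    Edge punctuation and a trailing possessive 's are ignored, and a
    hyphenated candidate counts as known when every part is known.
    """
    key = candidate.lower().replace("\u2019", "'").strip(".,;:!?\"'()")
    if key.endswith("'s"):
        key = key[:-2]
    return all(part in vocab for part in key.split("-"))


def merge_fragments(text, vocab=VOCAB, max_run=10):
    """Greedily glue runs of whitespace-separated fragments into known words."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest run first, so "c a r rie r" wins over shorter joins.
        for n in range(min(max_run, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if _known(candidate, vocab):
                out.append(candidate)
                i += n
                break
        else:
            # No run of two or more fragments forms a known word: keep as is.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(merge_fragments("is a c a r rie r \u2019s c a r rie r into and ou t of Mexico."))
```

The caveat from the comment still applies: the sketch cannot tell a broken word from two real ones. With a vocabulary containing "a", "part" and "apart", the input "a part" would be merged into "apart", so the output would still need proofreading.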

1 Answer

-1

Use Ctrl-h (search and replace). First, replace ". " (without quotes) with ".$%", which will mark all of your end-of-sentence spaces. Second, replace " " with "" (i.e., replace all spaces with nothing). Third, replace ".$%" with ". " to put back the end-of-sentence spaces. There you go; you are a programmer.

I forgot to say that during each replace, you have to choose ReplaceAll. Also, start from the beginning of the document.

dmm
  • Umm.. what about the spaces between words and after commas? Your script would remove those, so you'd end up with a bunch of gobbledegook with spaces only after periods. – Zebby Dee Sep 18 '13 at 15:35
  • Ahahahah! Brain fart! Sorry. Definitely deserved a "not useful" vote. – dmm Sep 18 '13 at 19:53
  • So far I have used OCR engines from Nuance, Abbyy, OpenOCR, and MS OneNote, and none have worked well. Zebby Dee, clearly you know this stuff. Do you think maybe converting this file to XML or some other format might be a way out? – user2773882 Sep 20 '13 at 14:11