I'm using iTextSharp to read data from a PDF. The resulting string looks correct when inspected, but string.Replace fails to replace the text.

Therefore, I'm guessing this is some sort of encoding issue, but I'm failing to pin it down.

My code to import the text from the PDF should convert it to UTF-8:

    PdfReader pdfReader = new PdfReader("file.pdf");

    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        text.AppendLine(currentText);
    }
    pdfReader.Close();

Then I am trying to replace two hyphens, a space and a hyphen (-- -) with three hyphens (---):

input = input.Replace("-- -­", "---");

Removing the UTF-8 conversion from the PDF import does not make a difference (see the screenshot below, taken at a breakpoint after the replace call; the text is still there):

[Screenshot: the result of the string replace in the text visualiser]

EDIT:

Here is a link to a sample file. When opened in Notepad or Notepad++, it displays a series of spaces and hyphens (see the Notepad++ screenshot with whitespace rendering). However, when read in C#, the contents are not interpreted as Unicode hyphen and Unicode space characters.

Neil P
  • See [this](http://stackoverflow.com/a/10191879/231316) for why you should get rid of the entire line `currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert...`. That line is always, always wrong and at best does nothing and at worst destroys data. – Chris Haas Oct 13 '15 at 20:50
  • Thanks, but my String.Replace still fails. – Neil P Oct 14 '15 at 08:24
  • Please see edit, I have uploaded a sample txt file that demonstrates the issue. – Neil P Oct 14 '15 at 09:21

1 Answer

It turns out that either iTextSharp or the source PDF uses a soft hyphen (U+00AD) in place of a standard hyphen. Notepad, Notepad++ and the Visual Studio text visualiser all render the soft hyphen as a standard hyphen, but they are not the same character, which is why String.Replace does not perform any replacements.

From my understanding, a soft hyphen should not normally be rendered, which caused odd behavior when I tried to paste the character into a web browser or other programs such as charmap, or even Visual Studio itself.
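
One way to confirm this kind of mismatch is to print the code point of every character rather than trusting how the string renders. The helper below is a minimal sketch; the class and method names are made up for illustration and are not part of iTextSharp.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Formats each UTF-16 code unit as U+XXXX so that invisible or
    // look-alike characters (such as U+00AD, the soft hyphen) stand out.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

Calling `CodePointDump.Describe("-- -")` on a string typed from the keyboard yields `U+002D U+002D U+0020 U+002D`, whereas text containing soft hyphens would show `U+00AD` in those positions.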

This resulted in the following working code:

input = input.Replace("­­ ­", "---");

In Firefox, this renders as replacing a space with three hyphens. Pasted into Notepad, however, it displays as follows (which shows my real intention):

input = input.Replace("-- -", "---");
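
Because the soft hyphens in a literal like this are invisible in many editors and browsers, it may be safer to spell them out with Unicode escapes. This is only a sketch: it assumes the extracted pattern is two soft hyphens, a space and one more soft hyphen, and the class and member names are hypothetical.

```csharp
public static class SoftHyphenPattern
{
    // Assumed pattern: SOFT HYPHEN, SOFT HYPHEN, SPACE, SOFT HYPHEN,
    // written with explicit escapes so it survives copy and paste.
    public const string Broken = "\u00AD\u00AD \u00AD";

    // string.Replace(string, string) compares ordinally, so the soft
    // hyphens are matched exactly rather than being ignored.
    public static string Fix(string input)
    {
        return input.Replace(Broken, "---");
    }
}
```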

https://en.wikipedia.org/wiki/Soft_hyphen

Soft Hyphen: http://www.fileformat.info/info/unicode/char/ad/index.htm

Hyphen (standard hyphen): http://www.fileformat.info/info/unicode/char/2010/index.htm

My solution was to add the following line:

        input = input.Replace((char)173, '-');
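
The same idea can be wrapped into one small step: map soft hyphens to the keyboard hyphen first, then run the original replacement. Note that the `'-'` typed on a keyboard is U+002D (hyphen-minus), not U+2010. A minimal sketch, with hypothetical class and method names:

```csharp
public static class HyphenNormalizer
{
    // U+00AD is the soft hyphen; this is the same value as (char)173.
    private const char SoftHyphen = '\u00AD';

    public static string Normalize(string input)
    {
        // First turn every soft hyphen into an ASCII hyphen-minus,
        // then collapse "-- -" into "---" as originally intended.
        return input.Replace(SoftHyphen, '-').Replace("-- -", "---");
    }
}
```

Normalizing first means the later replacement can be written with ordinary, visible characters.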

tl;dr: Character encoding was absolutely fine; not all hyphens are equal.

Neil P