How to get numbers from a PDF when thousands are separated by spaces

Question

The PDF files contain quantity, price and sum columns. Different PDFs have different columns. In some PDFs, thousands are separated by spaces, for example:

Description       Price   Quantity          Sum
Soap           1 000.00        2.2     2 200.00
White 3 towel     10.00          2        20.00

How can I get the correct price and sum values? I tried iText 7:

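using System.IO;
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// GetPdfFileContents() is a placeholder for however the PDF bytes are obtained.
MemoryStream pdfStream = GetPdfFileContents();
StringBuilder processed = new();
pdfStream.Position = 0;
using var pdfDocument = new PdfDocument(new PdfReader(pdfStream));
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i) {
  var page = pdfDocument.GetPage(i);
  // Use a new strategy for each page: a reused instance keeps the text of earlier pages.
  var strategy = new LocationTextExtractionStrategy();
  string text = PdfTextExtractor.GetTextFromPage(page, strategy);
  processed.Append(text);
}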

It returns all words separated by a single space:

Soap 1 000.00 2.2 2 200.00
White 3 towel 10.00 2 20.00

Some rows contain values less than 1000 and some contain values greater than 1000, so it does not seem possible to get the correct values from the text alone. How can I get the distance between words in a row? If the distance is a single space, those words can be merged into one number.
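The merging idea above can be sketched independently of any PDF library. This is only a minimal sketch: it assumes each word of a row is already available with its left and right x-coordinates, sorted left to right. The `Word` record, the `RowMerger` class and the `maxGap` threshold are hypothetical names introduced for illustration, and a suitable threshold (roughly the width of one space in the row's font) would have to be tuned for real documents.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical type: one extracted word with its left and right x-coordinates (in PDF points).
public readonly record struct Word(string Text, float Left, float Right);

public static class RowMerger
{
    // Joins neighbouring words of one row into a single cell when the horizontal gap
    // between them is at most maxGap; a larger gap starts a new cell.
    // The row is assumed to be sorted left to right.
    public static List<string> MergeByGap(IReadOnlyList<Word> row, float maxGap)
    {
        var cells = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < row.Count; i++)
        {
            if (i > 0)
            {
                float gap = row[i].Left - row[i - 1].Right;
                if (gap > maxGap)
                {
                    // Wide gap: the previous cell is complete.
                    cells.Add(current.ToString());
                    current.Clear();
                }
                else
                {
                    // Narrow gap: same cell, keep the single space.
                    current.Append(' ');
                }
            }
            current.Append(row[i].Text);
        }
        if (current.Length > 0) cells.Add(current.ToString());
        return cells;
    }
}
```

With made-up coordinates where "1" and "000.00" are 3 points apart and the columns are at least 25 points apart, `RowMerger.MergeByGap(row, 4f)` would keep `1 000.00` together as one cell while `Soap` and `2.2` remain separate cells.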

I am using a .NET 7.0 ASP.NET MVC controller.

Update

I tried XpdfNet from the answer but got an exception:

System.IO.FileNotFoundException: Could not find file 'C:\myapp\bin\Debug\net7.0\5db7d64c-e1c5-4e1b-b14f-0162ce029c46.txt'.
File name: 'C:\myapp\bin\Debug\net7.0\5db7d64c-e1c5-4e1b-b14f-0162ce029c46.txt'
   at Microsoft.Win32.SafeHandles.SafeFileHandle.CreateFile(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.Strategies.FileStreamHelpers.ChooseStrategyCore(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize, Nullable`1 unixCreateMode)
   at System.IO.StreamReader.ValidateArgsAndOpenPath(String path, Encoding encoding, Int32 bufferSize)
   at System.IO.File.ReadAllText(String path, Encoding encoding)
   at XpdfNet.XpdfHelper.GetTextResult(XpdfParameter parameter)
   at XpdfNet.XpdfHelper.ToText(String pdfFilePath, String arguments)

Without the second argument,

string content = pdfHelper.ToText("C:\\a\\test.pdf");

it works, but produces a single-space-delimited result, just like iText.

Qiang Fu · Accepted Answer · 2023-05-04T07:10:19.050


I found a package, XpdfNet, that may help.

// requires: using XpdfNet;
[HttpGet("test")]
public void test()
{
    var pdfHelper = new XpdfHelper();
    // "-table" asks pdftotext to keep the column layout of the page.
    string content = pdfHelper.ToText("E:\\test.pdf", "-table");
    Console.WriteLine(content);
}
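Once the column spacing is preserved in the text, the rows can be split into cells and the numbers parsed. The following is only a sketch under assumptions: it presumes the `-table` output looks like the sample rows in the question (columns separated by two or more spaces, thousands separated by a single space, a dot as the decimal separator), and it drops blank lines. `LayoutTextParser`, `SplitRows` and `TryParseNumber` are hypothetical names, not part of XpdfNet.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class LayoutTextParser
{
    // Splits layout-preserved text into rows of cells. Blank lines are dropped,
    // and a run of two or more spaces is treated as a column boundary.
    public static List<string[]> SplitRows(string text) =>
        text.Split('\n')
            .Select(line => line.Trim())            // Trim also removes a trailing '\r'
            .Where(line => line.Length > 0)
            .Select(line => Regex.Split(line, " {2,}"))
            .ToList();

    // Parses a number such as "1 000.00" by removing spaces (and no-break spaces)
    // used as thousands separators. Returns false for non-numeric cells.
    public static bool TryParseNumber(string cell, out decimal value)
    {
        string compact = cell.Replace(" ", "").Replace("\u00A0", "");
        return decimal.TryParse(
            compact,
            NumberStyles.AllowDecimalPoint | NumberStyles.AllowLeadingSign,
            CultureInfo.InvariantCulture,
            out value);
    }
}
```

A description such as `White 3 towel` stays in one cell because its words are separated by single spaces, and `TryParseNumber` rejects it because the compacted text is not numeric. A description that consists only of digits and spaces would be misread as a number, so the column position should still be taken into account.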

output

edited May 04 '23 at 07:10

answered May 03 '23 at 13:49

Qiang Fu


I tried it but got a file-not-found exception. Without the second parameter the conversion works but produces a single-space-delimited result. I updated the question. – Andrus May 04 '23 at 06:25
@Andrus I encountered the exception when I used a wrong parameter. But I found a better parameter for PDFs: `-table`. I modified my answer to produce better output. You can look up the parameters here: http://www.xpdfreader.com/pdftotext-man.html – Qiang Fu May 04 '23 at 07:08
It worked, thank you. It looks like -layout produces the same output as -table. Why is -table better than -layout? Using more than one parameter throws FileNotFoundException. How can I use multiple parameters, like **ToText("C:\\a\\test.pdf", "-layout −nopgbrk -eol unix")** ? Currently there is an empty line between every line in the output. How can I remove the empty lines? How can I pass content from a stream to it without creating a file in the MVC controller? – Andrus May 04 '23 at 07:42
@Andrus Sorry, I'm also new to this package. "-layout" misparses some lines in my test file. Multiple parameters should work, as there is a sample command here: https://github.com/gqy117/XpdfNet, such as "-lineprinter can be used with -fixed and -linespacing". I think you may need to remove the empty lines manually. There seems to be no function for reading a PDF stream in this package. http://www.xpdfreader.com/support.html – Qiang Fu May 04 '23 at 08:10

In my PDF file, XpdfReader produces multiple rows from a single PDF row by placing the texts on different rows. https://stackoverflow.com/questions/24887784/itext-reading-pdf-like-pdftotext-layout/24911617#24911617 describes how to implement layout preservation using iText. It looks like I should use it instead of XpdfReader, since iText converts a PDF line to a single line of text and is pure .NET code. I created an iText DataExtractionStrategy which makes it easy to extract data from the converted text by separating every PDF chunk in the result text with a tag. Anyway, I marked your answer as the solution. – Andrus May 04 '23 at 08:41
@Andrus Thx, mate. I also tried to find a strategy configuration, but it does not seem easy. Good to know you got a solution. – Qiang Fu May 04 '23 at 08:56

I added **stringBuilder.Append('_')** to the LocationTextExtractionStrategy.GetResultantText() method to separate every token. Since the members used by this method are internal in iText, I had to copy and duplicate a number of iText internal classes. – Andrus May 04 '23 at 09:19
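A different route to a similar chunk-separated result, which does not need copies of iText internals, might be a custom event listener. This is a hedged sketch, not the approach described in the comment above: it relies on iText 7's `IEventListener`, `PdfCanvasProcessor` and `TextRenderInfo`, while `ChunkCollector` and `ChunkExtractor` are hypothetical names. It assumes unrotated pages, groups chunks into lines by rounded baseline height, and has not been checked against the PDFs from the question. Some PDFs emit one chunk per glyph, in which case chunk boundaries would not match word boundaries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical listener: records each text chunk with the start point of its baseline.
public sealed class ChunkCollector : IEventListener
{
    private readonly List<(string Text, float X, float Y)> chunks = new();

    public void EventOccurred(IEventData data, EventType type)
    {
        if (data is not TextRenderInfo info) return;
        Vector start = info.GetBaseline().GetStartPoint();
        chunks.Add((info.GetText(), start.Get(Vector.I1), start.Get(Vector.I2)));
    }

    // Only text rendering events are of interest here.
    public ICollection<EventType> GetSupportedEvents() =>
        new HashSet<EventType> { EventType.RENDER_TEXT };

    // Groups chunks into lines by rounded baseline height (top of the page first),
    // orders each line left to right, and joins its chunks with the separator.
    public string GetText(char separator = '_')
    {
        var lines = chunks
            .GroupBy(c => (int)Math.Round(c.Y))
            .OrderByDescending(g => g.Key)
            .Select(g => string.Join(separator.ToString(),
                                     g.OrderBy(c => c.X).Select(c => c.Text)));
        return string.Join("\n", lines);
    }
}

public static class ChunkExtractor
{
    // Runs the listener over one page (1-based page number) and returns the separated text.
    public static string Extract(PdfDocument pdfDocument, int pageNumber)
    {
        var collector = new ChunkCollector();
        new PdfCanvasProcessor(collector).ProcessPageContent(pdfDocument.GetPage(pageNumber));
        return collector.GetText();
    }
}
```

If a value such as `1 000.00` is written to the PDF as a single chunk, it would stay intact between two separators, so each line could then be split on `_` to obtain the cells.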
