How to extract text with formatting from PDF or XPS using C#?

I have some PDF/XPS files generated by another reporting application. The files mainly contain tables of data.

iText can extract the text from the PDF files, but the layout is lost. For example, for the table below:

[image: the table as rendered in the PDF]

the extracted text is:

Faults
Count FMI Lookup Code Description Component Status
Active Body Controller Heating Ventilation/Air Conditioning (HVAC) Control 
Head Air Inlet DM1.  HVAC motor in wrong position or 
jammed
SPN 3984 2 126
Active Engine SAE - Catalyst 1 System Monitor - Root cause not known SID 380 11 N/A
Inactive Engine SAE - Crankcase Pressure - Data valid but above normal 
operational range - Most severe level
PID 101 0 N/A
Inactive Engine SAE - Crankcase Pressure - Data erratic, intermittent or 
incorrect
PID 101 2 N/A

The problem is that text from different columns ends up on the same line, which makes it almost impossible to know which text belongs to which column. Unfortunately, I need to save the data from each column to a different field in a database.

I also tried converting the PDF to HTML, but the resulting HTML renders the text as SVG instead of including the actual text, so I could not extract it that way either.

Is there a way to do this using C#? Any suggestions or libraries, preferably free ones?

Thanks

urlreader
  • Questions seeking recommendations for libraries or other resources are off-topic for Stack Overflow. There are numerous options available for extracting text from PDF, but all are limited to some degree by the specifics of the PDF format itself, which is a _layout_ based format (i.e. it's in essence just the Postscript that would be generated when printing a page), which means that there's no requirement that the text as found in the file bears any resemblance to the logical arrangement a human would use when reading the text. ... – Peter Duniho Jul 14 '20 at 16:43
  • ... XPS has a similar limitation; while it's a completely different format (it's essentially a .zip file, so if you're curious you can change the extension to .zip and poke around in it), it's also layout based and so while text often can be extracted, there's no way to guarantee the text is arranged in the visual order a human sees it when rendered. Sophisticated extraction libraries attempt to use the layout itself to guide how the text is extracted, but there will always be exceptions to what they can handle. – Peter Duniho Jul 14 '20 at 16:43
  • appears these PDFs are software generated, meaning you can create lists to check against the data as you parse. For instance, Status is likely always either Active or Inactive, you could build a list for all columns except description. Once you parse them all, you'll be left with the description text. Lot of extra work but it's doable. – Mikael Jul 14 '20 at 16:46
  • Hi, Mikael, that's what I tried, but realized that these columns do not have a 'format', for example, status may have N/A, Not Available, etc. It could be one word, 2 words, or even longer. This makes it very difficult since I do not know what these values will be. Thanks. – urlreader Jul 14 '20 at 16:53
  • So isn't there a list of possible predefined values for 'status'? (no matter what they particularly are) – CSDev Jul 14 '20 at 17:24
  • no. this is also true for Component column. do not know what they can be. Could be one word, 2 words, or more. this makes difficult to know where to break the line for columns. But I think PDF should have something in it to know this. Just I do not know how to do it. – urlreader Jul 14 '20 at 17:39
  • No, I mean not number of words, but exact words. Looks like those statuses are from some specification, and number of statuses (not number of words in an every single status) is limited. – CSDev Jul 14 '20 at 17:47
  • no, do not know what they can be. I have seen: Active, Inactive, Active Commission, N/A, Not Available. Not sure whether has others. – urlreader Jul 14 '20 at 17:52
  • Can you find out from your customer? It would solve your problem. – CSDev Jul 14 '20 at 17:57
  • I think it is a table format. First read https://stackoverflow.com/questions/7513209/using-locationtextextractionstrategy-in-itextsharp-for-text-coordinate/7515625#7515625 then read https://stackoverflow.com/questions/6882098/how-can-i-get-text-formatting-with-itextsharp – LDS Jul 14 '20 at 18:24
  • @LDS Remember that 'iText' 'LocationTextExtractionStrategy' can treat even a whole word as two independent text chunks. So one needs extra work to handle that. – CSDev Jul 14 '20 at 20:02
  • @urlreader I was just trying to copy a *row* from a "table" in a PDF document when I realized the selection runs across *columns*. And not *all* of the table columns either. Copying what looked like 4 columns ended up as 2 rows of field pairs. PDF simply isn't meant for data exchange, it's a repackaged *printer* format (Postscript specifically) – Panagiotis Kanavos Jul 16 '20 at 12:19

2 Answers

You can extract formatted text using Docotic.Pdf (disclaimer: I am the co-author). Here is the basic sample code:

using System.IO;
using BitMiracle.Docotic.Pdf;

using (var pdf = new PdfDocument("your_document.pdf"))
{
    // Extract the text with its layout kept, so that columns stay visually separated
    string formattedText = pdf.GetTextWithFormatting();
    File.WriteAllText("formatted.txt", formattedText);
}

Sample result: [image]

After that, you can detect columns by whitespace. For example, treat a run of three or more whitespace characters as a column separator.
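A minimal sketch of that whitespace rule, using only the standard .NET regex classes (nothing here comes from Docotic.Pdf). The `ColumnSplitter` name and the sample line are made up for illustration, and the threshold of three is simply the value suggested above:

```csharp
using System;
using System.Text.RegularExpressions;

static class ColumnSplitter
{
    // One or two whitespace characters count as an ordinary word gap;
    // three or more are treated as a column boundary.
    private static readonly Regex Separator = new Regex(@"\s{3,}");

    public static string[] SplitColumns(string line) =>
        Separator.Split(line.Trim());
}

static class Demo
{
    static void Main()
    {
        // Hypothetical line, shaped like layout-preserving extraction output
        string line = "Active     Engine     SAE - Catalyst 1 System Monitor     SID 380     11     N/A";

        foreach (string cell in ColumnSplitter.SplitColumns(line))
            Console.WriteLine($"[{cell}]");
    }
}
```

Note that this simple split cannot represent an empty cell (the gap just becomes wider), and a description that wraps onto a second line still has to be merged with the row above it.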

You can find other text extraction techniques in this article. For example, these methods might be useful too:

  • extract text from a specific area (useful if the page contains different tables or a mix of regular text with tables)
  • extract detailed information about every text chunk (useful if you want to build custom table detection logic)
  • extract text with vector paths (useful if you want to respect table borders in your custom table detection algorithm)
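If you go the custom table detection route, the core idea can be sketched roughly as follows. Everything in this sketch is an assumption made for illustration: `TextChunk` is a hypothetical type standing in for whatever positional data your extraction library reports, Y is assumed to grow downward, and the column left edges are assumed to be known already (for example, taken from the header row).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk type: text plus the position of its left edge and its line.
record TextChunk(string Text, double X, double Y);

static class TableBuilder
{
    private const double Slack = 0.5; // tolerance for chunks starting slightly left of a column edge

    // columnLefts: left edge of each column in ascending order (assumed known).
    public static List<string[]> BuildRows(
        IEnumerable<TextChunk> chunks, double[] columnLefts, double rowTolerance = 2.0)
    {
        // 1. Bucket chunks into visual lines by their Y coordinate.
        var lines = new List<List<TextChunk>>();
        double lineY = 0;
        foreach (var chunk in chunks.OrderBy(c => c.Y))
        {
            if (lines.Count == 0 || Math.Abs(chunk.Y - lineY) > rowTolerance)
            {
                lines.Add(new List<TextChunk>());
                lineY = chunk.Y;
            }
            lines[^1].Add(chunk);
        }

        // 2. Read each line left to right and put every chunk into the column it starts in.
        var rows = new List<string[]>();
        foreach (var line in lines)
        {
            var cells = Enumerable.Repeat("", columnLefts.Length).ToArray();
            foreach (var chunk in line.OrderBy(c => c.X))
            {
                int col = Math.Max(0, Array.FindLastIndex(columnLefts, left => left <= chunk.X + Slack));
                cells[col] = cells[col].Length == 0 ? chunk.Text : cells[col] + " " + chunk.Text;
            }
            rows.Add(cells);
        }
        return rows;
    }
}

static class Demo
{
    static void Main()
    {
        // Made-up positions for one table row; real values would come from the extraction library.
        var chunks = new[]
        {
            new TextChunk("Active", 20, 100),
            new TextChunk("Engine", 80, 100.4),
            new TextChunk("SAE - Catalyst 1 System Monitor", 160, 100),
            new TextChunk("SID 380", 400, 99.8),
        };
        double[] columnLefts = { 20, 80, 160, 400 };

        foreach (var row in TableBuilder.BuildRows(chunks, columnLefts))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

A cell whose text wraps onto several lines comes out as several rows here, so a real implementation would still need to merge continuation lines, for instance by using the table borders.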
Vitaliy Shibaev
  • Spent hours trying to use iText7 to no avail. All of the text was being reported out of order and in different lines than what was on the pdf page. Docotic.Pdf reads the text right on every line. Thanks! – Mighty Ferengi Oct 14 '22 at 20:36

If you know all possible values for 'Status' and 'Component', and all possible 'Lookup Code' prefixes, you can use the following approach. Every entry is structured as 'Status-Component-Description-LookupCode-FMI-Count'. Add an entity:

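// Namespaces needed by the Parser and Program classes below
using System;
using System.Collections.Generic;
using System.Linq;

class Fault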
{
    public string Count { get; set; }
    public string FMI { get; set; }
    public string LookupCode { get; set; }
    public string Description { get; set; }
    public string Component { get; set; }
    public string Status { get; set; }

    public override string ToString() =>
        $"Status: {Status}; Component: {Component}; Description: {Description}; LookupCode: {LookupCode}; FMI: {FMI}; Count: {Count}";
}

And map your text input this way:

class Parser
{
    private static readonly IReadOnlyList<string> statuses = new[]
    {
        "Active",
        "Inactive"
        // etc
    };

    private static readonly IReadOnlyList<string> components = new[]
    {
        "Body Controller",
        "Engine"
        // etc
    };

    private static readonly IReadOnlyList<string> lookupPrefixes = new[]
    {
        "SPN",
        "SID",
        "PID"
        // etc
    };

    public static IEnumerable<Fault> Parse(string str)
    {
        var lines = str.Split(Environment.NewLine).Skip(2);
        foreach(var group in GetGroups(lines))
        {
            var words = group.SelectMany(line => line.Split()).ToList();

            var i = 1;
            string status = default;
            while (!statuses.Contains(status = string.Join(' ', words.Take(i)))) i++;
            words = words.Skip(i).ToList();

            i = 1;
            string component = default;
            while (!components.Contains(component = string.Join(' ', words.Take(i)))) i++;
            // Skip every word of the matched component (it may be more than one word),
            // then reverse so that the trailing fields (Count, FMI, Lookup Code) come first.
            words = words.Skip(i).Reverse().ToList();

            string count = words[0];

            string fmi = words[1];
            words = words.Skip(2).ToList();

            i = words.FindIndex(word => lookupPrefixes.Contains(word)) + 1;
            string code = string.Join(' ', words.Take(i).Reverse());

            string description = string.Join(' ', words.Skip(i).Reverse());

            yield return new Fault
            {
                Status = status,
                Component = component,
                Description = description,
                LookupCode = code,
                FMI = fmi,
                Count = count
            };
        }
    }

    private static IEnumerable<IEnumerable<string>> GetGroups(IEnumerable<string> lines)
    {
        var list = new List<string> { lines.First() };

        foreach (var line in lines.Skip(1))
        {
            if(statuses.Any(status => line.StartsWith(status)))
            {
                yield return list;

                list = new List<string>();
            }
            list.Add(line);
        }

        yield return list;
    }
}

Then you can use it:

class Program
{
    private static readonly string input =
        @"Faults
Count FMI Lookup Code Description Component Status
Active Body Controller Heating Ventilation/Air Conditioning(HVAC) Control
Head Air Inlet DM1.HVAC motor in wrong position or
jammed
SPN 3984 2 126
Active Engine SAE - Catalyst 1 System Monitor - Root cause not known SID 380 11 N/A
Inactive Engine SAE - Crankcase Pressure - Data valid but above normal
operational range - Most severe level
PID 101 0 N/A
Inactive Engine SAE - Crankcase Pressure - Data erratic, intermittent or
incorrect
PID 101 2 N/A";

    static void Main()
    {
        new Program().Run();
    }

    private void Run()
    {
        foreach (var result in Parser.Parse(input))
            Console.WriteLine(result);
    }
}

and get:

Status: Active; Component: Body Controller; Description: Heating Ventilation/Air Conditioning(HVAC) Control Head Air Inlet DM1.HVAC motor in wrong position or jammed; LookupCode: SPN 3984; FMI: 2; Count: 126
Status: Active; Component: Engine; Description: SAE - Catalyst 1 System Monitor - Root cause not known; LookupCode: SID 380; FMI: 11; Count: N/A
Status: Inactive; Component: Engine; Description: SAE - Crankcase Pressure - Data valid but above normal operational range - Most severe level; LookupCode: PID 101; FMI: 0; Count: N/A
Status: Inactive; Component: Engine; Description: SAE - Crankcase Pressure - Data erratic, intermittent or incorrect; LookupCode: PID 101; FMI: 2; Count: N/A

The solution can be optimized further.
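One thing worth hardening first: the `while` loops in `Parser.Parse` never terminate if a row starts with a status or component that is not in the lists, because `words.Take(i)` stops growing once `i` passes the word count. A minimal sketch of a bounded replacement follows; the `PrefixMatcher` helper is a hypothetical addition, not part of the code above, and the sample values are the statuses mentioned in the question's comments.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PrefixMatcher
{
    // Returns how many leading words form a known value, or 0 if none match.
    // Longer candidates are tried first, so "Active Commission" wins over "Active".
    // Assumes knownValues is non-empty.
    public static int MatchLength(IReadOnlyList<string> words, IReadOnlyCollection<string> knownValues)
    {
        int maxWords = knownValues.Max(value => value.Split(' ').Length);

        for (int n = Math.Min(maxWords, words.Count); n >= 1; n--)
        {
            string candidate = string.Join(" ", words.Take(n));
            if (knownValues.Contains(candidate))
                return n;
        }
        return 0; // unknown value: the caller can log the row or skip it
    }
}

static class Demo
{
    static void Main()
    {
        var statuses = new[] { "Active", "Inactive", "Active Commission", "N/A", "Not Available" };

        var known = "Active Commission Engine SAE - Catalyst".Split(' ');
        var unknown = "Pending Engine SAE - Catalyst".Split(' ');

        Console.WriteLine(PrefixMatcher.MatchLength(known, statuses));   // 2
        Console.WriteLine(PrefixMatcher.MatchLength(unknown, statuses)); // 0
    }
}
```

A return value of 0 gives the parser a chance to report the unexpected row and add the new value to the list, rather than hanging.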

CSDev