The answer to both your questions is: It depends.
- Suppose that you have a ZUGFeRD invoice. In that case, the invoice is a PDF/A-3 document that has an embedded file in the CII XML format. It is very easy to extract this XML and read it to get all the necessary information about the invoice. Embedding or attaching files that contain the source data used to create the PDF, or the same data in a form other than PDF, is a technique designed to make exactly what you need possible.
- You can extract text from a PDF. This is explained in questions such as "PDF text extraction using iText", but you only get the raw text without formatting. In many cases, a PDF consists of a bunch of text and lines put on a canvas at absolute positions. A word on the page does not know whether it is part of a sentence, part of a table cell, etc. Unless:
- If the PDF is a Tagged PDF, then the PDF also contains information about the structure of the content. For instance, the content will contain tags that indicate structures such as tables, table headers, table rows, and table cells. If you are talking about Tagged PDFs, then it's possible to extract the text in a structured way.
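To illustrate the first case: once the embedded XML has been pulled out of the PDF (with a PDF library of your choice; that step is not shown here), reading it is ordinary XML parsing. The following is a minimal Python sketch, not a complete ZUGFeRD reader. It assumes the CII namespaces and element paths used by ZUGFeRD 2.x / Factur-X; older ZUGFeRD 1.0 files use a different root element and namespaces. The sample invoice and the function name `read_invoice_summary` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Assumption: CII namespaces as used by ZUGFeRD 2.x / Factur-X.
NS = {
    "rsm": "urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100",
    "ram": "urn:un:unece:uncefact:data:standard:"
           "ReusableAggregateBusinessInformationEntity:100",
}

# Invented, heavily trimmed sample; a real invoice has many more elements.
SAMPLE_XML = """
<rsm:CrossIndustryInvoice
    xmlns:rsm="urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100"
    xmlns:ram="urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100">
  <rsm:ExchangedDocument>
    <ram:ID>INV-0001</ram:ID>
  </rsm:ExchangedDocument>
  <rsm:SupplyChainTradeTransaction>
    <ram:ApplicableHeaderTradeSettlement>
      <ram:InvoiceCurrencyCode>EUR</ram:InvoiceCurrencyCode>
      <ram:SpecifiedTradeSettlementHeaderMonetarySummation>
        <ram:GrandTotalAmount>119.00</ram:GrandTotalAmount>
      </ram:SpecifiedTradeSettlementHeaderMonetarySummation>
    </ram:ApplicableHeaderTradeSettlement>
  </rsm:SupplyChainTradeTransaction>
</rsm:CrossIndustryInvoice>
"""


def read_invoice_summary(xml_text):
    """Return invoice number, currency and grand total from a CII document."""
    root = ET.fromstring(xml_text.strip())
    settlement = "rsm:SupplyChainTradeTransaction/ram:ApplicableHeaderTradeSettlement"
    total = root.findtext(
        settlement
        + "/ram:SpecifiedTradeSettlementHeaderMonetarySummation/ram:GrandTotalAmount",
        namespaces=NS,
    )
    return {
        "number": root.findtext("rsm:ExchangedDocument/ram:ID", namespaces=NS),
        "currency": root.findtext(settlement + "/ram:InvoiceCurrencyCode", namespaces=NS),
        # Amounts are kept as Decimal to avoid floating-point rounding.
        "grand_total": Decimal(total) if total is not None else None,
    }


if __name__ == "__main__":
    print(read_invoice_summary(SAMPLE_XML))
```

Because the data comes from the XML rather than from the page content, none of the layout problems described in the second bullet apply.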
In the past, we did a project in which we received credit card statements from VISA, MasterCard, AmEx, and others. We had to extract all the expenses and store them as records in a database. We were able to achieve this because the format of the statements was predictable: all VISA statements are created alike, so we were able to find the pattern that allowed us to extract the data.
It goes without saying that we do not share the code we used to do this. The company that paid us for doing that project would not be pleased.
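The general idea of pattern-based extraction can still be sketched with a generic example that has nothing to do with that project. The Python sketch below assumes the raw text has already been extracted from the PDF and that every expense sits on one line as date, description, amount. That layout, the sample text, and the names `LINE_PATTERN` and `parse_statement` are hypothetical; a real statement format needs its own pattern.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical line layout: "DD/MM/YYYY  free-text description  123.45"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?\d+\.\d{2})$"
)

# Invented sample of raw text as it might come out of a text extractor.
SAMPLE_TEXT = """\
STATEMENT OF ACCOUNT
Date        Description            Amount
03/02/2015  BOOKSHOP GHENT          24.95
07/02/2015  RAILWAY TICKETS        112.00
Total                              136.95
"""


def parse_statement(raw_text):
    """Turn lines that match the expected layout into records."""
    records = []
    for line in raw_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # headers, totals and blank lines do not match
        records.append({
            "date": datetime.strptime(match["date"], "%d/%m/%Y").date(),
            "description": match["description"],
            "amount": Decimal(match["amount"]),
        })
    return records


if __name__ == "__main__":
    for record in parse_statement(SAMPLE_TEXT):
        print(record)
```

This only works as long as the layout stays predictable; a change in the statement format means the pattern has to be revised.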