My app needs to process input from PDF files consisting (mostly) of text. I could do the parsing on my server, but I'd prefer not to. After exploring my options for text extraction, I found the PDFBox library and its Android port (https://github.com/TomRoush/PdfBox-Android).

In the app I show my users the standard UI for selecting the source document through ACTION_OPEN_DOCUMENT, then override onActivityResult to get the Uri. You know, the usual stuff.

The problem is that I can't figure out how to feed it to PDFBox, since we're dealing with "documents" rather than "files" and, as far as I can tell, the library wants a real file path. If I give it the path of a specific file, the text parsing works, but that's certainly not best practice and it can't be done for every document out there (cloud storage, etc.), so instead I do this:

InputStream inputStream = getContentResolver().openInputStream(uri);

and then read it line by line so in the end I can have it all in one string. Obviously, it works okay.
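One caveat about that step: a PDF is binary data, so collecting it line by line into a String can alter the bytes. Below is a minimal sketch of copying a stream as raw bytes instead, using only java.io. The class and method names (StreamBytes, readFully) are hypothetical, and a ByteArrayInputStream stands in for the stream that openInputStream(uri) would return.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class StreamBytes {
    // Hypothetical helper: copies the whole stream into memory as raw bytes,
    // without any charset decoding or line splitting.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for document content; includes bytes that are not valid text.
        byte[] sample = {'%', 'P', 'D', 'F', '-', (byte) 0xE2, (byte) 0xE3, 0x0D, 0x0A};
        byte[] copy = readFully(new ByteArrayInputStream(sample));
        System.out.println(Arrays.equals(sample, copy));
    }
}
```

The byte array keeps the content intact, including carriage returns and bytes outside any text encoding, which a line-based reader would normalize or mangle.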

But how do I actually feed this data into PDFBox so it can do its text extraction magic? I can't find any docs on how to do it when I don't have a "real file path".

Maybe there are better ways now? This library is quite old. Basically, I need to extract text from a PDF on the Android device itself, not through an API call. I'm really stuck here.

galloper
  • I really don't understand why you need an Android port, for something that doesn't require porting. And one cannot read binary files line by line, as they have no lines. There is no magic involved either ... – Martin Zeitler May 23 '20 at 09:08
  • @MikeM. check out the sample code section for "stripText" – galloper May 23 '20 at 09:21
  • https://github.com/TomRoush/PdfBox-Android/blob/master/library/src/main/java/com/tom_roush/pdfbox/pdmodel/PDDocument.java#L943 – Mike M. May 23 '20 at 09:22
  • @galloper Yeah, that's one that indicated that a `File` isn't necessary. Notice how they're loading the `PDDocument` directly from the `InputStream` obtained from `AssetManager`. – Mike M. May 23 '20 at 09:24
  • @MartinZeitler believe me it does require porting, otherwise the ported version wouldn't appear. I tried using the original before turning to this one. I get that I can read the stream then save as file, I was hoping to avoid it – galloper May 23 '20 at 09:25
  • That is a religion, because I use it without the least problem, so there is no need to believe. And it also has that method: https://pdfbox.apache.org/docs/2.0.13/javadocs/org/apache/pdfbox/pdmodel/PDDocument.html#load-java.io.InputStream- – Martin Zeitler May 23 '20 at 09:28
  • And generally speaking, unless this is PDF A/3 format (e.g. with an embedded cross-industry invoice, like ZUGFeRD), you might have to write an extra parser per document type, as there is no standard structure. Although I've reopened the question, because the dupe was wrongful, I still don't think one could answer that without a sample PDF. – Martin Zeitler May 23 '20 at 09:34
  • Thank you both guys, seems I overlooked this. Now it's working, though now I'm getting "Error getting header version" on my files, but it's progress :) – galloper May 23 '20 at 09:42
  • @MartinZeitler: 146 Java files in PDFBox 2.0.19 contain references to `java.awt`, and only 3 of those are for `java.awt.font` (the only tiny bit of AWT in the Android SDK). So while your use of PDFBox might avoid those classes, presumably other code that uses other PDFBox features would run into problems on Android. My understanding is that the Android port is mostly clearing up that sort of problem. – CommonsWare May 23 '20 at 11:07

1 Answer

I needed similar functionality for my app, so I tried the solution suggested by Mike M. in the comments under your question, and it worked great for me (so this is really his answer; I just confirmed that it works and supplied the code). Hope it helps.

The “magic” is actually in these two lines:

InputStream inputStream = this.getContentResolver().openInputStream(fileUri);
document = PDDocument.load(inputStream);

But for some context (and for anyone who finds this question later while looking for an answer to the same problem), here is the whole example code:

public class MainActivity extends AppCompatActivity {

    private static final int OPEN_FILE_REQUEST_CODE = 1;
    Intent intentOpenfile;
    Uri fileUri;

    TextView tvTextDisplay;
    Button bOpenFile;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        tvTextDisplay = findViewById(R.id.tv_text_display);

        PDFBoxResourceLoader.init(getApplicationContext());

        bOpenFile = findViewById(R.id.b_open_file);
        bOpenFile.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                intentOpenfile = new Intent(Intent.ACTION_OPEN_DOCUMENT);
                intentOpenfile.setType("application/pdf");
                startActivityForResult(intentOpenfile, OPEN_FILE_REQUEST_CODE);
            }
        });
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, @Nullable Intent data) {
        super.onActivityResult(requestCode, resultCode, data);
        if (requestCode == OPEN_FILE_REQUEST_CODE) {
            if(resultCode == RESULT_OK) {
                // data can be null if the picker returned no document
                fileUri = (data != null) ? data.getData() : null;
                if (fileUri == null) return;
                String parsedText = null;
                // try-with-resources closes both the stream and the document,
                // and skips text extraction if loading fails
                try (InputStream inputStream = getContentResolver().openInputStream(fileUri);
                     PDDocument document = PDDocument.load(inputStream)) {
                    PDFTextStripper pdfStripper = new PDFTextStripper();
                    // page numbers are 1-based; extract the first page only
                    pdfStripper.setStartPage(1);
                    pdfStripper.setEndPage(1);
                    parsedText = "Parsed text: " + pdfStripper.getText(document);
                } catch (IOException e) {
                    e.printStackTrace();
                }
                tvTextDisplay.setText(parsedText);

            }
        }
    }
}
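
If loading fails with something like the "Error getting header version" message mentioned in the comments, one possible cause (an assumption, not something confirmed in this thread) is that the stream does not start with a PDF header. Below is a minimal sketch of checking the leading bytes before handing the stream to the parser, using only java.io. The class and method names (PdfHeaderCheck, startsWithPdfHeader) are hypothetical, and a ByteArrayInputStream stands in for the stream from openInputStream(fileUri).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class PdfHeaderCheck {
    // PDF files conventionally begin with "%PDF-" followed by a version number.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Hypothetical helper: peeks at the first bytes and then rewinds the stream,
    // so the same stream can still be passed on to a parser afterwards.
    static boolean startsWithPdfHeader(BufferedInputStream in) throws IOException {
        in.mark(MAGIC.length);
        try {
            for (byte expected : MAGIC) {
                if (in.read() != (expected & 0xFF)) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfLike = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] other = "<html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithPdfHeader(new BufferedInputStream(new ByteArrayInputStream(pdfLike))));
        System.out.println(startsWithPdfHeader(new BufferedInputStream(new ByteArrayInputStream(other))));
    }
}
```

This only inspects the very first bytes, so a file with leading junk before the header would be rejected here even if a lenient parser could still read it. It is a quick diagnostic, not a validity check.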
sumo