Python Textract from URL

Asked Nov 23 '16 at 21:22

Active Nov 23 '16 at 21:29


I am using Textract in Python on a web server as part of an API. I want to POST a URL to the server and have Textract extract the text from the file at that URL (e.g. http://www.sample-videos.com/pdf/Sample-pdf-5mb.pdf).

When I try to post, I get a 502 proxy error in response, and my Python log shows:

textract.exceptions.MissingFileError: The file "http://www.sample-videos.com/pdf/Sample-pdf-5mb.pdf" can not be found.

Is this because Textract can't extract from remote files? If so, is there a workaround?

Thanks!

edited Nov 23 '16 at 21:29

Peter Wood

asked Nov 23 '16 at 21:22

jperry1147


Is Textract meant to be able to download from a website? I imagine you have to do that first with [something like `urllib`](https://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python) – Peter Wood Nov 23 '16 at 21:31
Thanks Peter. I think Textract can only process local files, and urllib pointed me down the right road. I will add an answer with my code. – jperry1147 Nov 30 '16 at 10:26
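
A minimal sketch of the workaround discussed in the comments, assuming Python 3 and that `textract.process` only accepts a local file path (which is what the `MissingFileError` above suggests): download the URL to a temporary file with `urllib`, then pass that path to Textract. The helper names `download_to_temp` and `extract_text_from_url`, the `extractor` parameter, and the 30-second timeout are illustrative choices; they are not part of Textract and this is not the asker's own code.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def download_to_temp(url, timeout=30):
    """Download `url` to a temporary file and return the local path.

    The file extension from the URL is kept, because textract chooses
    its parser based on the extension of the file name.
    """
    suffix = os.path.splitext(urlparse(url).path)[1]
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # delete=False so the file still exists after the handle is closed
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


def extract_text_from_url(url, extractor=None):
    """Fetch a remote document and run a text extractor on the local copy.

    `extractor` is a hypothetical hook that defaults to textract.process;
    it exists only so the download step can be exercised without textract.
    """
    if extractor is None:
        import textract  # third-party package, imported only when needed
        extractor = textract.process
    path = download_to_temp(url)
    try:
        return extractor(path)
    finally:
        # Always clean up the temporary copy, even if extraction fails
        os.remove(path)
```

With Textract installed, `extract_text_from_url("http://www.sample-videos.com/pdf/Sample-pdf-5mb.pdf")` should return the extracted text as bytes. Since the URL comes from a POST body, it is probably worth validating it (scheme, host, size limit) before downloading.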
