
I am using Textract in Python on a web server as part of an API. I want to POST a URL to the server and have Textract extract the text from the file at that URL (e.g. http://www.sample-videos.com/pdf/Sample-pdf-5mb.pdf).

I get a 502 proxy error in response when I try to post, and my Python log shows:

textract.exceptions.MissingFileError: The file "http://www.sample-videos.com/pdf/Sample-pdf-5mb.pdf" can not be found.

Is this because Textract can't extract from remote files? If so, is there a workaround?

Thanks!

Peter Wood
jperry1147
  • Is Textract meant to be able to download from a website? I imagine you have to do that first with [something like `urllib`](https://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python) – Peter Wood Nov 23 '16 at 21:31
  • Thanks Peter, I think Textract can only process local files, and urllib pointed me down the right road. I will add an answer with my code. – jperry1147 Nov 30 '16 at 10:26
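
For anyone landing here, the approach discussed in the comments (download the file first, then hand Textract a local path) might look roughly like the sketch below. It is not the asker's actual code. The helper names `download_to_tempfile` and `extract_text_from_url`, the `extractor` parameter, and the 30-second timeout are all made up for illustration; the only Textract call assumed is `textract.process(path)`.

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def download_to_tempfile(url, timeout=30):
    """Download `url` to a temporary file and return the local path.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    # Keep the original extension (e.g. ".pdf"), since textract normally
    # chooses its parser based on the file extension.
    suffix = os.path.splitext(urlparse(url).path)[1]
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


def extract_text_from_url(url, extractor=None):
    """Fetch a remote file, run a text extractor on the local copy, clean up.

    `extractor` is a hypothetical hook: any callable that takes a file path.
    It defaults to `textract.process`.
    """
    if extractor is None:
        import textract  # imported lazily so the download helper works alone
        extractor = textract.process
    path = download_to_tempfile(url)
    try:
        return extractor(path)
    finally:
        # Remove the temporary copy whether or not extraction succeeded.
        os.remove(path)
```

In the API handler you would then call something like `extract_text_from_url(posted_url)` instead of passing the URL straight to `textract.process`. Since the server ends up fetching whatever URL the client sends, it is probably worth validating the URL before downloading.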
