Is it possible to read in and parse a .docx file that is linked to on a website without downloading the file (in Java, Python, or another language)?

Question

I'd like to write a program that parses an online .docx file to build an XML document. I know (or at least I think I know) that browsers need a plug-in to display a .docx file, but I'm not that familiar with plug-ins or how they work. After looking at a .docx file in Notepad++, it seems clear to me that I won't be able to parse the binary data directly. Is there a way to simulate opening the .docx file for my purposes (EDIT: that is, without downloading and saving the file to my hard drive) within the abilities of any language or library?

My question is more about the opening of the file without downloading than about the actual parsing of it, as I've looked into the Apache POI API for parsing the document in Java.

You need to get the file bytes in order to parse it. That _getting_ from a remote location is known as downloading. — Sotirios Delimanolis, Jun 19 '14 at 15:28
I should say without physically downloading the file and saving to my hard drive. — Matt Shank, Jun 19 '14 at 15:30
To put it shortly, no. However, you can download it as a temporary file (I know Java has `File.createTempFile()`) and consume no additional resources that way. — Unihedron, Jun 19 '14 at 15:31

score 4 · Accepted Answer · answered Jun 19 '14 at 15:30

Let me try to make this clear.

If you are viewing it, then you have downloaded it. You are "downloading" this webpage in order for your browser to render it. You're "downloading" a link to a document which tells you that there is a document. You cannot view the document unless you download it.

Yes, you have to download it.

Downloading a file is just getting it from the remote server.

Of course, you don't have to write it to your hard drive. You can download it and store it in memory, and then deal with it from memory.

Once you open a connection to the URL, you get an `InputStream` object from which to read the bytes. You can pass that stream directly to the Apache POI library to read the file.
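As a rough illustration of "download into memory, then parse", here is a minimal sketch. It does not use Apache POI: it relies on the fact that a .docx file is a ZIP archive whose main body lives in the entry `word/document.xml`, and reads that entry with the standard `java.util.zip` classes so the example runs without extra dependencies. The class name `DocxInMemory`, the helper `readDocumentXml`, and the URL passed on the command line are assumptions made for this example, and the XML is assumed to be UTF-8 encoded. With POI on the classpath, the same `InputStream` could instead be handed to `new XWPFDocument(in)`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class DocxInMemory {

    /**
     * Reads the word/document.xml entry from a .docx stream.
     * Everything stays in memory; nothing is written to disk.
     * (Hypothetical helper; assumes the entry is UTF-8 encoded.)
     */
    static String readDocumentXml(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // Copy the current entry's bytes into an in-memory buffer.
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, n);
                    }
                    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        throw new IOException("No word/document.xml entry found; is this a .docx file?");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java DocxInMemory <url-of-docx>");
            return;
        }
        // Opening the stream is the "download"; the bytes are consumed as they arrive.
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            System.out.println(readDocumentXml(in));
        }
    }
}
```

Keeping the parsing in a method that takes any `InputStream` means the same code works for a network stream, a byte array, or a local file, which is the point the answer is making: the parser only needs the bytes, not a file on disk.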

Thank you. This answer made me realize how stupid this question was. My question was rooted in the fact that I wasn't thinking through how a web server and a browser interact. — Matt Shank, Jun 19 '14 at 15:41

score -1 · Answer 2 · edited May 23 '17 at 12:29


While the above answers are technically correct, what I believe you are asking about is called screen scraping; you can start here.

answered Jun 19 '14 at 15:34 by Stephen B.
