0

I'd like to write a program that parses an online .docx file to build an XML document. I know (or at least I think I know) that browsers need a plug-in to view .docx files, but I'm not that familiar with plug-ins or how they work. After looking at a .docx file in Notepad++, it seems clear to me that I won't be able to parse the binary data. Is there a way to simulate the opening of the .docx file for my purposes (EDIT: that is, without downloading and saving the file to my hard drive) within the abilities of any languages or libraries?

My question is more about the opening of the file without downloading than about the actual parsing of it, as I've looked into the Apache POI API for parsing the document in Java.

Matt Shank

2 Answers

4

Let me try to make this clear.

If you are viewing it, then you have downloaded it. You are "downloading" this webpage in order for your browser to render it. You're "downloading" a link to a document which tells you that there is a document. You cannot view the document unless you download it.

Yes, you have to download it.

Downloading a file is just getting it from the remote server.

Of course, you don't have to write it to your hard drive. You can download it and store it in memory, and then deal with it from memory.

Once you open a connection, you get an InputStream object to read bytes. You can pass that into the Apache POI libraries to read the file.
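To make the "download into memory" idea concrete, here is a minimal sketch that uses only the JDK. It relies on the fact that a .docx file is a ZIP archive whose main text lives in the entry word/document.xml, so the bytes seen in Notepad++ are compressed XML rather than an opaque format. The class name `DocxInMemory`, the helper names, and the URL passed on the command line are all placeholders of mine, and the sketch assumes the entry is UTF-8 encoded. With Apache POI on the classpath, the same stream could be handed to `new XWPFDocument(in)` instead of being unzipped by hand.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper class; the name is illustrative only.
class DocxInMemory {

    // Copies everything left in the stream into a byte array held in memory.
    // Nothing is written to disk.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Walks the ZIP entries of a .docx stream and returns the text of
    // word/document.xml, or null if the archive has no such entry.
    // Assumption: the XML is UTF-8 encoded.
    static String readDocumentXml(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // Reading from the ZipInputStream stops at the end of the current entry.
                    return new String(readAll(zip), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxInMemory <url-of-docx>");
            return;
        }
        // The URL is supplied by the caller; opening the stream is the "download".
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            System.out.println(readDocumentXml(in));
        }
    }
}
```

The resulting string is ordinary XML, so it can be fed to any XML parser; POI is the better choice when you want paragraphs, tables, and styles rather than raw markup.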

Anubian Noob

-1

While the above answers are technically correct, I believe what you are asking about is called screen scraping. You can start here.

Stephen B.