I have a project about getting XML files from URLs, scraping them, pulling the data, and then processing it. I am also building the URL from user input, so I need to check whether the URL actually points to an XML file before scraping it. Any ideas on how to do that? Basically, how can I check whether a URL contains an XML file or not?
-
Why don't you simply download the file (any file) that is present at the given URL and run it through an XML parser? If the parser succeeds, then it's a well-formed XML file, and you can scrape it as you like. There is no way to be sure that a file has XML content before actually looking into it. – office.aizaz Jan 23 '22 at 14:36
-
If I can check the file beforehand, it saves a lot of running time. I also wouldn't need to read and write that specific URL and waste time on it. @office.aizaz – Epitaph Jan 23 '22 at 14:42
-
A URL is like an address. Who lives at that address can only be known once you knock on the door. Unless the URL comes with some tags that suggest what kinds of files it is hosting, I don't think there is a way to know this ahead of time. The file extension can be one hint, but it is not a reliable indicator of the file content either. – office.aizaz Jan 23 '22 at 14:55
-
I think you are right. Thanks for the answer. @office.aizaz – Epitaph Jan 23 '22 at 15:02
1 Answer
Ways to know whether GETing a URL will retrieve XML...
Before retrieving the file
- Have an out-of-band guarantee.
- Inspect the Content-Type HTTP header of the response to a HEAD request¹.
After retrieving the file
- Inspect the Content-Type HTTP header of the response¹.
- Sniff the root element.
- Files.probeContentType(path)
- Parse via conforming XML parser without getting any well-formedness errors.
Note: Only parsing via a conforming XML parser is guaranteed to provide 100% determination.
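As a minimal sketch of the parser-based check, assuming the JDK's built-in SAX parser and a hypothetical helper named `isWellFormedXml` (neither is prescribed by the answer): the method returns `true` only when the whole stream parses without a fatal error. External entity loading is switched off here as a precaution when handling untrusted input.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name is illustrative only.
final class WellFormedCheck {

    // Returns true if the stream parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(InputStream in) throws IOException {
        SAXParser parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Harden the parser: no external entities or external DTDs.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
            factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            parser = factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            // A setup problem says nothing about the document itself.
            throw new IllegalStateException("XML parser setup failed", e);
        }
        try {
            // DefaultHandler rethrows fatal (well-formedness) errors.
            parser.parse(in, new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

Because SAX streams the input, this can be run directly on the HTTP response body without saving the file first; the parse stops at the first well-formedness error.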
¹ MIME assignments for XML data:
- application/xml (RFC 7303, previously RFC 3023)
- text/xml (RFC 7303, previously RFC 3023)
- Other MIME assignments used with XML applications.
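The HEAD-request check could look roughly like the sketch below, assuming Java 11+ and its `java.net.http` client. The class and method names (`XmlContentType`, `isXmlMediaType`, `headSaysXml`) are made up for illustration. Treating any media type ending in `+xml` as XML is an assumption based on the suffix convention in RFC 7303, used here to cover the "other MIME assignments" case. Keep in mind that the header is only the server's claim; a misconfigured server can still send non-XML under an XML type, or XML under a generic one.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative only.
final class XmlContentType {

    // True if a Content-Type header value names an XML media type.
    static boolean isXmlMediaType(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/xml")
                || mediaType.equals("text/xml")
                || mediaType.endsWith("+xml"); // assumption: RFC 7303 suffix convention
    }

    // Sends a HEAD request (no body is downloaded) and inspects Content-Type.
    static boolean headSaysXml(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() / 100 != 2) {
            return false; // not a 2xx response, so nothing to scrape
        }
        return response.headers()
                .firstValue("Content-Type")
                .map(XmlContentType::isXmlMediaType)
                .orElse(false);
    }
}
```

A caller might run `headSaysXml` first as a cheap filter for user-built URLs, and still parse whatever is downloaded afterwards, since only the parse is conclusive.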

kjhughes