Here's the link to the API, with an ID provided by another API. It should work, but currently it doesn't because they have not indexed that ID yet. The problem is that the link returns a 200 and an XML document with an empty root element.

xml link

I'm new-ish to Python, but basically the issue is that this ID returns an empty XML root. The response status is 200 and I can see it does return something, but it is empty, and ElementTree gives me an error:

with response.text:

try:
    xml = r.text
    if xml:
        root = ElementTree.parse(xml)
except ElementTree.ParseError:
    print("parse error")  # placeholder; the original body was just a print statement

with error: FileNotFoundError: [Errno 2] No such file or directory: '\n\n\n'

and also tried as bytes:

try:
    xml = r.content
    if xml:
        root = ElementTree.fromstring(xml)
except ElementTree.ParseError:
    print("parse error")  # placeholder; the original body was just a print statement

with error: TypeError: Parser must be a string or character stream, not NoneType

I can't seem to trigger the except block because the response is always 200. How can I check the validity/existence of the XML before parsing it?

I have thousands of docs to parse and this error breaks it all.

Marco Cano
  • under the exception I just had a random print statement* – Marco Cano Apr 20 '20 at 19:30
  • I am not sure I understand your question completely but in general there's nothing wrong with using try/except for flow control – NomadMonad Apr 20 '20 at 19:36
  • @NomadMonad that's what I thought but IDK why it's not acting as expected. Basically I just want to parse the xml returned and if it's empty skip it or ignore it. But keeps trying to parse it anyway – Marco Cano Apr 20 '20 at 19:55
  • There's nothing wrong with the page; it contains a validly formed xml document of only one tag. So it's really a question of how you define "empty" - shorter than a particular number of tags? Not containing certain expected info? etc. Once you define that, it should be easy to create an `if` statement to skip this type of pages. – Jack Fleeting Apr 20 '20 at 20:37
  • @JackFleeting how would you do it? It looks valid to me too, so it should be parseable, but how can I check it before it is parsed as XML? You see my problem? I can't check the validity without first parsing it, maybe I'm doing something wrong? – Marco Cano Apr 20 '20 at 20:42
  • See answer with a possible approach. – Jack Fleeting Apr 20 '20 at 20:46

1 Answer

Try something like this:

import requests
from lxml import etree

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=32277197&rettype=abstract"
resp = requests.get(url)

doc = etree.XML(resp.content)

# Count every element in the document, root included.
floor = doc.xpath('count(//*)')
if floor < 3:  # or whatever
    print("I'm outta here...")

Edit: Or with the standard library's xml.etree.ElementTree:

import xml.etree.ElementTree as ET

doc = ET.fromstring(resp.text)
# All descendant elements of the root (the root itself is not included).
floor = doc.findall(".//*")
if len(floor) < 3:
    print("I'm outta here...")

Output:

I'm outta here...
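
For a loop over thousands of documents, the same idea could be wrapped in a small helper that returns None for anything that should be skipped. This is only a sketch using the standard library: the name parse_nonempty and the sample payloads are made up for illustration, and "empty" is assumed here to mean "the root has no child elements", which may not match your definition.

```python
import xml.etree.ElementTree as ET


def parse_nonempty(xml_text):
    """Return the root element, or None if the document should be skipped.

    Hypothetical helper: skips blank input, malformed XML, and documents
    whose root element has no children.
    """
    # Blank or whitespace-only responses cannot be parsed at all.
    if not xml_text or not xml_text.strip():
        return None
    try:
        # fromstring() parses XML from a string (or bytes);
        # ElementTree.parse() expects a filename or file object instead.
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    # len(element) counts direct children only.
    if len(root) == 0:
        return None
    return root


# Made-up sample payloads, not real API responses.
samples = [
    "<PubmedArticleSet></PubmedArticleSet>",
    "<PubmedArticleSet><PubmedArticle>x</PubmedArticle></PubmedArticleSet>",
    "\n\n\n",
    "<broken>",
]

parsed = [parse_nonempty(s) for s in samples]
kept = [root for root in parsed if root is not None]
print(len(kept))
```

In the real loop you would pass r.text (or r.content) to the helper and continue to the next ID whenever it returns None.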
Jack Fleeting
  • AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath' and that led me to this: https://stackoverflow.com/questions/13455672/python-xpath-not-available-in-elementtree – Marco Cano Apr 20 '20 at 20:53
  • my bad, that's for lxml not xml library..., I can only use xml tho, so maybe I'll have to figure something else out. – Marco Cano Apr 20 '20 at 20:58
  • I think I got it: it does parse like you said but I need to check the existence of children like this: children = list(root.iter()) if children: do something... , that seems to work – Marco Cano Apr 20 '20 at 21:16
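
One caveat on the check in the last comment: Element.iter() yields the element itself before its descendants, so list(root.iter()) is never empty, even for a childless root. Testing the direct children is probably closer to what was intended. A minimal sketch, using a made-up empty document as the sample:

```python
import xml.etree.ElementTree as ET

# Made-up sample: a document consisting of a single empty root tag.
root = ET.fromstring("<PubmedArticleSet></PubmedArticleSet>")

all_nodes = list(root.iter())   # root itself plus all descendants
direct_children = list(root)    # direct children only

print(len(all_nodes))        # 1, because the root is included
print(len(direct_children))  # 0, so this is the check that detects "empty"
```

So "if len(root):" or "if list(root):" would skip the empty documents, while "if list(root.iter()):" would let them through.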