
I'm not the first person to ask a question like this. However, that was 7 months ago, so I did some searching to see whether the Pandas team has since addressed this officially, rather than requiring a workaround.

I first found this issue. Response from maintainer:

Thanks @wangrenz! This is a great use case. Large XML file support was in the works for both lxml and etree parsers (see #40131 ) We may not need an additional argument but have read_xml catch this exception and attempt the lxml workaround.

How large was your XML file? Can you post a reproducible example of its content (redact as needed)?

(The answer was > 10 GB)

Then the issue was closed, referencing a PR.

I read through the PR, but it wasn't really clear to me what exactly was being discussed or what the outcome was.

Can someone explain whether this was supposed to solve the problem, or, generally speaking, what the outcome was?

The Notes section of the documentation says:

This method is best designed to import shallow XML documents in following format which is the ideal fit for the two-dimensions of a DataFrame (row by column).

This seems to suggest a flat, two-dimensional structure rather than a deeply nested one, but terms like "is the ideal fit" and "best designed" aren't exactly precise. I'm left wondering whether my use case (being just 3 elements deep) is valid. DataFrames themselves don't seem to impose such a limitation.

My Pandas version:

1.4.1

My code:

import pandas as pd
df = pd.read_xml('./data/xml/boot.xml')
# >1GB file, sampled below

The sample shown in the documentation:

<root>
    <row>
        <column1>data</column1>
        <column2>data</column2>
        <column3>data</column3>
        ...
    </row>
    <row>
        ...
    </row>
    ...
</root>

My XML is one element deeper but, again, I don't think the documentation makes it clear whether there is a hard limit on depth:

<Events>
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
        <System>
            <Provider Guid="{9e814aad-3204-11d2-9a82-006008a86939}" />
            <EventID>0</EventID>
            <Version>3</Version>
            <Level>0</Level>
            <Task>0</Task>
            <Opcode>10</Opcode>
            <Keywords>0x0</Keywords>
            <TimeCreated SystemTime="2022-03-15T02:38:36.618192200-04:00" />
            <Correlation ActivityID="{00000000-0000-0000-0000-000000000000}" />
            <Execution ProcessID="4294967295" ThreadID="4294967295" ProcessorID="0" KernelTime="0" UserTime="0" />
            <Channel />
            <Computer />
        </System>
        <EventData>
            <Data Name="DiskNumber">       0</Data>
            <Data Name="IrpFlags">0x60043</Data>
            <Data Name="TransferSize">   16384</Data>
            <Data Name="Reserved">       0</Data>
            <Data Name="ByteOffset">145884773376</Data>
            <Data Name="FileObject">0xFFFF8607EDCE8700</Data>
            <Data Name="Irp">0xFFFFC10F8E1C0420</Data>
            <Data Name="HighResResponseTime">1598</Data>
            <Data Name="IssuingThreadId">    4192</Data>
        </EventData>
        <RenderingInfo Culture="en-US">
            <Opcode>Read</Opcode>
            <Provider>MSNT_SystemTrace</Provider>
            <EventName xmlns="http://schemas.microsoft.com/win/2004/08/events/trace">DiskIo</EventName>
        </RenderingInfo>
        <ExtendedTracingInfo xmlns="http://schemas.microsoft.com/win/2004/08/events/trace">
            <EventGuid>{3d6fa8d4-fe05-11d0-9dda-00c04fd7ba7c}</EventGuid>
        </ExtendedTracingInfo>
    </Event>
</Events>

My error is the same as in the older question about this:

MemoryError                               Traceback (most recent call last)
Input In [1], in <cell line: 3>()
      1 import pandas as pd
----> 3 df = pd.read_xml('./data/xml/boot.xml')

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:938, in read_xml(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, encoding, parser, stylesheet, compression, storage_options)
    738 @deprecate_nonkeyword_arguments(
    739     version=None, allowed_args=["path_or_buffer"], stacklevel=2
    740 )
   (...)
    757     storage_options: StorageOptions = None,
    758 ) -> DataFrame:
    759     r"""
    760     Read XML document into a ``DataFrame`` object.
    761 
   (...)
    935     2  triangle      180    3.0
    936     """
--> 938     return _parse(
    939         path_or_buffer=path_or_buffer,
    940         xpath=xpath,
    941         namespaces=namespaces,
    942         elems_only=elems_only,
    943         attrs_only=attrs_only,
    944         names=names,
    945         encoding=encoding,
    946         parser=parser,
    947         stylesheet=stylesheet,
    948         compression=compression,
    949         storage_options=storage_options,
    950     )

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:733, in _parse(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, encoding, parser, stylesheet, compression, storage_options, **kwargs)
    730 else:
    731     raise ValueError("Values for parser can only be lxml or etree.")
--> 733 data_dicts = p.parse_data()
    735 return _data_to_frame(data=data_dicts, **kwargs)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:389, in _LxmlFrameParser.parse_data(self)
    380 """
    381 Parse xml data.
    382 
   (...)
    385 and parse original or transformed XML and return specific nodes.
    386 """
    387 from lxml.etree import XML
--> 389 self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
    391 if self.stylesheet is not None:
    392     self.xsl_doc = XML(self._parse_doc(self.stylesheet))

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:545, in _LxmlFrameParser._parse_doc(self, raw_doc)
    531 from lxml.etree import (
    532     XMLParser,
    533     fromstring,
    534     parse,
    535     tostring,
    536 )
    538 handle_data = get_data_from_filepath(
    539     filepath_or_buffer=raw_doc,
    540     encoding=self.encoding,
    541     compression=self.compression,
    542     storage_options=self.storage_options,
    543 )
--> 545 with preprocess_data(handle_data) as xml_data:
    546     curr_parser = XMLParser(encoding=self.encoding)
    548     if isinstance(xml_data, io.StringIO):

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:636, in preprocess_data(data)
    627 """
    628 Convert extracted raw data.
    629 
   (...)
    632 StringIO/BytesIO) or is a string or bytes that is an XML document.
    633 """
    635 if isinstance(data, str):
--> 636     data = io.StringIO(data)
    638 elif isinstance(data, bytes):
    639     data = io.BytesIO(data)

MemoryError: 
  • In your traceback it appears pandas reads the file into an `io.StringIO` object, and that is what is failing. So this isn't a pandas issue - at least, except for the fact that pandas reads in the whole file. You seem to actually be memory constrained parsing the full file. See if simply `s = open(filepath).read()` causes the same issue? How big a file are you working with? – Michael Delgado Mar 19 '22 at 21:29
  • @MichaelDelgado 1.09GB. `s = open('./data/xml/test.xml').read() print(s[0:10])` runs successfully, outputting: ` <` – J.Todd Mar 19 '22 at 21:32
  • @MichaelDelgado edit, oh sorry, test.xml is smaller, you may be right – J.Todd Mar 19 '22 at 21:32
  • @MichaelDelgado Ok final answer (>1GB file) `s = open('./data/xml/boot.xml').read() print(s[0:10])` outputs: ` <`, no error. – J.Todd Mar 19 '22 at 21:33
  • I'm not sure if this actually does anything (since it's a notebook command, not lab), but I'm launching jupyterlab via `python3 -m jupyterlab --NotebookApp.max_buffer_size=10000000`. Have 32GB memory on the system. – J.Todd Mar 19 '22 at 21:34
  • oh - the PR you referenced was merged *yesterday*! so if you're working with the development version of pandas and it's up to date with main you'll have that fix - otherwise not :) – Michael Delgado Mar 19 '22 at 21:36
  • @MichaelDelgado ahh I didn't notice that, thanks! How would I pull the latest version? Do I need to build from source? – J.Todd Mar 19 '22 at 21:36
  • the [contributing docs](https://pandas.pydata.org/docs/development/contributing_environment.html#contributing-environment) have a guide for building pandas from source. it's not easy... good luck! otherwise you'll need to wait for the next release. – Michael Delgado Mar 19 '22 at 21:37
  • your other option, assuming most of the xml document is stuff you don't need in your dataframe, is to parse the data yourself with e.g. lxml's [iterparse](https://lxml.de/api/lxml.etree.iterparse-class.html), which is used in that PR. then you could structure the data into a list of lists or something else which can be used to construct a DataFrame directly. – Michael Delgado Mar 19 '22 at 21:43
  • @MichaelDelgado Yes, I think I might do something lazy like truncate to the max size I can load until the next release :) Thanks for the help. – J.Todd Mar 19 '22 at 21:44
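Edit: for anyone landing here before the fix is released, here is a minimal sketch of the streaming approach Michael suggests in the comments: parse the file incrementally with `lxml.etree.iterparse` and build the DataFrame manually, so the whole document never has to fit in memory at once. The element and attribute names match the sample XML in my question; the inline `BytesIO` with a trimmed sample event stands in for the real `./data/xml/boot.xml` path so the snippet is self-contained.

```python
import io

import pandas as pd
from lxml import etree

# Trimmed copy of one <Event> from the sample above; with the real
# >1 GB file you would pass the file path to iterparse instead.
sample = io.BytesIO(b"""<Events>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <EventData>
      <Data Name="DiskNumber">0</Data>
      <Data Name="TransferSize">16384</Data>
    </EventData>
  </Event>
</Events>""")

# The <Event> and <Data> elements live in this default namespace.
ns = "{http://schemas.microsoft.com/win/2004/08/events/event}"

rows = []
# iterparse yields each <Event> as soon as its closing tag is seen,
# so only one event subtree is held in memory at a time.
for _, event in etree.iterparse(sample, tag=f"{ns}Event"):
    rows.append({d.get("Name"): d.text for d in event.iter(f"{ns}Data")})
    event.clear()  # free the parsed subtree to keep memory usage flat

df = pd.DataFrame(rows)
print(df)
```

Note the values come out as strings (e.g. `"16384"`), so you'd still want to convert columns with `pd.to_numeric` afterwards as needed.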
