I'm not the first person to ask a question like this. However, that was 7 months ago, so I did some searching to see whether the pandas team has since addressed this officially, so that a work-around is no longer needed.
I first found this issue, where a maintainer responded:
Thanks @wangrenz! This is a great use case. Large XML file support was in the works for both lxml and etree parsers (see #40131 ) We may not need an additional argument but have read_xml catch this exception and attempt the lxml workaround.
How large was your XML file? Can you post a reproducible example of its content (redact as needed)?
(The answer was > 10 GB)
Then the issue was closed, referencing a PR.
I read through the PR, but it was not clear to me what exactly was being discussed or what the outcome was.
Can someone explain whether the PR was supposed to solve this problem? Or what, generally speaking, the outcome was?
The Notes section of the documentation says:
This method is best designed to import shallow XML documents in following format which is the ideal fit for the two-dimensions of a DataFrame (row by column).
This seems to suggest that a 2-D structure is preferred over a deeply nested one, but phrases like "is the ideal fit" and "best designed" aren't exactly precise. I'm left wondering whether my use case (just 3 elements deep) is supported or not. DataFrames themselves don't seem to impose such a limitation.
My Pandas version:
1.4.1
My code:
import pandas as pd
df = pd.read_xml('./data/xml/boot.xml')
# >1GB file, sampled below
The sample shown in the documentation:
<root>
<row>
<column1>data</column1>
<column2>data</column2>
<column3>data</column3>
...
</row>
<row>
...
</row>
...
</root>
My XML is one element deeper, but again, I don't think the documentation makes it clear whether there is a hard limit on depth:
<Events>
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Guid="{9e814aad-3204-11d2-9a82-006008a86939}" />
<EventID>0</EventID>
<Version>3</Version>
<Level>0</Level>
<Task>0</Task>
<Opcode>10</Opcode>
<Keywords>0x0</Keywords>
<TimeCreated SystemTime="2022-03-15T02:38:36.618192200-04:00" />
<Correlation ActivityID="{00000000-0000-0000-0000-000000000000}" />
<Execution ProcessID="4294967295" ThreadID="4294967295" ProcessorID="0" KernelTime="0" UserTime="0" />
<Channel />
<Computer />
</System>
<EventData>
<Data Name="DiskNumber"> 0</Data>
<Data Name="IrpFlags">0x60043</Data>
<Data Name="TransferSize"> 16384</Data>
<Data Name="Reserved"> 0</Data>
<Data Name="ByteOffset">145884773376</Data>
<Data Name="FileObject">0xFFFF8607EDCE8700</Data>
<Data Name="Irp">0xFFFFC10F8E1C0420</Data>
<Data Name="HighResResponseTime">1598</Data>
<Data Name="IssuingThreadId"> 4192</Data>
</EventData>
<RenderingInfo Culture="en-US">
<Opcode>Read</Opcode>
<Provider>MSNT_SystemTrace</Provider>
<EventName xmlns="http://schemas.microsoft.com/win/2004/08/events/trace">DiskIo</EventName>
</RenderingInfo>
<ExtendedTracingInfo xmlns="http://schemas.microsoft.com/win/2004/08/events/trace">
<EventGuid>{3d6fa8d4-fe05-11d0-9dda-00c04fd7ba7c}</EventGuid>
</ExtendedTracingInfo>
</Event>
</Events>
My error is the same as in the older question about this:
MemoryError Traceback (most recent call last)
Input In [1], in <cell line: 3>()
1 import pandas as pd
----> 3 df = pd.read_xml('./data/xml/boot.xml')
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
305 if len(args) > num_allow_args:
306 warnings.warn(
307 msg.format(arguments=arguments),
308 FutureWarning,
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:938, in read_xml(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, encoding, parser, stylesheet, compression, storage_options)
738 @deprecate_nonkeyword_arguments(
739 version=None, allowed_args=["path_or_buffer"], stacklevel=2
740 )
(...)
757 storage_options: StorageOptions = None,
758 ) -> DataFrame:
759 r"""
760 Read XML document into a ``DataFrame`` object.
761
(...)
935 2 triangle 180 3.0
936 """
--> 938 return _parse(
939 path_or_buffer=path_or_buffer,
940 xpath=xpath,
941 namespaces=namespaces,
942 elems_only=elems_only,
943 attrs_only=attrs_only,
944 names=names,
945 encoding=encoding,
946 parser=parser,
947 stylesheet=stylesheet,
948 compression=compression,
949 storage_options=storage_options,
950 )
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:733, in _parse(path_or_buffer, xpath, namespaces, elems_only, attrs_only, names, encoding, parser, stylesheet, compression, storage_options, **kwargs)
730 else:
731 raise ValueError("Values for parser can only be lxml or etree.")
--> 733 data_dicts = p.parse_data()
735 return _data_to_frame(data=data_dicts, **kwargs)
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:389, in _LxmlFrameParser.parse_data(self)
380 """
381 Parse xml data.
382
(...)
385 and parse original or transformed XML and return specific nodes.
386 """
387 from lxml.etree import XML
--> 389 self.xml_doc = XML(self._parse_doc(self.path_or_buffer))
391 if self.stylesheet is not None:
392 self.xsl_doc = XML(self._parse_doc(self.stylesheet))
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:545, in _LxmlFrameParser._parse_doc(self, raw_doc)
531 from lxml.etree import (
532 XMLParser,
533 fromstring,
534 parse,
535 tostring,
536 )
538 handle_data = get_data_from_filepath(
539 filepath_or_buffer=raw_doc,
540 encoding=self.encoding,
541 compression=self.compression,
542 storage_options=self.storage_options,
543 )
--> 545 with preprocess_data(handle_data) as xml_data:
546 curr_parser = XMLParser(encoding=self.encoding)
548 if isinstance(xml_data, io.StringIO):
File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\io\xml.py:636, in preprocess_data(data)
627 """
628 Convert extracted raw data.
629
(...)
632 StringIO/BytesIO) or is a string or bytes that is an XML document.
633 """
635 if isinstance(data, str):
--> 636 data = io.StringIO(data)
638 elif isinstance(data, bytes):
639 data = io.BytesIO(data)
MemoryError:
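For context, the kind of work-around I'm hoping to avoid looks roughly like the sketch below. It is not what `read_xml` does internally; it streams the file with the standard library's `xml.etree.ElementTree.iterparse` and flattens each `<Event>` into a dict. The names `NS`, `local_name`, and `iter_event_rows` are my own, the choice to keep only `<System>` and `<EventData>` is an assumption about which columns matter, and the inline sample is a trimmed version of my XML above.

```python
import io
import xml.etree.ElementTree as ET

# Namespace of <Event> in the sample above, in ElementTree's "{uri}" form.
NS = "{http://schemas.microsoft.com/win/2004/08/events/event}"


def local_name(tag):
    """Strip a leading "{namespace}" from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_event_rows(source):
    """Yield one flat dict per <Event> without building the whole tree.

    `source` is a file path or a binary file-like object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != NS + "Event":
            continue
        row = {}
        # Children of <System>: keep element text and any attributes.
        system = elem.find(NS + "System")
        for child in (system if system is not None else ()):
            name = local_name(child.tag)
            if child.text and child.text.strip():
                row[name] = child.text.strip()
            for attr, value in child.attrib.items():
                row[f"{name}_{attr}"] = value
        # <Data Name="..."> entries under <EventData> become columns by Name.
        for data in elem.iterfind(f"{NS}EventData/{NS}Data"):
            row[data.get("Name")] = (data.text or "").strip()
        yield row
        root.clear()  # drop finished <Event> elements so memory stays flat


# Trimmed-down sample in the same shape as my file (assumed representative).
sample = b"""<Events>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System>
      <EventID>0</EventID>
      <TimeCreated SystemTime="2022-03-15T02:38:36.618192200-04:00" />
    </System>
    <EventData>
      <Data Name="DiskNumber">       0</Data>
      <Data Name="TransferSize">   16384</Data>
    </EventData>
  </Event>
</Events>"""

rows = list(iter_event_rows(io.BytesIO(sample)))
```

With the real file I would presumably pass the generator straight to the constructor, e.g. `pd.DataFrame(iter_event_rows('./data/xml/boot.xml'))`. That still holds every row in memory, but not the parsed document tree or a full copy of the raw text, which is where the traceback above fails.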