I have a large XML file (10 GB) and I want to create a new XML file that contains only the first record of the large file. I tried to do this in Java and Python, but I got a memory error because I was loading the entire file into memory.

In another post, someone suggested that XSLT is the best solution for this. I'm new to XSLT and don't know how to do this with it, so please suggest a stylesheet that does it.

Sample of the large XML file (10 GB):

    <MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
        <MembershipInfoListItem>
            <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
            <ParticipationStatus>1</ParticipationStatus>
            <DataSharing>1</DataSharing>
            <MasterInfo>
              <Gender>1</Gender>
              <Salutation>1</Salutation>
              <FirstName>Hazel</FirstName>
              <LastName>Sweetman</LastName>
              <DateOfBirth>1957-03-25</DateOfBirth>
            </MasterInfo>
        </MembershipInfoListItem>
        <Header>
            <BusinessPartner>CHILIS_US</BusinessPartner>
            <FileType>mde</FileType>
            <FileNumber>17</FileNumber>
            <FormatVariant>1</FormatVariant>
            <NumberOfRecords>22</NumberOfRecords>
            <CreationDate>2016-06-07T12:00:46-07:00</CreationDate>
        </Header>
        <MembershipInfoListItem>
            <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
            <ParticipationStatus>1</ParticipationStatus>
            <DataSharing>1</DataSharing>
            <MasterInfo>
              <Gender>1</Gender>
              <Salutation>1</Salutation>
              <FirstName>Hazel</FirstName>
              <LastName>Sweetman</LastName>
              <DateOfBirth>1957-03-25</DateOfBirth>
            </MasterInfo>
        </MembershipInfoListItem>
        .....
        .....
    </MemberDataExport>

I want to create a file like this:

    <MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
        <MembershipInfoListItem>
            <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
            <ParticipationStatus>1</ParticipationStatus>
            <DataSharing>1</DataSharing>
            <MasterInfo>
              <Gender>1</Gender>
              <Salutation>1</Salutation>
              <FirstName>Hazel</FirstName>
              <LastName>Sweetman</LastName>
              <DateOfBirth>1957-03-25</DateOfBirth>
            </MasterInfo>
        </MembershipInfoListItem>
</MemberDataExport>

Is there any other way I can do this without getting a memory error? Please suggest that too.

mariz

2 Answers

You didn't show your code, so we can't know what you're doing right or wrong. However, I'd bet that a parser which builds the whole document tree in memory (a DOM-style parser) has to load the entire file, and that will surely cause an OutOfMemoryError for a 10 GB file.
So, just in this case, my approach would be to read the file line by line using a BufferedReader (see How to read a large text file line by line using Java?) and stop as soon as you reach a line that contains your closing tag, i.e. </MembershipInfoListItem>:

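// Needs java.io.BufferedReader, java.io.FileReader and java.io.IOException.
// Assumes the record's start and end tags each sit on their own line.
StringBuilder sb = new StringBuilder("<MemberDataExport xmlns=\"http://www.payback.net/lmsglobal/batch/memberdataexport\" xmlns:types=\"http://www.payback.net/lmsglobal/xsd/v1/types\">");
sb.append(System.lineSeparator());
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    boolean inRecord = false;
    while ((line = br.readLine()) != null) {
        // Skip everything before the first record, including the file's own
        // root start tag, which sb already contains
        if (!inRecord && line.contains("<MembershipInfoListItem>")) {
            inRecord = true;
        }
        if (!inRecord) {
            continue;
        }
        sb.append(line);
        sb.append(System.lineSeparator());
        if (line.contains("</MembershipInfoListItem>")) {
            break; // first record is complete; the rest of the file is never read
        }
    }
    sb.append("</MemberDataExport>");
} catch (IOException ex) {
    // log or rethrow
}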

Now sb.toString() will return what you want.
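Since the question also mentions Python, the same line-based idea can be sketched there as well. This is only a minimal sketch under the same assumption that the record's start and end tags each sit on their own line; the function name `first_record` and the small in-memory sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

ROOT_OPEN = ('<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" '
             'xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">')

def first_record(lines, tag="MembershipInfoListItem"):
    """Return a small document that holds only the first <tag> record.

    `lines` is any iterable of text lines, for example an open file.
    """
    out = [ROOT_OPEN]
    in_record = False
    for line in lines:  # a file object yields one line at a time
        if not in_record and "<" + tag + ">" in line:
            in_record = True  # first record starts here
        if in_record:
            out.append(line.rstrip("\n"))
            if "</" + tag + ">" in line:
                break  # record complete; later lines are never read
    out.append("</MemberDataExport>")
    return "\n".join(out)

# Hypothetical miniature of the big file, used only to try the function out.
sample = io.StringIO(
    ROOT_OPEN + "\n"
    "  <MembershipInfoListItem>\n"
    "    <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>\n"
    "  </MembershipInfoListItem>\n"
    "  <Header>\n"
    "    <FileType>mde</FileType>\n"
    "  </Header>\n"
    "</MemberDataExport>\n"
)

document = first_record(sample)
root = ET.fromstring(document)  # parsing the small result confirms it is well-formed
```

For the real file you would pass an open file object instead of the sample, e.g. `with open(path, encoding="utf-8") as f: first_record(f)`, where `path` stands for wherever the large file lives.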

walen

In Python (which you mentioned besides Java) you could use ElementTree.iterparse and then break parsing when you have found the element(s) you want to copy:

import xml.etree.ElementTree as ET

count = 0
copy = 1  # number of second-level elements (i.e. children of the root) you want to copy
level = -1

for event, elem in ET.iterparse('input1.xml', events=('start', 'end')):
    if event == 'start':
        level += 1
        if level == 0:
            # The first start event is the root: create an empty root with the same tag
            result = ET.ElementTree(ET.Element(elem.tag))

    if event == 'end':
        level -= 1
        if level == 0:
            # A child of the root has just been parsed completely
            count += 1
            if count <= copy:
                result.getroot().append(elem)
            else:
                break  # stop here; the rest of the file is never parsed

result.write('result1.xml', encoding='UTF-8', xml_declaration=True,
             default_namespace='http://www.payback.net/lmsglobal/batch/memberdataexport')
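To try the loop without the 10 GB file, it can be wrapped in a function that accepts any file-like object and run on a tiny in-memory document. The following is a sketch only: the name `copy_first_children` and the sample data are assumptions, and unlike the loop above it stops right after the last wanted child instead of parsing one more record.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.payback.net/lmsglobal/batch/memberdataexport"

def copy_first_children(source, copy=1):
    """Return a new ElementTree with the root and its first `copy` children."""
    result = None
    level = -1
    count = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            level += 1
            if level == 0:
                # Root start tag seen: create an empty root with the same tag
                result = ET.ElementTree(ET.Element(elem.tag))
        else:  # "end"
            level -= 1
            if level == 0:
                # A child of the root is complete: keep it
                result.getroot().append(elem)
                count += 1
                if count == copy:
                    break  # enough children collected; stop parsing
    return result

# Hypothetical miniature of the big file.
sample = (
    '<MemberDataExport xmlns="' + NS + '">'
    '<MembershipInfoListItem><MembershipIdentifier>PB00000000001956044'
    '</MembershipIdentifier></MembershipInfoListItem>'
    '<Header><FileType>mde</FileType></Header>'
    '<MembershipInfoListItem><MembershipIdentifier>PB00000000001956044'
    '</MembershipIdentifier></MembershipInfoListItem>'
    '</MemberDataExport>'
).encode("utf-8")

tree = copy_first_children(io.BytesIO(sample), copy=1)

# Serialize to memory the same way the answer writes to result1.xml.
out = io.BytesIO()
tree.write(out, encoding="UTF-8", xml_declaration=True, default_namespace=NS)
xml_bytes = out.getvalue()
```

Passing an open binary file instead of the `BytesIO` object should behave the same way, because `iterparse` reads its source in chunks and the loop ends before the rest of the input is requested.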

As for better namespace prefix preservation, I have had some success using the event start-ns and registering the collected namespaces with ET.register_namespace (note that this registry is global to the module, not tied to a single ElementTree):

import xml.etree.ElementTree as ET

count = 0
copy = 1  # number of second-level elements (i.e. children of the root) you want to copy
level = -1

for event, elem in ET.iterparse('input1.xml', events=('start', 'end', 'start-ns')):
    if event == 'start':
        level += 1
        if level == 0:
            result = ET.ElementTree(ET.Element(elem.tag))

    if event == 'end':
        level -= 1
        if level == 0:
            count += 1
            if count <= copy:
                result.getroot().append(elem)
            else:
                break

    if event == 'start-ns':
        # For start-ns events, elem is a (prefix, namespace URI) tuple
        ET.register_namespace(elem[0], elem[1])

result.write('result1.xml', encoding='UTF-8', xml_declaration=True)
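
The effect on prefixes can be checked on a small document that uses the `xmlns:h` declaration from the comment below. This is a sketch under assumptions: the function name `copy_with_prefixes` and the sample content are invented for illustration, and `ET.register_namespace` changes a module-wide registry, so it affects every later serialization in the same process.

```python
import io
import xml.etree.ElementTree as ET

def copy_with_prefixes(source, copy=1):
    """Copy the root and its first `copy` children, registering declared prefixes."""
    result = None
    level = -1
    count = 0
    events = ("start", "end", "start-ns")
    for event, payload in ET.iterparse(source, events=events):
        if event == "start-ns":
            # payload is a (prefix, namespace URI) tuple for this event
            prefix, uri = payload
            ET.register_namespace(prefix, uri)  # module-wide registry
        elif event == "start":
            level += 1
            if level == 0:
                result = ET.ElementTree(ET.Element(payload.tag))
        else:  # "end"
            level -= 1
            if level == 0:
                result.getroot().append(payload)
                count += 1
                if count == copy:
                    break
    return result

# Hypothetical sample using the prefix from the comment.
sample = (
    b'<root xmlns:h="http://www.w3.org/TR/html4/">'
    b'<h:table><h:td>Apples</h:td></h:table>'
    b'<h:table><h:td>Bananas</h:td></h:table>'
    b'</root>'
)

tree = copy_with_prefixes(io.BytesIO(sample))
xml_text = ET.tostring(tree.getroot(), encoding="unicode")
```

One limitation to keep in mind: `register_namespace` rejects prefixes of the form `ns` followed by digits with a `ValueError`, so an input that really declares a prefix such as `ns0` would need separate handling.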
Martin Honnen
  • This code works fine, but if the big file contains a namespace declaration like xmlns:h="http://www.w3.org/TR/html4/", the generated file contains xmlns:ns0="http://www.w3.org/TR/html4/" instead. I want to keep the original prefix; is there any solution for this? – mariz Jul 29 '16 at 10:07
  • It seems namespaces are reported by `iterparse` as separate events `start-ns` and `end-ns` where `elem` is then a tuple of `(prefix, namespaceURI)` but I have not found out how to copy them to the elements read out in the `end` event or how to make sure the serialization with the `write` method uses those namespaces. – Martin Honnen Jul 29 '16 at 10:39
  • @mariz, I have edited the answer with a second sample that also collects `start-ns` events and registers them on the `ElementTree`, that way it seems the serialization done with the `write` call uses them, at least here with Python 3.4. – Martin Honnen Jul 29 '16 at 11:59