I have the XML file shown below and want to replace several different strings in it, using sed, another command-line tool, or Python.

The file is 4GB, so performance also matters.

For example, replace `xmlns:leif="http://www.leif.org/concatenated-file/header-extension/2.0" xmlns:lei="http://www.leif.org/data/schema/leidata/2016"` with an empty string.

Replace `lei:` with an empty string.

Replace `leif:` with an empty string.

Replace `xmlns:lei="http://www.leif.org/data/schema/leidata/2016"` with an empty string.

Can I do all of this in one sed command?

This is what the XML file looks like:

<?xml version="1.0" encoding="UTF-8"?>
<lei:LEIData xmlns:leif="http://www.leif.org/concatenated-file/header-extension/2.0" xmlns:lei="http://www.leif.org/data/schema/leidata/2016">
<lei:LEIHeader>
<lei:ContentDate>2022-07-10T09:00:01Z</lei:ContentDate>
<lei:Originator>234234234234</lei:Originator>
<lei:FileContent>leif_FULL_PUBLISHED</lei:FileContent>
<lei:RecordCount>2166947</lei:RecordCount>
<lei:Extension>
<leif:Sources>
  <leif:Source>
    <leif:ContentDate>2022-07-09T11:01:36Z</leif:ContentDate>
    <leif:RecordCount>412</leif:RecordCount>
  </leif:Source>
  <leif:Source>
    <leif:ContentDate>2022-07-09T16:00:02Z</leif:ContentDate>
    <leif:RecordCount>3084</leif:RecordCount>
  </leif:Source>
</leif:Sources>
</lei:Extension>
</lei:LEIHeader>
<lei:LEIRecords>
<lei:LEIRecord xmlns:lei="http://www.leif.org/data/schema/leidata/2016">
  <lei:LEI>029200013A5N6ZD0F605</lei:LEI>
  <lei:Entity>
    <lei:LegalName xml:lang="en">AFRINVEST SECURITIES LIMITED</lei:LegalName>
    <lei:LegalAddress xml:lang="en">
      <lei:FirstAddressLine>27 GERRARD ROAD</lei:FirstAddressLine>
    </lei:LegalAddress>
    <lei:HeadquartersAddress xml:lang="en">
      <lei:FirstAddressLine>27 GERRARD ROAD</lei:FirstAddressLine>
    </lei:HeadquartersAddress>
    <lei:RegistrationAuthority>
      <lei:RegistrationAuthorityID>RA000469</lei:RegistrationAuthorityID>
    </lei:RegistrationAuthority>
    <lei:LegalJurisdiction>NG</lei:LegalJurisdiction>
    <lei:EntityCategory>GENERAL</lei:EntityCategory>
    <lei:LegalForm>
      <lei:EntityLegalFormCode>9999</lei:EntityLegalFormCode>
      <lei:OtherLegalForm>LIMITED</lei:OtherLegalForm>
    </lei:LegalForm>
    <lei:EntityStatus>ACTIVE</lei:EntityStatus>
    <lei:EntityCreationDate>2014-11-06T00:00:00Z</lei:EntityCreationDate>
  </lei:Entity>
  <lei:Registration>
    <lei:InitialRegistrationDate>2014-11-06T00:00:00Z</lei:InitialRegistrationDate>
    <lei:ValidationAuthority>
      <lei:ValidationAuthorityID>RA000469</lei:ValidationAuthorityID>
    </lei:ValidationAuthority>
  </lei:Registration>
</lei:LEIRecord>
</lei:LEIRecords>
</lei:LEIData>
Andrew
  • So you basically want to strip the xml of namespaces, right? See this answer for [Remove namespace and prefix from xml in python using lxml](https://stackoverflow.com/a/51972010/3589122). – GordonAitchJay Jul 15 '22 at 09:50
  • I am already trying that, and the Python code is still running after 15 minutes. The XML file is 4GB, so performance matters. sed is very fast, but I don't know how to use it for multiple replacements. – Andrew Jul 15 '22 at 10:26

1 Answer

Using sed (the command prints the result to standard output; redirect it to a new file to keep it):

$ sed -E 's/lei:|leif://g;s/ xmlns:lei=.*2016"| xmlns:leif=.*2016"//' input_file
<?xml version="1.0" encoding="UTF-8"?>
<LEIData>
<LEIHeader>
<ContentDate>2022-07-10T09:00:01Z</ContentDate>
<Originator>234234234234</Originator>
<FileContent>leif_FULL_PUBLISHED</FileContent>
<RecordCount>2166947</RecordCount>
<Extension>
<Sources>
  <Source>
    <ContentDate>2022-07-09T11:01:36Z</ContentDate>
    <RecordCount>412</RecordCount>
  </Source>
  <Source>
    <ContentDate>2022-07-09T16:00:02Z</ContentDate>
    <RecordCount>3084</RecordCount>
  </Source>
</Sources>
</Extension>
</LEIHeader>
<LEIRecords>
<LEIRecord>
  <LEI>029200013A5N6ZD0F605</LEI>
  <Entity>
    <LegalName xml:lang="en">AFRINVEST SECURITIES LIMITED</LegalName>
    <LegalAddress xml:lang="en">
      <FirstAddressLine>27 GERRARD ROAD</FirstAddressLine>
    </LegalAddress>
    <HeadquartersAddress xml:lang="en">
      <FirstAddressLine>27 GERRARD ROAD</FirstAddressLine>
    </HeadquartersAddress>
    <RegistrationAuthority>
      <RegistrationAuthorityID>RA000469</RegistrationAuthorityID>
    </RegistrationAuthority>
    <LegalJurisdiction>NG</LegalJurisdiction>
    <EntityCategory>GENERAL</EntityCategory>
    <LegalForm>
      <EntityLegalFormCode>9999</EntityLegalFormCode>
      <OtherLegalForm>LIMITED</OtherLegalForm>
    </LegalForm>
    <EntityStatus>ACTIVE</EntityStatus>
    <EntityCreationDate>2014-11-06T00:00:00Z</EntityCreationDate>
  </Entity>
  <Registration>
    <InitialRegistrationDate>2014-11-06T00:00:00Z</InitialRegistrationDate>
    <ValidationAuthority>
      <ValidationAuthorityID>RA000469</ValidationAuthorityID>
    </ValidationAuthority>
  </Registration>
</LEIRecord>
</LEIRecords>
</LEIData>
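
A more conservative variant of the same idea, as a sketch rather than something timed on the full 4GB file. It writes the result to a new file, stops each `xmlns` match at the closing quote with `[^"]*` instead of `.*`, and removes `lei:` / `leif:` only where they directly follow `<` or `</`, so text content and `xml:lang` are left alone. The file names `sample_in.xml` and `sample_out.xml` are hypothetical. It assumes each namespace declaration sits on one line, is double-quoted, and is preceded by a single space, as in the sample. `LC_ALL=C` is an assumption about speed: it makes sed work on bytes, which often helps on large files.

```shell
# Hypothetical file names; substitute your own paths.
src=sample_in.xml
dst=sample_out.xml

# Small stand-in for the real file, built from lines of the sample in the question.
cat > "$src" <<'EOF'
<lei:LEIData xmlns:leif="http://www.leif.org/concatenated-file/header-extension/2.0" xmlns:lei="http://www.leif.org/data/schema/leidata/2016">
<lei:FileContent>leif_FULL_PUBLISHED</lei:FileContent>
<leif:Sources>
</leif:Sources>
<lei:LegalName xml:lang="en">AFRINVEST SECURITIES LIMITED</lei:LegalName>
</lei:LEIData>
EOF

# One pass over the file, two expressions:
#  1. remove every xmlns:lei / xmlns:leif declaration, up to its closing quote
#  2. remove the lei: / leif: prefix only right after "<" or "</"
LC_ALL=C sed -E \
  -e 's/ xmlns:leif?="[^"]*"//g' \
  -e 's#(</?)leif?:#\1#g' \
  "$src" > "$dst"

cat "$dst"
```

To edit the file in place instead, keep `-E` and add `-i` (GNU sed: `sed -E -i ...`; BSD/macOS sed needs `-i ''`). Replacing `-E` with `-i` drops extended-regex support, so the `|`, `?` and `(...)` in the patterns stop working as intended.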
HatLess
  • Not working; it just runs through the complete XML file. – Andrew Jul 15 '22 at 11:53
  • I also tried that, but it only prints the output and does not change anything. As I said, it is a 4GB XML file, so it takes too much time. – Andrew Jul 18 '22 at 06:57
  • I am running the command on Linux. Performance is also very slow: a command like `sed -i 's#leif:#lei:#' gleif.xml` takes around 2 minutes, and a second sed command to replace another string takes around 2 minutes again. But the command you mentioned is still running after 10 minutes, and I don't know when it will finish. – Andrew Jul 18 '22 at 07:04
  • @Andrew Try running it on a smaller sample of your data first before running it on the actual file. Does this work `sed -E 's/[xl][me][li]f?(ns)?:(leif?=.*2016)?//g' file`? This also works on mac – HatLess Jul 18 '22 at 07:19
  • OK, I will try this now. – Andrew Jul 18 '22 at 07:41
  • The result is a bit strange: the file gets deleted from the directory after I run this command, and I don't know why. Can the command write its result to a new XML file instead? – Andrew Jul 18 '22 at 08:01
  • If you are not getting similar output, then your actual data is different from the sample provided with this question @Andrew. Run the code on the sample provided or check https://ideone.com/fvSIGb for a working sample. – HatLess Jul 18 '22 at 08:08
  • https://ideone.com/DXzTCO shows the correct output that I need. Now I just need to know how to write the result to a new file, or how to edit the current file in place. When I changed `-E` to `-i`, it threw an error in the sample. – Andrew Jul 18 '22 at 08:26
  • @Andrew, if `-i` is not working, then try redirecting the output to a new file e.g `sed -E 's/[xl][me][li]f?(ns)?:(leif?=.*2016)?//g' file > new_file` – HatLess Jul 18 '22 at 08:28