I am trying to strip a huge XML file down so that it no longer contains all the useless information. The file looks something like this:

App_Data App="MOD" Name="Genre" Value="Series"/><App_Data App="MOD" 
Name="Show_Type" Value="Series"/><App_Data App="MOD" Name="Billing_ID" 
Value="10092"/><App_Data App="MOD" Name="Licensing_Window_Start" 
Value="2019-05-07 00:00:00"/><App_Data App="MOD" 
Name="Licensing_Window_End" Value="2019-05-13 23:59:59"/><App_Data 
App="MOD" Name="Preview_Period" Value="0"/><App_Data App="MOD" 
Name="Display_As_New" Value="4"/><App_Data App="MOD" 
Name="Display_As_Last_Chance" Value="7"/><App_Data App="MOD" 
Name="Provider_QA_Contact" Value="NBC Universal"/><App_Data App="MOD" 
Name="Suggested_Price" Value="0.00"/><App_Data App="MOD" 

I need to find the values for Show_Type, Licensing_Window_End, and Display_As_New.

So, how can I turn my output string into something like this:

Name="Show_Type" Value="Series"
Name="Licensing_Window_End" Value="2019-05-13 23:59:59"
Name="Display_As_New" Value="4"

Currently, I have something like this:

  stripText(text) {
      return text.match(new RegExp("Show_Type" + "(.*)" + "/>"));

  }

But this only gets the first variable, and it includes some useless information such as the trailing /> part.

Emma
lakeIn231
  • Take a look here... https://stackoverflow.com/a/1732454/7692859 – Anis R. May 07 '19 at 21:19
  • If you really do need to parse attributes using regex, you can have a look at https://stackoverflow.com/questions/317053/regular-expression-for-extracting-tag-attributes – BlackPearl May 07 '19 at 21:20

4 Answers

Technically, you can parse the string as XML with DOMParser and loop through the resulting elements. You will need a few if statements to pick out the attributes you want.

// The <main> wrapper gives the fragment a single root element so it parses as XML.
const str = '<main><App_Data App="MOD" Name="Genre" Value="Series"/><App_Data App="MOD" Name="Show_Type" Value="Series"/><App_Data App="MOD" Name="Billing_ID" Value="10092"/><App_Data App="MOD" Name="Licensing_Window_Start" Value="2019-05-07 00:00:00"/><App_Data App="MOD" Name="Licensing_Window_End" Value="2019-05-13 23:59:59"/><App_Data App="MOD" Name="Preview_Period" Value="0"/><App_Data App="MOD" Name="Display_As_New" Value="4"/><App_Data App="MOD" Name="Display_As_Last_Chance" Value="7"/><App_Data App="MOD" Name="Provider_QA_Contact" Value="NBC Universal"/><App_Data App="MOD" Name="Suggested_Price" Value="0.00"/></main>';

const parser = new DOMParser();
const xmlDoc = parser.parseFromString(str, "text/xml");
const rows = xmlDoc.getElementsByTagName("App_Data");

// Log the Name and Value attributes of every App_Data element.
for (let z = 0; z < rows.length; z++) {
  console.log(rows[z].getAttribute("Name"), rows[z].getAttribute("Value"));
}
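
The "few if statements" mentioned above could look roughly like the sketch below. `pickAttributes`, `wanted` and `fakeElement` are hypothetical names, not part of the original answer. The helper accepts anything that exposes `getAttribute`, so in a browser it should accept the `rows` collection returned by `getElementsByTagName`; plain objects stand in for elements here so the sketch also runs where `DOMParser` is not available.

```javascript
// Hypothetical helper: keep only the rows whose Name attribute is in `wanted`.
function pickAttributes(rows, wanted) {
  const picked = {};
  for (const row of Array.from(rows)) {
    const name = row.getAttribute("Name");
    if (wanted.includes(name)) {
      picked[name] = row.getAttribute("Value");
    }
  }
  return picked;
}

// Stand-in for a parsed element (an assumption for illustration only).
function fakeElement(attrs) {
  return { getAttribute: (key) => (key in attrs ? attrs[key] : null) };
}

const sampleRows = [
  fakeElement({ App: "MOD", Name: "Genre", Value: "Series" }),
  fakeElement({ App: "MOD", Name: "Show_Type", Value: "Series" }),
  fakeElement({ App: "MOD", Name: "Licensing_Window_End", Value: "2019-05-13 23:59:59" }),
  fakeElement({ App: "MOD", Name: "Display_As_New", Value: "4" }),
];

const wanted = ["Show_Type", "Licensing_Window_End", "Display_As_New"];
const picked = pickAttributes(sampleRows, wanted);
console.log(picked);
```

If a name can occur more than once, this sketch keeps only the last value seen for it.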
imvain2

This expression might help you to do so:

^(Name=")(Show_Type"|Licensing_Window_End"|Display_As_New")(\s+Value="[A-Za-z0-9-:\s]+")([\/>\s]+)(.*)$

I have added several boundaries just to be safe. If you wish, you can reduce those boundaries. I have also added several capturing groups so that each part is easy to reference.
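
As a small illustration of how the capturing groups can be read, here is a sketch that runs the expression on one input with `exec`. The variable names (`attributeRegex`, `input`, `groups`, `kept`) are only illustrative.

```javascript
// Same expression as above, without the g flag so exec() always starts at index 0.
const attributeRegex = /^(Name=")(Show_Type"|Licensing_Window_End"|Display_As_New")(\s+Value="[A-Za-z0-9-:\s]+")([\/>\s]+)(.*)$/;
const input = 'Name="Licensing_Window_End" Value="2019-05-13 23:59:59"/><App_Data';

const groups = attributeRegex.exec(input);

// Groups 1-3 hold the Name and Value attributes; group 4 is the "/>" ending
// and group 5 is whatever follows it.
const kept = groups === null ? null : groups[1] + groups[2] + groups[3];
console.log(kept);
```

Because the expression is anchored with `^`, it only matches when the input starts with `Name="`, so a longer string would have to be split into such pieces first.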

Graph

This graph shows how the expression would work:

[Figure: graph of the expression]

Boundaries

One way to loosen the boundaries is to drop the explicit list of names, as in this expression:

^(Name=")([A-Za-z\s\x22_]+)(\s+Value="[A-Za-z0-9-:\s]+")([\/>\s]+)(.*)$

Graph

[Figure: graph of the reduced expression]

Performance

This JavaScript snippet measures the performance of this expression by running it 1 million times in a simple for loop on one of your inputs. To keep only the attributes, you can perform a string replace on your inputs using $1$2$3.

    const repeat = 1000000;
    const start = Date.now();

    // The expression is anchored, so the input has to start with Name=".
    const regex = /^(Name=")(Show_Type"|Licensing_Window_End"|Display_As_New")(\s+Value="[A-Za-z0-9-:\s]+")([\/>\s]+)(.*)$/;
    const string = 'Name="Licensing_Window_End" Value="2019-05-13 23:59:59"/><App_Data';
    let match;

    for (let i = 0; i < repeat; i++) {
      // Keep groups 1-3 (the Name and Value attributes) and drop the "/>" ending and the rest.
      match = string.replace(regex, "$1$2$3");
    }

    const end = Date.now() - start;
    console.log("YAAAY! \"" + match + "\" is a match  ");
    console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test.  ");
Emma

I would suggest using an XML parser first, then removing the fields you don't need, then saving the XML again. I would NOT recommend removing XML fields using a text string search: XML is structured data, so you should use the right tool for the job.

https://www.w3schools.com/xml/xml_parser.asp

tritium_3

I think this should work:

const text = `App_Data App="MOD" Name="Genre" Value="Series"/><App_Data App="MOD" 
Name="Show_Type" Value="Series fasfdasdf"/><App_Data App="MOD" Name="Billing_ID" 
Value="10092"/><App_Data App="MOD" Name="Licensing_Window_Start" 
Value="2019-05-07 00:00:00"/><App_Data App="MOD" 
Name="Licensing_Window_End" Value="2019-05-13 23:59:59"/><App_Data 
App="MOD" Name="Preview_Period" Value="0"/><App_Data App="MOD" 
Name="Display_As_New" Value="4"/><App_Data App="MOD" 
Name="Display_As_Last_Chance" Value="7"/><App_Data App="MOD" 
Name="Provider_QA_Contact" Value="NBC Universal"/><App_Data App="MOD" 
Name="Suggested_Price" Value="0.00"/><App_Data App="MOD"`

const result = text.match(/[Nn]ame\="(Show_Type|Licensing_Window_End|Display_As_New)"\s+[Vv]alue\="[^"]*"/g)

console.log(result)

I don't know how you will consume this data, but you would probably find it useful to model it as an object in which each name is a key and its value is an array of the matched values (I duplicated the shared string and changed the values in the duplicate to get a better example):

const text = `App_Data App="MOD" Name="Genre" Value="Series"/><App_Data App="MOD" 
Name="Show_Type" Value="Series"/><App_Data App="MOD" Name="Billing_ID" 
Value="10092"/><App_Data App="MOD" Name="Licensing_Window_Start" 
Value="2019-05-07 00:00:00"/><App_Data App="MOD" 
Name="Licensing_Window_End" Value="2019-05-13 23:59:59"/><App_Data 
App="MOD" Name="Preview_Period" Value="0"/><App_Data App="MOD" 
Name="Display_As_New" Value="4"/><App_Data App="MOD" 
Name="Display_As_Last_Chance" Value="7"/><App_Data App="MOD" 
Name="Provider_QA_Contact" Value="NBC Universal"/><App_Data App="MOD" 
Name="Suggested_Price" Value="0.00"/><App_Data App="MOD"
App_Data App="MOD" Name="Genre" Value="Series"/><App_Data App="MOD" 
Name="Show_Type" Value="Series 2"/><App_Data App="MOD" Name="Billing_ID" 
Value="10092"/><App_Data App="MOD" Name="Licensing_Window_Start" 
Value="2019-05-07 00:00:00"/><App_Data App="MOD" 
Name="Licensing_Window_End" Value="2020-05-13 00:59:59"/><App_Data 
App="MOD" Name="Preview_Period" Value="0"/><App_Data App="MOD" 
Name="Display_As_New" Value="15"/><App_Data App="MOD" 
Name="Display_As_Last_Chance" Value="7"/><App_Data App="MOD" 
Name="Provider_QA_Contact" Value="NBC Universal"/><App_Data App="MOD" 
Name="Suggested_Price" Value="0.00"/><App_Data App="MOD" 
`

const result = text.match(/[Nn]ame\="(Show_Type|Licensing_Window_End|Display_As_New)"\s+[Vv]alue\="[^"]*"/g)

const objectResult = {
  show_type: [],
  licensing_window_end: [],
  display_as_new: [],
}

result.forEach((e)=>{
  const nameAndValue = e.match(/[Nn]ame\="(Show_Type|Licensing_Window_End|Display_As_New)"\s+[Vv]alue\="([^"]*)"/)
  switch (nameAndValue[1]) {
    case "Show_Type":
      objectResult.show_type.push(nameAndValue[2])
      break;
    case "Licensing_Window_End":
      objectResult.licensing_window_end.push(nameAndValue[2])
      break;
    case "Display_As_New":
      objectResult.display_as_new.push(nameAndValue[2])
      break;
    default:
      break;
  }
})

console.log(objectResult)
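
The same grouping could also be sketched in a single pass with `String.prototype.matchAll`, which avoids running the expression a second time on every match. `groupValues` and `sample` are hypothetical names, the sample is a shortened stand-in for the real input, and the sketch assumes the attribute names contain no regex special characters.

```javascript
// Hypothetical helper: collect every Value for each requested Name, keyed by the lower-cased name.
function groupValues(text, names) {
  // Assumes the names contain no regex special characters.
  const pattern = new RegExp(`[Nn]ame="(${names.join("|")})"\\s+[Vv]alue="([^"]*)"`, "g");
  const grouped = {};
  for (const name of names) {
    grouped[name.toLowerCase()] = [];
  }
  // matchAll requires the g flag and yields one match (with its groups) per attribute pair.
  for (const [, name, value] of text.matchAll(pattern)) {
    grouped[name.toLowerCase()].push(value);
  }
  return grouped;
}

// Shortened sample input (an assumption for illustration only).
const sample = `<App_Data App="MOD" Name="Show_Type" Value="Series"/><App_Data App="MOD"
Name="Billing_ID" Value="10092"/><App_Data App="MOD"
Name="Licensing_Window_End" Value="2019-05-13 23:59:59"/><App_Data App="MOD"
Name="Display_As_New" Value="4"/><App_Data App="MOD" Name="Show_Type" Value="Series 2"/>`;

const grouped = groupValues(sample, ["Show_Type", "Licensing_Window_End", "Display_As_New"]);
console.log(grouped);
```

A name that never appears in the text simply keeps an empty array, so there is no null result to guard against.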
Pablo