
I have an XML file in the following format:

<classes>

 <subject lb="Fall Sem 2020">
  <name>Operating System</name>
  <credit>3</credit>
  <type>Theory</type>
  <faculty>Prof. XYZ</faculty> 
 </subject>

 <subject lb="Spring Sem 2020">
  <name>Web Development</name>
  <credit>3</credit>
  <type>Lab</type>
 </subject>

 <subject lb="Fall Sem 2021">
  <name>Computer Network</name>
  <credit>3</credit>
  <type>Theory</type>
  <faculty>Prof. ABC</faculty> 
 </subject>

 <subject lb="Spring Sem 2021">
  <name>Software Engineering</name>
  <credit>3</credit>
  <type>Lab</type>
 </subject>

</classes>

Expected Output:

Fall Sem 2020
Spring Sem 2020
Fall Sem 2021
Spring Sem 2021

I want to extract the values of the ```lb``` attribute into an array.

My attempt: I tried ```sed -n "/lb="/,\/"/p" file.xml```, but this command does not print the values of the ```lb``` attribute.

What could be the correct way to deal with this problem?

RavinderSingh13
Bogota
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Apr 15 '20 at 10:38
  • I know xmlstarlet is good for all XML-related operations, but currently I have some restrictions. That's why I'm using the ```sed``` command. – Bogota Apr 15 '20 at 10:40

2 Answers


Getting an attribute value from an XML element.

If no XML parser is available, with GNU sed:

sed -En 's/.* lb="([^"]+)".*/\1/p' file

Output:

Fall Sem 2020
Spring Sem 2020
Fall Sem 2021
Spring Sem 2021
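
To collect these values in a bash array, as the question asks, one option is `mapfile`. This is a minimal sketch, assuming bash 4 or newer and GNU sed. The temporary file and the variable names `xml_file` and `arr` are illustrative and merely stand in for the real input file:

```shell
#!/usr/bin/env bash
# Sample input copied from the question (trimmed to the lines that matter);
# the temporary file stands in for file.xml.
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<classes>
 <subject lb="Fall Sem 2020">
  <name>Operating System</name>
 </subject>
 <subject lb="Spring Sem 2020">
  <name>Web Development</name>
 </subject>
 <subject lb="Fall Sem 2021">
  <name>Computer Network</name>
 </subject>
 <subject lb="Spring Sem 2021">
  <name>Software Engineering</name>
 </subject>
</classes>
EOF

# mapfile -t stores each output line as one array element,
# so the spaces inside a value do not split it into separate words.
mapfile -t arr < <(sed -En 's/.* lb="([^"]+)".*/\1/p' "$xml_file")

# Print one element per line to confirm the values stayed intact.
printf '%s\n' "${arr[@]}"

rm -f "$xml_file"
```

Plain `arr=($(...))` splits the command output on every whitespace character, which is why each word ends up as its own element. Reading line by line avoids that.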
Cyrus
  • Also, I need some good tutorials on the regular expressions used in commands like this. Can you recommend some blogs or tutorials? – Bogota Apr 15 '20 at 10:56
  • [This](https://riptutorial.com/sed/example/8893/backreference) might help with the backreference (`\1`) I used and [this](http://www.skybert.net/unix/non-greedy-matching-in-sed/) with non-greedy matching (`[^"]+`). – Cyrus Apr 15 '20 at 10:59
  • Is there any way I can get this result into an array? I tried using ```arr=($(sed -En 's/.* lb="([^"]+)".*/\1/p' file))``` but this gives me each individual word as an array element. – Bogota Apr 15 '20 at 12:32
  • I suggest starting a new question. – Cyrus Apr 15 '20 at 12:33

Could you please try the following awk command, given that you cannot use XML tools.

awk '
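# Act only on lines that open a <subject> element with an lb attribute
/<subject lb="/{
  # Match the first quoted value only, so later attributes on the line are ignored
  match($0,/"[^"]*"/)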
  print substr($0,RSTART+1,RLENGTH-2)
}
' Input_file
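
A shorter variant of the same idea is to let awk split each line on the double quote, so the attribute value becomes the second field. This is a sketch under the assumption that `lb` is the first quoted value on the line. The variable name `result` is illustrative, and the sample lines are fed on standard input in place of Input_file:

```shell
#!/usr/bin/env bash
# Three sample lines from the question, fed on standard input instead of Input_file.
# With -F'"' the text between the first pair of double quotes is field $2.
result=$(printf '%s\n' \
  ' <subject lb="Fall Sem 2020">' \
  '  <name>Operating System</name>' \
  ' <subject lb="Spring Sem 2021">' |
  awk -F'"' '/<subject lb="/ { print $2 }')

printf '%s\n' "$result"
```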
RavinderSingh13