I'm trying to extract the English Name from the XML below, i.e. the Name value under language="eng" rather than language="chi".

Which Python regex would achieve this?

<?xml version="1.0" encoding="UTF-8"?>
 <BroadcastData creationDate="20150814232141">
     <ProviderInfo>
         <ProviderId>Profis</ProviderId>
         <ProviderName>ProfisLynx.</ProviderName>
     </ProviderInfo>
     <ScheduleData>
         <ChannelPeriod endTime="20150814233000" beginTime="20150814220000">
             <ChannelId>88</ChannelId>
             <Event duration="1800" beginTime="20150814220000">
                 <EventId>GR0018904021</EventId>
                 <DvbEventId>45481</DvbEventId>
                 <EventType>S</EventType>
                 <PreviewTime>0</PreviewTime>
                 <EpgProduction>
                     <EpgText language="eng">
                         <Name>Across The Strait</Name>
                         <Description>This programme looks at the happenings in Taiwan and its relationship with China. There'll be updated news on Taiwan and in-depth reports and discussions about current affairs issues in Taiwan.</Description>
                         <ExtendedInfo name="Contentid_ref">GR0018904021</ExtendedInfo>
                         <ExtendedInfo name="AudioTrack">chi</ExtendedInfo>
                         <ExtendedInfo name="Start_over_flag">0</ExtendedInfo>
                         <ExtendedInfo name="ProgrammeStatus">L</ExtendedInfo>
                     </EpgText>
                     <EpgText language="chi">
                         <Name>海峡两岸</Name>
                         <Description>中央电视台唯一的涉台时事新闻评论节目。节目宗旨是跟踪海峡热点，反映两岸民意，报导当日的近期台湾岛内的热点新闻,并对两岸各个层面的交流交往进行跟踪报道。</Description>
                         <ExtendedInfo name="AudioTrack">chi</ExtendedInfo>
                         <ExtendedInfo name="ProgrammeStatus">L</ExtendedInfo>
                     </EpgText>
                     <ParentalRating>0</ParentalRating>
                     <DvbContent>
                         <Content nibble2="0" nibble1="0"/>
                         <User nibble2="A" nibble1="0"/>
                     </DvbContent>
                     <DvbContent>
                         <Content nibble2="0" nibble1="0"/>
                         <User nibble2="0" nibble1="8"/>
                     </DvbContent>
                 </EpgProduction>
             </Event>
 ==============================================================
             <Event duration="1800" beginTime="20150814223000">
                 <EventId>GR0018906021</EventId>
                 <DvbEventId>45482</DvbEventId>
                 <EventType>S</EventType>
                 <PreviewTime>0</PreviewTime>
                 <EpgProduction>
                     <EpgText language="eng">
                         <Name>Asia Today</Name>
                         <Description>Tune in daily to receive the important news and latest social changes happening in Asia.</Description>
                         <ExtendedInfo name="Contentid_ref">GR0018906021</ExtendedInfo>
                         <ExtendedInfo name="AudioTrack">chi</ExtendedInfo>
                         <ExtendedInfo name="Start_over_flag">0</ExtendedInfo>
                         <ExtendedInfo name="ProgrammeStatus">L</ExtendedInfo>
                     </EpgText>
                     <EpgText language="chi">
                         <Name>今日亚洲</Name>
                         <Description>节目以亚洲人的视角报道亚洲、传达亚洲人的声音、展现亚洲的进步和发展,以及反映亚洲与世界其他地区的互动。</Description>
                         <ExtendedInfo name="AudioTrack">chi</ExtendedInfo>
                         <ExtendedInfo name="ProgrammeStatus">L</ExtendedInfo>
                     </EpgText>
                     <ParentalRating>0</ParentalRating>
                     <DvbContent>
                         <Content nibble2="0" nibble1="0"/>
                         <User nibble2="A" nibble1="0"/>
                     </DvbContent>
                     <DvbContent>
                         <Content nibble2="0" nibble1="0"/>
                         <User nibble2="0" nibble1="8"/>
                     </DvbContent>
                 </EpgProduction>
             </Event>
 ==============================================================
             <Event duration="1800" beginTime="20150814230000">
                 <EventId>GR0018908021</EventId>
                 <DvbEventId>45483</DvbEventId>
                 <EventType>S</EventType>
                 <PreviewTime>0</PreviewTime>
                 <EpgProduction>
                     <EpgText language="eng">
                         <Name>China News</Name>
                         <Description>A news programme made especially to cater to the needs of overseas Chinese and potential investors. The content include China and international news and news analysis.</Description>
                         <ExtendedInfo name="Contentid_ref">GR0018908021</ExtendedInfo>
                         <ExtendedInfo name="AudioTrack">chi</ExtendedInfo>
                         <ExtendedInfo name="Start_over_flag">0</ExtendedInfo>
                         <ExtendedInfo name="ProgrammeStatus">L</ExtendedInfo>
                     </EpgText>
                     <EpgText language="chi">
                         <Name>中国新闻</Name>
                         <Description>《中国新闻》是以海外华人、港澳台同胞、留学生、驻外使领馆及中资机构人员为目标的新闻节目。节目由国内外要闻、内地经济和社会新闻、对国内外重要新闻事件的分析组成。</Description>
                         <ExtendedInfo name="AudioTrack">chi</ExtendedInfo>
                         <ExtendedInfo name="ProgrammeStatus">L</ExtendedInfo>
                     </EpgText>
                     <ParentalRating>0</ParentalRating>
                     <DvbContent>
                         <Content nibble2="0" nibble1="0"/>
                         <User nibble2="A" nibble1="0"/>
                     </DvbContent>
                     <DvbContent>
                         <Content nibble2="0" nibble1="0"/>
                         <User nibble2="0" nibble1="8"/>
                     </DvbContent>
                 </EpgProduction>
             </Event>
 ==============================================================
         </ChannelPeriod>
     </ScheduleData>
 </BroadcastData>
==================================================================================================================
jonrsharpe
Adrian Tan

2 Answers

You had better not parse XML with a regex; doing so can produce unexpected results.

Try this out -- How do I parse XML in Python?
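
As a rough illustration of the parser route, here is a minimal sketch using the standard library's xml.etree.ElementTree. The SAMPLE string is a trimmed-down stand-in for the document in the question (an assumption, not the full feed), and english_names is a hypothetical helper name. The XML declaration is left out of the sample for simplicity.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the question's document (assumed, not the full file).
SAMPLE = """<BroadcastData>
  <ScheduleData>
    <ChannelPeriod>
      <Event>
        <EpgProduction>
          <EpgText language="eng"><Name>Across The Strait</Name></EpgText>
          <EpgText language="chi"><Name>海峡两岸</Name></EpgText>
        </EpgProduction>
      </Event>
      <Event>
        <EpgProduction>
          <EpgText language="eng"><Name>Asia Today</Name></EpgText>
          <EpgText language="chi"><Name>今日亚洲</Name></EpgText>
        </EpgProduction>
      </Event>
    </ChannelPeriod>
  </ScheduleData>
</BroadcastData>"""


def english_names(xml_text):
    """Return the text of every <Name> under <EpgText language="eng">."""
    root = ET.fromstring(xml_text)
    # ElementTree's limited XPath supports [@attrib='value'] predicates.
    return [name.text for name in root.findall(".//EpgText[@language='eng']/Name")]


print(english_names(SAMPLE))  # ['Across The Strait', 'Asia Today']
```

Note that a real file would need to be well-formed for this to work; separator lines placed after the closing root tag, for example, would make the parser raise an error.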

Community

If text contains the XML you have provided, then the following RegEx would work:

import re
print(re.findall(r'<EpgText\s+language="eng">\s*<Name>(.*?)</Name>', text, re.M | re.I))

This would display the following three results:

['Across The Strait', 'Asia Today', 'China News']

It would, though, be much safer to parse the XML using an XML library.
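
To see why a parser is the safer choice, here is a small sketch comparing the regex with the standard library's xml.etree.ElementTree. VARIANT is a hypothetical rewrite of one event from the feed that uses single-quoted attributes and an XML comment; both are legal XML, but neither appears in the question's data, so treat this as an assumed scenario rather than something the provider actually emits.

```python
import re
import xml.etree.ElementTree as ET

PATTERN = r'<EpgText\s+language="eng">\s*<Name>(.*?)</Name>'

# Hypothetical variant of the feed: same data, but with single-quoted
# attributes and a comment between the tags. Both are legal XML.
VARIANT = """<EpgProduction>
  <EpgText language='eng'><!-- title --><Name>Asia Today</Name></EpgText>
  <EpgText language='chi'><Name>今日亚洲</Name></EpgText>
</EpgProduction>"""

# The regex expects double quotes and nothing but whitespace before <Name>.
regex_hits = re.findall(PATTERN, VARIANT, re.M | re.I)

# The parser normalises quoting and (by default) drops comments.
root = ET.fromstring(VARIANT)
parser_hits = [n.text for n in root.findall(".//EpgText[@language='eng']/Name")]

print(regex_hits)   # []
print(parser_hits)  # ['Asia Today']
```

The regex is tied to one exact serialisation of the document, whereas the parser works on the structure, so equivalent XML written differently still yields the same result.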

Martin Evans