I'm given a txt file with lots of <sub> and </sub> tags. For example:

Influence of Zn on the photoluminescence of colloidal (AgIn)<sub>x</sub>Zn<sub>2(1-x)</sub>S<sub>2</sub> nanocrystals.

Now I'm trying to use regex to extract the information above to txt file, but my ideal output is

Influence of Zn on the photoluminescence of colloidal (AgIn)xZn2(1-x)S2 nanocrystals

My current code extracts the title with all the <sub> tags still in it. How can I get the ideal output?

Wiktor Stribiżew
Josie G
  • "Now I'm trying to use regex to extract the information above to txt file" What tool are you using? What does your regular expression look like? – larsks Jul 21 '22 at 01:42
  • I'm using Python; my expression is '(?<=\<ArticleTitle\>)([\s\S]+?)(?=\<\/ArticleTitle\>)' – Josie G Jul 21 '22 at 01:45
  • I would recommend you try to use an XML parser like [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) rather than a brittle regular expression, see this [classic question](https://stackoverflow.com/q/1732348). – import random Jul 21 '22 at 01:50

1 Answer

I am not sure if this is what you had in mind, but you could use this little Python script I made to print the lines without the <sub> and </sub> tags and use the output for further processing.

def main():
    # Tags to strip from every line.
    remove_list = ['<sub>', '</sub>']

    with open('data.txt') as current_file:
        for line in current_file:
            for item in remove_list:
                line = line.replace(item, '')
            # Each line already ends with a newline, so don't add another.
            print(line, end='')

if __name__ == '__main__':
    main()

Just save this code as main.py and put the XML data in a file named data.txt in the same directory.

Run the script as python3 main.py to display the output in the terminal, or as python3 main.py > output.txt to write the output to a text file (note that > overwrites the file; use >> to append).
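If you only want the titles rather than every line, here is a rough sketch of two ways to do that. The ArticleTitle tag name comes from the regex in your comment. The Article wrapper element and the SAMPLE string are my own assumptions for illustration, so adjust them to your real file. The first function stays with regex and strips the tags with re.sub. The second uses the standard library's xml.etree.ElementTree, where itertext() returns the text of an element including the text inside its child tags, and it only works if the input is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample record; the <Article> wrapper is an assumption,
# and the real file's structure may differ.
SAMPLE = (
    "<Article><ArticleTitle>Influence of Zn on the photoluminescence of "
    "colloidal (AgIn)<sub>x</sub>Zn<sub>2(1-x)</sub>S<sub>2</sub> "
    "nanocrystals.</ArticleTitle></Article>"
)


def titles_with_regex(text):
    """Extract each ArticleTitle body, then strip <sub> and </sub> from it."""
    raw_titles = re.findall(
        r"<ArticleTitle>(.*?)</ArticleTitle>", text, flags=re.DOTALL
    )
    return [re.sub(r"</?sub>", "", title) for title in raw_titles]


def titles_with_parser(text):
    """Same idea with an XML parser; requires well-formed XML input."""
    root = ET.fromstring(text)
    # itertext() yields the element's text plus the text of all child tags.
    return ["".join(node.itertext()) for node in root.iter("ArticleTitle")]


if __name__ == "__main__":
    print(titles_with_regex(SAMPLE)[0])
    print(titles_with_parser(SAMPLE)[0])
```

The parser version also drops any other inline tags (for example <sup> or <i>) without listing them one by one, which is why it tends to be less brittle than a regex.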

Desmanado