1

I'm working with a huge XML file and don't want to use XML software, because the file, exported from the PubMed website, has an incorrect structure that changes from time to time. Instead, I'd like to remove some XML nodes with a regex in Notepad++ or UltraEdit. How can I remove, for example, this whole line?

<ArticleId IdType="pii">S1806-83242018000100950</ArticleId>
David Larochette
neurogen

3 Answers

0

To remove every line in the file that contains an ArticleId element with IdType="pii", use this regex:

^.*<ArticleId IdType="pii">.*$

This won't work if the ending tag is not on the same line.
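For anyone who wants to try the pattern outside the editor, here is a minimal Python sketch. The sample text and the name `PII_LINE` are made up for illustration, and `re.MULTILINE` is needed so that `^` and `$` match at each line instead of only at the start and end of the whole string.

```python
import re

# re.MULTILINE lets ^ and $ match at the start and end of every line.
PII_LINE = re.compile(r'^.*<ArticleId IdType="pii">.*$', re.MULTILINE)

# Hypothetical sample; the "pubmed" value is a placeholder, not real data.
sample = (
    '<ArticleIdList>\n'
    '  <ArticleId IdType="pubmed">00000000</ArticleId>\n'
    '  <ArticleId IdType="pii">S1806-83242018000100950</ArticleId>\n'
    '</ArticleIdList>\n'
)

# Replace every matching line with nothing.
cleaned = PII_LINE.sub('', sample)
print(repr(cleaned))
```

Note that `$` stops in front of the line break, so the replacement leaves an empty line where the element used to be.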

David Larochette
0

Use this Perl regular expression search string:

^[\t ]*<ArticleId IdType="pii">.*</ArticleId>[\t ]*(?:\r?\n|\r|$)

This regular expression searches

  • ^ ... from beginning of a line
  • [\t ]* ... for 0 or more horizontal tabs or spaces (optional leading tabs/spaces)
  • <ArticleId IdType="pii"> ... this string
  • .* ... any character 0 or more times except newline characters
  • </ArticleId> ... this string
  • [\t ]* ... for 0 or more horizontal tabs or spaces (optional trailing tabs/spaces)
  • (?:...) ... a non-capturing group with an OR expression inside
  • \r?\n|\r|$ ... an optional carriage return followed by a line-feed, OR just a carriage return, OR end of line/file.

So (?:\r?\n|\r|$) matches

  • carriage return + line-feed, which is the line ending in DOS/Windows text files,
  • or just line-feed, which is the line ending in UNIX text files,
  • or just carriage return, which is the line ending in classic Mac OS text files (before Mac OS X).

$ does not match line ending characters. It is included only for the case where <ArticleId IdType="pii">.*</ArticleId> is found at the end of the file and the last line has no line ending.
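The same expression can be tried in Python as a rough sketch. The function name `strip_pii_lines` and the test strings are hypothetical. Two assumptions differ from the editors: Python needs `re.MULTILINE` for `^` and `$` to work per line, and with that flag `^` only matches after a line-feed, so this sketch covers LF and CRLF files but not files that use a lone carriage return as line ending.

```python
import re

# Same search string as above, compiled with MULTILINE for per-line anchors.
PII_LINE = re.compile(
    r'^[\t ]*<ArticleId IdType="pii">.*</ArticleId>[\t ]*(?:\r?\n|\r|$)',
    re.MULTILINE,
)

def strip_pii_lines(text: str) -> str:
    """Remove each matching line together with its line ending."""
    return PII_LINE.sub('', text)

# UNIX line endings
print(repr(strip_pii_lines('a\n  <ArticleId IdType="pii">X</ArticleId>\nb\n')))
# Windows line endings, with leading tab and trailing spaces
print(repr(strip_pii_lines('a\r\n\t<ArticleId IdType="pii">X</ArticleId>  \r\nb\r\n')))
# Element on the last line, file has no final line ending
print(repr(strip_pii_lines('a\n<ArticleId IdType="pii">X</ArticleId>')))
```

Because the line ending is part of the match, no empty line is left behind. When reading a real file in Python, opening it with `newline=''` keeps the original `\r\n` sequences instead of translating them.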

Another possible search string is:

[\t ]*<ArticleId IdType="pii">.*</ArticleId>[\t ]*(?:\r?\n|\r)?

Now the XML element to remove can also be within a line containing other tags, because ^ for beginning of line has been removed and matching the line ending is optional. So this expression is less line-restrictive than the search expression above.
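A small Python sketch of this unanchored variant follows. The sample strings are invented. It also shows one thing to watch for: `.*` is greedy, so if two matching elements sit on the same line, everything between them is removed as well. The lazy form `.*?` in `LAZY` is my own substitution for comparison, not part of the search string above.

```python
import re

# The unanchored search string as given (greedy .*).
GREEDY = re.compile(r'[\t ]*<ArticleId IdType="pii">.*</ArticleId>[\t ]*(?:\r?\n|\r)?')
# Assumption for comparison: the same pattern with a lazy .*? instead.
LAZY = re.compile(r'[\t ]*<ArticleId IdType="pii">.*?</ArticleId>[\t ]*(?:\r?\n|\r)?')

# Element embedded in a line with other tags: only the element is removed.
inline = '<ArticleIdList><ArticleId IdType="pii">X</ArticleId></ArticleIdList>\n'
print(GREEDY.sub('', inline))

# Two pii elements on one line with a doi element between them.
two = ('x<ArticleId IdType="pii">A</ArticleId>'
       '<ArticleId IdType="doi">D</ArticleId>'
       '<ArticleId IdType="pii">B</ArticleId>y')
greedy_result = GREEDY.sub('', two)  # the doi element is swallowed too
lazy_result = LAZY.sub('', two)      # the doi element survives
print(greedy_result)
print(lazy_result)
```

If the export always puts one element per line, the greedy and lazy forms behave the same.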

Mofi
-1

If you want to remove all lines with ArticleId regardless of their content or attributes, you can simply search for this:

<ArticleId.+<\/ArticleId>
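To see what this pattern actually matches, here is a short Python sketch with an invented sample (the "pubmed" value is a placeholder). It matches ArticleId elements of any IdType, but it covers only the element itself, not the indentation or the line ending.

```python
import re

ANY_ARTICLE_ID = re.compile(r'<ArticleId.+<\/ArticleId>')

# Hypothetical sample; values are placeholders.
sample = (
    '<ArticleIdList>\n'
    '  <ArticleId IdType="pubmed">00000000</ArticleId>\n'
    '  <ArticleId IdType="pii">S1806-83242018000100950</ArticleId>\n'
    '</ArticleIdList>\n'
)

result = ANY_ARTICLE_ID.sub('', sample)
print(repr(result))
```

Both elements are gone, but each line keeps its leading spaces and its line break. Also note that `<ArticleId` is a prefix of `<ArticleIdList`, so if the list tag and an element share a line, the match would start at the list tag.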