remove line from xml with regex

Question

I'm working on a huge XML file exported from the PubMed website. I don't want to use XML software because the file has an incorrect structure that changes from time to time, so I'd like to remove some nodes in Notepad++ or UltraEdit with a regex. How can I remove, for example, this whole line?

<ArticleId IdType="pii">S1806-83242018000100950</ArticleId>

You don't need regex to remove one single value. We need multiple lines to remove, to see what they have in common. — David Larochette, Jun 07 '18 at 15:52
Sure, but I need to remove many nodes with ArticleId IdType=... — neurogen, Jun 07 '18 at 15:54
I don't parse it... please tell me which regex would be best for this. — neurogen, Jun 07 '18 at 15:55
You want to remove every ArticleId element with a specific IdType attribute, together with their content ? — David Larochette, Jun 07 '18 at 15:56
Yes David, I want to remove <ArticleId IdType="pii">S1806-83242018000100950</ArticleId> and <ArticleId IdType="pii"> S1806-83242018000100950</ArticleId> and so on, all such lines. — neurogen, Jun 07 '18 at 15:58
You should repost your question specifying you want to remove elements that don't have any sub-elements. — David Larochette, Jun 07 '18 at 16:05
i assume it would remove anything inside but I want to remove ALL THE LINES WITH :) — neurogen, Jun 07 '18 at 16:15
So you need to include end of line and start of line : ^.*.*$ — David Larochette, Jun 07 '18 at 16:55
Here you go, Find `"']|"[^"]*"|'[^']*')*?\sIdType\s*=\s*(?:(['"])\s*pii\s*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>[\S\s]*?` Replace with nothing. _Specific pii_. — , Jun 07 '18 at 19:12
Though as suggested, it's not a good idea to parse XML's with regex, but there is no harm in trying to do simple pattern match for some trivial work. You can start with simple lazy regexes like ---- **** — Koder101, Jun 07 '18 at 19:15

David Larochette · Answer 1 · 2018-06-07T19:45:09.577

0

To remove every line in the file that contains an ArticleId element with IdType="pii", use this regex:

^.*<ArticleId IdType="pii">.*$

This won't work if the ending tag is not on the same line.
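
To see how this pattern behaves, here is a minimal sketch. It assumes Python's `re` module as a stand-in for the editor's regex engine (Notepad++ and UltraEdit use their own engines, so details may differ), and `sample_xml` is made-up data, not a real PubMed export.

```python
import re

# Hypothetical sample data for illustration only.
sample_xml = (
    "<ArticleIdList>\n"
    '  <ArticleId IdType="pubmed">12345678</ArticleId>\n'
    '  <ArticleId IdType="pii">S1806-83242018000100950</ArticleId>\n'
    "</ArticleIdList>\n"
)

# re.MULTILINE lets ^ and $ match at every line, as an editor search does.
pattern = re.compile(r'^.*<ArticleId IdType="pii">.*$', re.MULTILINE)

# Replace every match with nothing.
result = pattern.sub("", sample_xml)
print(result)
```

In this sketch the text of the matching line is removed, but the line break is not part of the match, so an empty line stays behind.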

edited Jun 07 '18 at 19:45

answered Jun 07 '18 at 19:39


Mofi · Answer 2 · 2018-06-09T12:28:24.713

Use Perl regular expression search string:

^[\t ]*<ArticleId IdType="pii">.*</ArticleId>[\t ]*(?:\r?\n|\r|$)

This regular expression string searches

^ ... from beginning of a line
[\t ]* ... for 0 or more horizontal tabs or spaces (optional leading tabs/spaces)
<ArticleId IdType="pii"> ... this string
.* ... any character 0 or more times except newline characters
</ArticleId> ... this string
[\t ]* ... for 0 or more horizontal tabs or spaces (optional trailing tabs/spaces)
(?:...) ... with a non-marking (non-capturing) group with an OR expression inside
\r?\n|\r|$ ... carriage return (optionally) and line-feed OR just carriage return OR end of line/file.

So (?:\r?\n|\r|$) matches

carriage return + line-feed, which is the line ending in DOS/Windows text files,
or just line-feed, which is the line ending in UNIX text files,
or just carriage return, which is the line ending in Mac text files prior to Mac OS X.

$ does not match line ending characters. It is added only for the case that <ArticleId IdType="pii">.*</ArticleId> is found at the end of the file with no line ending, i.e. the last line in the file has no line ending.
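
The search string explained above can be tried out with a short sketch. It assumes Python's `re` module in place of the editor's Perl regex engine, and the sample text is made up. One known difference: with `re.MULTILINE`, Python's `^` only matches after a line-feed, so files with bare carriage return line endings are not exercised here.

```python
import re

# Search string from the answer, unchanged.
pattern = re.compile(
    r'^[\t ]*<ArticleId IdType="pii">.*</ArticleId>[\t ]*(?:\r?\n|\r|$)',
    re.MULTILINE,
)

# Hypothetical sample: Windows line endings, a leading tab, trailing spaces,
# and a last line without any line ending.
sample_xml = (
    '<ArticleId IdType="doi">10.0000/example</ArticleId>\r\n'
    '\t<ArticleId IdType="pii">S1806-83242018000100950</ArticleId>  \r\n'
    '<ArticleId IdType="pii">EXAMPLE-0001</ArticleId>'
)

# Replace every match, including its line ending, with nothing.
result = pattern.sub("", sample_xml)
print(repr(result))
```

Both pii lines disappear together with their line endings, so no empty lines remain; the last line is caught by the `$` alternative.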

Also possible would be the search string:

[\t ]*<ArticleId IdType="pii">.*</ArticleId>[\t ]*(?:\r?\n|\r)?

Now the XML element to remove can also be within a line containing another tag, because ^ for the beginning of the line is removed and matching the line ending is optional. So this expression is less line-restrictive than the search expression above.
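
A minimal sketch of this less restrictive variant, again assuming Python's `re` module as a stand-in for the editor and using made-up sample text:

```python
import re

# Second search string from the answer: no ^ anchor, optional line ending.
pattern = re.compile(
    r'[\t ]*<ArticleId IdType="pii">.*</ArticleId>[\t ]*(?:\r?\n|\r)?'
)

# Hypothetical sample: the element shares its line with another tag.
sample_xml = (
    '<ArticleIdList> <ArticleId IdType="pii">EXAMPLE-0001</ArticleId>\n'
    '</ArticleIdList>\n'
)

result = pattern.sub("", sample_xml)
print(repr(result))
```

Note the side effect in this sketch: the line ending after the element is removed too, so the remaining `<ArticleIdList>` is joined with the following line. Also, `.*` is greedy, so if two ArticleId elements were on the same line, the match would run up to the last `</ArticleId>` on that line.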

score -1 · Answer 3 · answered Jun 07 '18 at 19:21


If you want to remove all lines with ArticleId regardless of their content or attributes, you can simply search for this:

<ArticleId.+<\/ArticleId>


Francesco De Rosa


This won't remove the whole line – David Larochette Jun 07 '18 at 19:41
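
The point of that comment can be checked with a small sketch, assuming Python's `re` module in place of the editor and made-up sample text:

```python
import re

# Search string from Answer 3, unchanged.
pattern = re.compile(r'<ArticleId.+<\/ArticleId>')

# Hypothetical sample with an indented ArticleId line.
sample_xml = (
    '  <ArticleId IdType="pii">EXAMPLE-0001</ArticleId>\n'
    '  <Title>Example</Title>\n'
)

result = pattern.sub("", sample_xml)
print(repr(result))
```

Only the element itself is removed here; the indentation and the line break stay, leaving a whitespace-only line. Because the pattern starts with the bare text `<ArticleId`, it would also begin a match at a tag such as `<ArticleIdList>` if a closing `</ArticleId>` followed on the same line.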
