Delete all article images using Archivarix search & replace

Question

I am using Archivarix to restore a website. It did not download any of the images, so every image on the site is now broken. Archivarix has a "Search & Replace" tool that accepts regular expressions, and the broken markup looks like this:

<a href="link" rel="bookmark">
<img width="840" height="450" src="path-to-image" class="entry-thumbnail wp-post-image" alt="">
</a>

I don't know regular expressions, but I thought that if a regular expression could target the image tags that have the class "wp-post-image", it could delete them all.

Searching the web, the only thing I found was <img .*?>, which matches every image tag.

The Archivarix regex dialect looks like it supports PCRE (silly help image available from post history).

You shouldn't use regex to parse HTML: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Christian Baumann, Sep 30 '20 at 05:27

tripleee · Accepted Answer · 2020-09-30T06:03:31.047

1

If your example is representative, try

<img (?:[^<>]* )?class="(?:[^<>"]* )?wp-post-image(?: [^<>"]*)?"(?: [^<>]*)?>

In brief, [^<>]* matches any string which doesn't contain < or >, and similarly [^<>"]* matches any string which also doesn't contain ". The grouping (?: ...)? says whatever is inside the parentheses is optional, and doesn't have to be there. With those, we can articulate an expression which says:

Match <img (with a space after)
Optionally, skip over as much as possible up until another space, followed by
class="
... optionally again skip up to a space before
wp-post-image
... optionally followed by more class names, followed by
"
optionally again followed by more attributes, followed by
>

The parts which don't have "optionally" up front are required. If your HTML is machine-generated it might be possible to come up with a stricter expression, but this should cope amicably with variations in the number of element attributes (alt="", width="480", etc) and their order, and more or fewer class names in the class= attribute.
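
As a quick sanity check outside Archivarix, the pattern can be tried in Python, whose `re` module accepts this syntax. This is only a minimal sketch: the sample markup is modelled on the question, the `href`, `src` and `site-logo` values are placeholders, the name `IMG_WITH_CLASS` is made up for illustration, and the trailing attribute group is written as optional, as the step list describes.

```python
import re

# Pattern from the answer, with the trailing attribute group marked optional
# so that class="..." may also be the last attribute in the tag.
IMG_WITH_CLASS = re.compile(
    r'<img (?:[^<>]* )?class="(?:[^<>"]* )?wp-post-image(?: [^<>"]*)?"(?: [^<>]*)?>'
)

# Sample markup modelled on the question; all URLs are placeholders.
html = (
    '<a href="link" rel="bookmark">\n'
    '<img width="840" height="450" src="path-to-image" '
    'class="entry-thumbnail wp-post-image" alt="">\n'
    '</a>\n'
    '<img src="logo.png" class="site-logo">\n'
)

# Replace every matching tag with the empty string, i.e. delete it.
cleaned = IMG_WITH_CLASS.sub("", html)
print(cleaned)
```

The image with `wp-post-image` is removed, while the second image, which has a different class, is left alone. In Archivarix itself only the pattern would go into the search field, with an empty replacement.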

edited Sep 30 '20 at 06:03

answered Sep 30 '20 at 05:56


You say you want to replace the image tags, so that's what this does. If you want to remove the `<a ...>...</a>` around it too, that will be slightly more complex, but probably not too hard to articulate once you understand how this works. Your help image didn't reveal how to match a newline, but I guess try with `\n` or simply a literal newline character if the tool lets you enter that. – tripleee Sep 30 '20 at 05:58
Yes, you said as much. That's why this regex requires that. – tripleee Sep 30 '20 at 06:00
OK, thanks, that got rid of the images. One more thing: what if the class is dynamic, for example wp-image-12109, where the number at the end differs? How can I do that? – Francis Alvin Tan Sep 30 '20 at 06:59
`wp-image-\d+` matches `wp-image-` followed by one or more digits. Generally, anything which isn't a regex metacharacter (`. ^ $ \ * + ? [ ] ( ) { }`) simply matches itself. – tripleee Sep 30 '20 at 07:04
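
The two follow-up points in these comments, a numbered class such as `wp-image-12109` and the link wrapped around the image, can be sketched in Python as well. The pattern below is an illustrative assumption, not something the answer spells out: `\d+` stands in for the changing number, `\s*` absorbs the newlines around the image, and the sample markup and the name `LINKED_IMG` are hypothetical.

```python
import re

# Hypothetical pattern: an <a ...> element whose only content is an <img>
# carrying a class of the form wp-image-<digits>.
LINKED_IMG = re.compile(
    r'<a [^<>]*>\s*'
    r'<img (?:[^<>]* )?class="(?:[^<>"]* )?wp-image-\d+(?: [^<>"]*)?"(?: [^<>]*)?>'
    r'\s*</a>'
)

# Made-up sample markup: one linked image and one ordinary text link.
html = (
    '<p>Intro</p>\n'
    '<a href="link">\n'
    '<img src="a.jpg" class="size-full wp-image-12109" alt="">\n'
    '</a>\n'
    '<a href="other">Read more</a>\n'
)

# Delete the whole link-plus-image block; text links do not match.
cleaned = LINKED_IMG.sub("", html)
print(cleaned)
```

Whether `\s` and `\d` behave the same way in the Archivarix tool depends on its regex dialect; both are standard in PCRE, so they should work if the tool is PCRE-based.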

score 0 · Answer 2 · answered Oct 01 '20 at 11:23

Archivarix CMS has a tool called "Remove broken images". With a single click it will scan and remove all images of that website that were not restored. The same with "Remove broken links" tool. Click on Tools in the top menu. No need to use Search & Replace for that task.

PS: Are you sure you set the right time range for your download? If those images are present in the Wayback Machine (WM), they will be restored. But if you set a narrow time range, only URLs that were saved by WM during that period will be restored. It's important to understand how WM works and what those timestamps mean. The timestamps are not related to a "website version" or a "page version"; they are the exact time when the exact URL was saved.
