We use a clone of Amazon S3 called GreenQloud Storage, but they are shutting down, so we have to migrate all our files to S3.
Our app is a little like a CMS in that our DB has fields containing HTML fragments, and these fragments may reference GreenQloud URLs, so we have to replace all of these URLs with S3 URLs.
The files themselves have already been migrated. Here is the same sample file in both storage providers:
- https://s.greenqloud.com/com.stample.s3/stample-1420827843028-spotlight.png
- https://stample-files.s3.amazonaws.com/stample-1420827843028-spotlight.png
I'm thinking of using an HTML parser to rewrite URL-bearing tags like `a` and `img`, as well as bare HTTP URLs found in text nodes, but I'm afraid of missing some URLs this way. Do you see any problem with that?
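Here's a rough sketch of the parser approach I'm considering, using Jsoup (the prefix constants are just derived from the sample URLs above, and the class name is arbitrary):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class UrlRewriter {

    // Prefixes taken from the sample URLs above
    static final String OLD_PREFIX = "https://s.greenqloud.com/com.stample.s3/";
    static final String NEW_PREFIX = "https://stample-files.s3.amazonaws.com/";

    // Rewrites GreenQloud URLs in an HTML fragment. parseBodyFragment also
    // normalizes fragments that are not well-formed.
    public static String rewriteFragment(String html) {
        Document doc = Jsoup.parseBodyFragment(html);

        // URL-bearing attributes on any element (covers a[href], img[src], ...)
        for (Element el : doc.select("[href], [src]")) {
            for (String attr : new String[] { "href", "src" }) {
                if (el.hasAttr(attr)) {
                    el.attr(attr, el.attr(attr).replace(OLD_PREFIX, NEW_PREFIX));
                }
            }
        }

        // Bare URLs sitting in text nodes
        for (Element el : doc.select("*")) {
            for (TextNode tn : el.textNodes()) {
                tn.text(tn.text().replace(OLD_PREFIX, NEW_PREFIX));
            }
        }
        return doc.body().html();
    }
}
```

One caveat I can see: re-serializing through the parser may normalize the markup itself (attribute quoting, whitespace, tag balancing), not just the URLs.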
I'm also considering regexes, but some people advise against using regexes to parse HTML. Then again, in my case I'm not sure this really counts as "parsing HTML", since I just want to replace one pattern with another.
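The regex alternative would be a plain replacement on the raw string, something like the following (the character class that terminates a URL is an assumption on my part):

```java
import java.util.regex.Pattern;

// Matches a GreenQloud URL and captures the object key. The terminating
// character class is a guess at where a URL can end inside HTML or text.
static final Pattern GQ_URL = Pattern.compile(
        "https?://s\\.greenqloud\\.com/com\\.stample\\.s3/([^\"'\\s<>)]+)");

static String rewriteWithRegex(String text) {
    return GQ_URL.matcher(text).replaceAll(
            "https://stample-files.s3.amazonaws.com/$1");
}
```

Since this never interprets the markup, it would behave the same on well-formed and broken fragments, and also on plain-string fields like mainPictureUrl.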
So, I would appreciate knowing which solution is the best / safest for this migration. I'm not so concerned about migration throughput / performance; what matters is that all the links are migrated accurately.
We use Java/Scala, and all the fields to migrate are in MongoDB, so any Java / MongoDB based snippets are welcome.
Also note that some of the old HTML fragments in our DB may not be well-formed, but a Java parser can generally fix that.
Thanks
Edit
A typical MongoDB document might look like:
```
{
    _id: ObjectId(xxx),
    title: "yyy",
    content: "HTML FRAGMENT CONTAINING GREENQLOUD URLS",
    mainPictureUrl: "GREENQLOUD URL"
}
```
I can't really give an example of the HTML fragments, as they come in many different shapes.
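For completeness, here's the migration loop I have in mind with the sync Java driver; the connection string and database / collection names are placeholders, and rewriteWithRegex is the helper sketched above (either rewrite approach would plug in here):

```java
import java.util.ArrayList;
import java.util.List;

import org.bson.Document;
import org.bson.conversions.Bson;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;

public class Migration {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("documents");

            // Only visit documents that still reference GreenQloud somewhere
            Bson filter = Filters.or(
                    Filters.regex("content", "s\\.greenqloud\\.com"),
                    Filters.regex("mainPictureUrl", "s\\.greenqloud\\.com"));

            for (Document doc : coll.find(filter)) {
                List<Bson> updates = new ArrayList<>();
                for (String field : new String[] { "content", "mainPictureUrl" }) {
                    String value = doc.getString(field);
                    if (value != null) {
                        updates.add(Updates.set(field,
                                UrlRewriter.rewriteWithRegex(value)));
                    }
                }
                coll.updateOne(Filters.eq("_id", doc.getObjectId("_id")),
                        Updates.combine(updates));
            }
        }
    }
}
```

The plan would be to run this against a copy of the collection first and diff a few documents by hand before touching production.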