We use a clone of Amazon S3 called GreenQloud Storage, but they are shutting down, so we have to migrate all our files to S3.
Our app is a little like a CMS in that our DB has fields containing HTML fragments, and these fragments may reference GreenQloud URLs, so we have to replace all of these URLs with S3 URLs.
The files themselves have already been migrated. Here is the same sample file in both storage providers:
- https://s.greenqloud.com/com.stample.s3/stample-1420827843028-spotlight.png
- https://stample-files.s3.amazonaws.com/stample-1420827843028-spotlight.png
I'm thinking of using an HTML parser to rewrite URL-bearing tags like `a` and `img`, as well as bare HTTP URLs found in text nodes, but I'm afraid of missing some URLs this way. Do you see any problem with that?
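Here's a rough sketch of the parser approach I'm considering, using Jsoup (the prefix constants are just derived from the sample URLs above, and the class name is arbitrary):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class UrlRewriter {

    // Prefixes taken from the sample URLs above
    static final String OLD_PREFIX = "https://s.greenqloud.com/com.stample.s3/";
    static final String NEW_PREFIX = "https://stample-files.s3.amazonaws.com/";

    // Rewrites GreenQloud URLs in an HTML fragment. parseBodyFragment also
    // normalizes fragments that are not well-formed.
    public static String rewriteFragment(String html) {
        Document doc = Jsoup.parseBodyFragment(html);

        // URL-bearing attributes on any element (covers a[href], img[src], ...)
        for (Element el : doc.select("[href], [src]")) {
            for (String attr : new String[] { "href", "src" }) {
                if (el.hasAttr(attr)) {
                    el.attr(attr, el.attr(attr).replace(OLD_PREFIX, NEW_PREFIX));
                }
            }
        }

        // Bare URLs sitting in text nodes
        for (Element el : doc.select("*")) {
            for (TextNode tn : el.textNodes()) {
                tn.text(tn.text().replace(OLD_PREFIX, NEW_PREFIX));
            }
        }
        return doc.body().html();
    }
}
```

One caveat I can see: re-serializing through the parser may normalize the markup itself (attribute quoting, whitespace, tag balancing), not just the URLs.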
I'm also considering regexes, but some people advise against using regexes to parse HTML. Then again, in my case I'm not sure this really counts as "parsing HTML", since I just want to replace one pattern with another.
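The regex alternative would be a plain replacement on the raw string, something like the following (the character class that terminates a URL is an assumption on my part):

```java
import java.util.regex.Pattern;

// Matches a GreenQloud URL and captures the object key. The terminating
// character class is a guess at where a URL can end inside HTML or text.
static final Pattern GQ_URL = Pattern.compile(
        "https?://s\\.greenqloud\\.com/com\\.stample\\.s3/([^\"'\\s<>)]+)");

static String rewriteWithRegex(String text) {
    return GQ_URL.matcher(text).replaceAll(
            "https://stample-files.s3.amazonaws.com/$1");
}
```

Since this never interprets the markup, it would behave the same on well-formed and broken fragments, and also on plain-string fields like mainPictureUrl.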
So, I would appreciate knowing which solution is the best / safest for this migration. I'm not so concerned about migration throughput / performance; what matters is that all the links are migrated accurately.
We use Java/Scala, and all the fields to migrate are in MongoDB, so any Java / MongoDB based snippets are welcome.
Also note that some of the old HTML fragments in our DB may not be well-formed, but a Java parser can generally fix that.
Thanks
Edit
A typical MongoDB document might look like:
```
{
    _id: ObjectId(xxx),
    title: "yyy",
    content: "HTML FRAGMENT CONTAINING GREENQLOUD URLS",
    mainPictureUrl: "GREENQLOUD URL"
}
```
I can't really give an example of the HTML fragments, as they come in many different shapes.
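For completeness, here's the migration loop I have in mind with the sync Java driver; the connection string and database / collection names are placeholders, and rewriteWithRegex is the helper sketched above (either rewrite approach would plug in here):

```java
import java.util.ArrayList;
import java.util.List;

import org.bson.Document;
import org.bson.conversions.Bson;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;

public class Migration {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("mydb").getCollection("documents");

            // Only visit documents that still reference GreenQloud somewhere
            Bson filter = Filters.or(
                    Filters.regex("content", "s\\.greenqloud\\.com"),
                    Filters.regex("mainPictureUrl", "s\\.greenqloud\\.com"));

            for (Document doc : coll.find(filter)) {
                List<Bson> updates = new ArrayList<>();
                for (String field : new String[] { "content", "mainPictureUrl" }) {
                    String value = doc.getString(field);
                    if (value != null) {
                        updates.add(Updates.set(field,
                                UrlRewriter.rewriteWithRegex(value)));
                    }
                }
                coll.updateOne(Filters.eq("_id", doc.getObjectId("_id")),
                        Updates.combine(updates));
            }
        }
    }
}
```

The plan would be to run this against a copy of the collection first and diff a few documents by hand before touching production.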