
Suppose I have two .dat files, one on my computer and the other on the other side of the earth, with data constantly being serialized into them through a QDataStream.

The data is parsed the same way – first some sort of ID and then an object associated with that particular ID.

QFile file("data.dat");
QDataStream stream(&file);

file.open(QIODevice::ReadWrite);

stream << id;         // ID goes in.
stream << dataObject; // Object with interesting data is serialized into the file.

file.close();

After a while, the first one might look something like this (illustrative, not syntactically correct):

//-------------------------------------DATA.DAT------------------------------------//

ID:873482025
 dataObject

ID:129845379
 dataObject

ID:836482455
 dataObject

ID:224964811
 dataObject

ID:625444876
 dataObject

ID:215548669
 dataObject

//-------------------------------------DATA.DAT------------------------------------//

But the second one hasn't quite caught up yet.

//-------------------------------------DATA.DAT------------------------------------//

ID:873482025
 dataObject

ID:129845379
 dataObject

ID:836482455
 dataObject

//-------------------------------------DATA.DAT------------------------------------//

Is it possible to take both files, detect the differences between them, and then "fuse" in the entries that are missing from the second but present in the first?

Obviously this could be achieved by writing a function that extracts the innards of the files, categorizes the contents individually, compares them, and so forth – but is there a way to do this by just handling the files themselves, without having to parse the contents individually?

    Possibly helpful: https://en.wikipedia.org/wiki/Delta_encoding – OMGtechy Aug 03 '15 at 18:33
  • Given that the "parser" is trivial, and the data stream operators are already written, what's the problem? It seems like a very trivial thing to do, unless I'm missing something. – Kuba hasn't forgotten Monica Aug 03 '15 at 18:51
  • Is there a 1:1 mapping between an ID and an object? That is, as long as the IDs are equal, are the objects guaranteed to be equal, too? – Kuba hasn't forgotten Monica Aug 03 '15 at 18:52
  • When you say 1:1 mapped, do you mean "one object for every ID"? If so, yes – there's only one object for every ID. Each object has got a set of unique private variables associated with the data collected. It's not so much functionality, more "is it possible" - in a theoretical sense. – Quoi Aug 03 '15 at 19:05
  • Sorry, now I understand what you mean: if two entries have the same ID, the information contained within the object will be invariably identical. – Quoi Aug 03 '15 at 19:09

1 Answer

  1. Read both files to extract Id sets.

  2. Read one of the files while appending the objects with missing Ids to the other file.

You can leverage QSet to do set arithmetic. Also, each object would need not only the streaming operators, but also a skipObject static method. I'm also ignoring how you discriminate object types.

typedef qint32 Id;

bool isOk(const QDataStream & str) { return str.status() == QDataStream::Ok; }

class Object {
  ...
public:
  static void skipObject(QDataStream & str) {
    qint8 format;
    str >> format;
    if (format == 0)
      str.skipRawData(32); // e.g. format 0 of this object is 32 bytes long
    ...
  }
};

QPair<QSet<Id>, bool> getIds(const QString & path) {
  QSet<Id> ids;
  QFile file(path);
  if (!file.open(QIODevice::ReadOnly)) return qMakePair(ids, false);
  QDataStream stream(&file);
  while (!stream.atEnd()) {
    Id id;
    stream >> id;
    Object::skipObject(stream);
    if (ids.contains(id))
      qWarning() << "duplicate id" << id << "in" << path;
    ids.insert(id);
  }
  return qMakePair(ids, isOk(stream));
}

bool copyIds(const QString & src, const QString & dst, const QSet<Id> & ids) {
  QFile fSrc(src), fDst(dst);
  if (! fSrc.open(QIODevice::ReadOnly)) return false;
  if (! fDst.open(QIODevice::WriteOnly | QIODevice::Append)) return false;
  QDataStream sSrc(&fSrc), sDst(&fDst);
  while (!sSrc.atEnd()) {
    Id id;
    sSrc >> id;
    if (ids.contains(id)) {
       Object object;
       sSrc >> object;
       sDst << id << object;
    } else
       Object::skipObject(sSrc);     
  }
  return isOk(sSrc) && isOk(sDst);
}

bool copyIds(const QString & src, const QString & dst) {
  auto idsSrc = getIds(src);
  auto idsDst = getIds(dst);
  if (!idsSrc.second || !idsDst.second) return false;
  auto ids = idsSrc.first - idsDst.first; 
  return copyIds(src, dst, ids);
}
  • Sidenote: What measures of optimization would you take if each object also contains something bigger, like a QImage? – Quoi Aug 03 '15 at 19:45
  • @Quoi Internally, your objects store their size I'm sure. You should be able to use `QDataStream::skipRawData` to jump over the objects you don't intend to keep. – Kuba hasn't forgotten Monica Aug 03 '15 at 20:17
  • @Quoi You might wish to refer to the *Serializing Qt Data Types* help topic, and to QImage's source code to understand how to skip over images. – Kuba hasn't forgotten Monica Aug 03 '15 at 20:34
  • Sorry to keep bantering on in the comments – but say I wanted to open the file and switch out an object with a certain ID for a new one. Would that be possible? – Quoi Aug 04 '15 at 09:11
  • @Quoi If the new object's size will be the same, then it's trivial. Seek past the id, and write a new object. Otherwise: 1. Re-write the ID with an invalid ID that means to ignore the object, and append the new object, with the proper ID, at the end. 2. Copy everything to a new file, while adding the object you desire. Finally, since this is getting complicated, it will be simpler to use sqlite and dump your objects as serialized blobs and let sqlite worry about making sure that the disk file doesn't get corrupt. Sqlite is very good at that - much better than any code you can write in a week. – Kuba hasn't forgotten Monica Aug 04 '15 at 16:59
  • I might switch over to SQL – I was just worrying about performance. How much would it suffer, do you think? Also, what's the consensus on storing images (5MB) in an SQL blob? I would store them separately – but it's difficult without revealing the identity of the object it belongs to; everything is encrypted. – Quoi Aug 04 '15 at 18:19
  • Performance wise, since you have to do **everything** that a database does to store the blobs **safely**, you won't be any worse off unless you cheat and do things unsafely. Read about how sqlite is tested, and why (it's used in avionics, for example). I dare you to write storage code that's equally robust in face of interrupted writes and similar common failures. You can't and you won't, not without spending lots of time and effort. Performance will be the least of your concerns if your whole data store is corrupt. Doing "simple" file I/O safely is not simple at all. – Kuba hasn't forgotten Monica Aug 04 '15 at 18:22