
Say I have multiple CSVs in the following format:

2018/05/11T00:05:45,true,happy
2018/05/11T01:33:45,false,mad
2018/05/11T02:23:45,true,sleepy

Assume that duplicate rows exist across the collection of CSV files. I will be ingesting the data into Elasticsearch, though not all at once. For example, I could ingest 3 CSV files today and 3 different files tomorrow. Further, I may not have access to tomorrow's files yet but must ingest today's files today, therefore I can't do a diff on today's/tomorrow's files. There could be duplicate rows across both sets of files, hence the need to generate an _id per row before ingest time to prevent duplicates in the Elastic index.

Using Python, how can I create a GUID for each row such that I could identify all of the duplicates?
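
A minimal sketch of one way this could work, using hashlib to derive a deterministic ID from each row's content so that identical rows always map to the same Elasticsearch _id; the file name emotions.csv and the \x1f field separator are placeholders, not part of the original data.

import csv
import hashlib

def row_id(row):
    # Join the fields with a separator that should not appear in the data,
    # then hash the result; identical rows yield identical digests.
    key = "\x1f".join(row).encode("utf-8")
    return hashlib.sha1(key).hexdigest()

with open("emotions.csv", newline="") as f:
    for row in csv.reader(f):
        doc_id = row_id(row)
        print(doc_id, row)  # use doc_id as the _id when indexing the row

If an actual GUID/UUID shape is required, uuid.uuid5(uuid.NAMESPACE_URL, "\x1f".join(row)) gives the same determinism in UUID form, since uuid5 is derived from the input string rather than generated randomly.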

user1040535
  • What do you mean by "create a GUID from"? Do you want to create a variant-1 or variant-2 UUID using the datetime from the row instead of the current datetime? – abarnert Jul 25 '18 at 19:08
  • How does creating a GUID for each row help you identify duplicates? – Blorgbeard Jul 25 '18 at 19:11
  • Do you even need a GUID? Create an empty `set` S. Then, as you go through the rows, convert each to a `tuple` and then check if it's in S. If it's not in S, it's unique and should be added to S. If it is in S, then it's a duplicate. – Jared Goguen Jul 25 '18 at 19:11
  • @Blorgbeard is right. Two identical values will still generate distinct GUIDs. That’s the point of GUIDs—they’re globally unique. – abarnert Jul 25 '18 at 19:13
  • I think you are thinking of `CHECKSUM`. I would concatenate all the values and then get their `CHECKSUM`. Look here: https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file to get started – sniperd Jul 25 '18 at 19:37
  • @sniperd There's really no need for a checksum here. Storing the data directly in a set, as suggested by Jared Goguen, should take around 46 bytes/line; storing MD5s instead should take around 36 bytes/line. Is it worth the extra CPU work, extra code complexity, and tiny but nonzero chance of a false positive to cut the small amount of storage by 20%? – abarnert Jul 25 '18 at 19:53
  • @abarnert you are right, I probably wouldn't use `CHECKSUM` now that I think about it. Really just wanted to point out that is what he was probably thinking about, not `GUID`. In the end it's going to depend what he's doing with the data, maybe use `HASHBYTES`, sets, just `join` on the strings, who knows! :) – sniperd Jul 25 '18 at 19:55
  • @sniperd Yeah, that makes sense. Especially to a novice, GUID, MD5, etc are all just meaningless made-up words, so it’s easy to mix them up. (I remember mixing up RC2 and RSA-2048 in a question on Usenet, which got the same kind of baffled followups—Why do you want a pair of RC2 keys? How are they supposed to be related?) – abarnert Jul 25 '18 at 20:05
  • Thanks for the feedback. I will be ingesting the data into Elasticsearch, though not all at once. For example, I could ingest 3 CSV files today and 3 different files tomorrow, and there could be duplicates across both sets, hence the need to generate an _id per row before ingest time to prevent duplicates in the Elastic index. – user1040535 Jul 25 '18 at 23:07
  • please see updated question above – user1040535 Jul 26 '18 at 14:17
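
A rough sketch of the in-memory set check suggested in the comments by Jared Goguen, assuming all of today's files can be read in one pass; the file names are placeholders.

import csv

seen = set()          # row tuples observed so far
unique_rows = []      # rows to send on to Elasticsearch

for path in ["day1_a.csv", "day1_b.csv"]:
    with open(path, newline="") as f:
        for row in csv.reader(f):
            key = tuple(row)
            if key in seen:
                continue  # duplicate row, skip it
            seen.add(key)
            unique_rows.append(row)

print(len(unique_rows), "unique rows")

Note that this only removes duplicates within the files processed in the same run; a content-derived _id, as in the hash sketch under the question, is what catches duplicates across ingests on different days.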

0 Answers