
Say I have multiple CSVs in the following format:

2018/05/11T00:05:45,true,happy
2018/05/11T01:33:45,false,mad
2018/05/11T02:23:45,true,sleepy

Assume that duplicate rows exist across the collection of CSV files. I will be ingesting the data into Elasticsearch, though not all at once. For example, I could ingest 3 CSV files today and 3 different files tomorrow. Further, I may not have access to tomorrow's files yet but must ingest today's files today, therefore I can't do a diff on today's/tomorrow's files. There could be duplicate rows across both sets of files, hence the need to generate an _id per row before ingest time to prevent duplicates in the Elastic index.

Using Python, how can I create a GUID for each row such that I could identify all of the duplicates?
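
A minimal sketch of one way this could work, using hashlib to derive a deterministic ID from each row's content so that identical rows always map to the same Elasticsearch _id; the file name emotions.csv and the \x1f field separator are placeholders, not part of the original data.

import csv
import hashlib

def row_id(row):
    # Join the fields with a separator that should not appear in the data,
    # then hash the result; identical rows yield identical digests.
    key = "\x1f".join(row).encode("utf-8")
    return hashlib.sha1(key).hexdigest()

with open("emotions.csv", newline="") as f:
    for row in csv.reader(f):
        doc_id = row_id(row)
        print(doc_id, row)  # use doc_id as the _id when indexing the row

If an actual GUID/UUID shape is required, uuid.uuid5(uuid.NAMESPACE_URL, "\x1f".join(row)) gives the same determinism in UUID form, since uuid5 is derived from the input string rather than generated randomly.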

user1040535
  • What do you mean by "create a GUID from"? Do you want to create a variant-1 or variant-2 UUID using the datetime from the row instead of the current datetime? – abarnert Jul 25 '18 at 19:08
  • How does creating a GUID for each row help you identify duplicates? – Blorgbeard Jul 25 '18 at 19:11
  • Do you even need a GUID? Create an empty `set` S. Then, as you go through the rows, convert each to a `tuple` and then check if it's in S. If it's not in S, it's unique and should be added to S. If it is in S, then it's a duplicate. – Jared Goguen Jul 25 '18 at 19:11
  • @Blorgbeard is right. Two identical values will still generate distinct GUIDs. That’s the point of GUIDs—they’re globally unique. – abarnert Jul 25 '18 at 19:13
  • I think you are thinking of `CHECKSUM`. I would concatenate all the values and then get their `CHECKSUM`. Look here: https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-of-a-file to get started – sniperd Jul 25 '18 at 19:37
  • @sniperd There's really no need for a checksum here. Storing the data directly in a set, as suggested by Jared Goguen, should take around 46 bytes/line; storing MD5s instead should take around 36 bytes/line. Is it worth the extra CPU work, extra code complexity, and tiny but nonzero chance of a false positive to cut the small amount of storage by 20%? – abarnert Jul 25 '18 at 19:53
  • @abarnert you are right, I probably wouldn't use `CHECKSUM` now that I think about it. Really just wanted to point out that is what he was probably thinking about, not `GUID`. In the end it's going to depend what he's doing with the data, maybe use `HASHBYTES`, sets, just `join` on the strings, who knows! :) – sniperd Jul 25 '18 at 19:55
  • @sniperd Yeah, that makes sense. Especially to a novice, GUID, MD5, etc are all just meaningless made-up words, so it’s easy to mix them up. (I remember mixing up RC2 and RSA-2048 in a question on Usenet, which got the same kind of baffled followups—Why do you want a pair of RC2 keys? How are they supposed to be related?) – abarnert Jul 25 '18 at 20:05
  • Thanks for the feedback. I will be ingesting the data into Elasticsearch, though not all at once. For example, I could ingest 3 CSV files today and 3 different files tomorrow, and there could be duplicates across both sets, hence the need to generate an _id per row before ingest time to prevent duplicates in the Elastic index. – user1040535 Jul 25 '18 at 23:07
  • please see updated question above – user1040535 Jul 26 '18 at 14:17
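
A rough sketch of the in-memory set check suggested in the comments by Jared Goguen, assuming all of today's files can be read in one pass; the file names are placeholders.

import csv

seen = set()          # row tuples observed so far
unique_rows = []      # rows to send on to Elasticsearch

for path in ["day1_a.csv", "day1_b.csv"]:
    with open(path, newline="") as f:
        for row in csv.reader(f):
            key = tuple(row)
            if key in seen:
                continue  # duplicate row, skip it
            seen.add(key)
            unique_rows.append(row)

print(len(unique_rows), "unique rows")

Note that this only removes duplicates within the files processed in the same run; a content-derived _id, as in the hash sketch under the question, is what catches duplicates across ingests on different days.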

0 Answers