I am trying to do fast time-based joins in Spark - this problem has come up many times in my work, and I have yet to find a real solution for it. We have log files like these:
orders.2017.08.01.log
access.2017.08.01.log
other.2017.08.01.log
orders.2017.08.02.log
access.2017.08.02.log
Let's say we have 20 different types of files, all partitioned by date, and all with timestamps in milliseconds. What we want is to create objects that combine all the events which happened within the same second.
Doing a join is too slow, and it gets slower the more files we want to join. I tried using zip instead, but that seems very artificial. Ideally, the process should scale linearly with the number of files. Is Spark even the right tool for this kind of job?
Example
orders.2017.08.01.log
timestamp, value1
12345, a
12346, b
access.2017.08.01.log
timestamp, value2
12345, c
12346, d
We want to get a dataframe like this:
timestamp, value1, value2
12345, a, c
12346, b, d
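For just those two files the join itself is trivial (a sketch, reusing the spark session from the snippet above):

// spark is the SparkSession built in the sketch above.
val orders = spark.read.option("header", "true").csv("orders.2017.08.01.log").toDF("timestamp", "value1")
val access = spark.read.option("header", "true").csv("access.2017.08.01.log").toDF("timestamp", "value2")

// Gives the table above: timestamp, value1, value2.
orders.join(access, Seq("timestamp"), "full_outer").show()

The pain is doing this with 20 inputs, where every additional file adds another join.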
Data is partitioned by date - log files are usually joined with the files for the same date, that is, orders.2017.08.01.log is joined with access.2017.08.01.log and the other *.2017.08.01.log files.
One common issue is that lines which occur around midnight sometimes end up in the wrong day's file; otherwise, most of the lines for a given date are in the file for that date.
It might be possible to process those files day by day and join them together (roughly as sketched below), but that seems very cumbersome for such a common problem. What would be an alternative tool for joining log files?
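For completeness, this is roughly what I mean by processing day by day (again only a sketch, reusing spark and logTypes from the first snippet; the date list is made up):

// Join all log types for each date separately, then union the per-day results.
val dates = Seq("2017.08.01", "2017.08.02")

val perDay: Seq[DataFrame] = dates.map { date =>
  val frames = logTypes.map { name =>
    spark.read.option("header", "true").csv(s"$name.$date.log")
      .withColumn("ts_sec", (col("timestamp").cast("long") / 1000).cast("long"))
      .drop("timestamp")
  }
  frames.reduce(_.join(_, Seq("ts_sec"), "full_outer"))
}

// Still does nothing about lines that spilled into the neighbouring day's file.
val all = perDay.reduce(_.union(_))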