
I am trying to do fast time-based joins in Spark. This problem has come up many times in my work, and I have yet to find a real solution for it. We have several kinds of log files:

orders.2017.08.01.log
access.2017.08.01.log
other.2017.08.01.log
orders.2017.08.02.log
access.2017.08.02.log

Let's say we have 20 different types of files, all partitioned by date, and all with timestamps in milliseconds. What we want is to create objects that combine all the events that happened within a given second.

Doing a join is too slow, and it gets slower the more files we want to join. I was trying to do zip instead, but that seems very artificial.
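
For context, the zip attempt looked roughly like this (a sketch in Scala; it assumes CSV-like files with a header and the timestamp in the first column, as in the example below):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// Parse "timestamp, value" lines into (timestamp, value) pairs.
def parse(path: String) =
  sc.textFile(path)
    .filter(!_.startsWith("timestamp"))          // drop the header line
    .map(_.split(",").map(_.trim))
    .map(a => (a(0), a(1)))

// zip only lines rows up positionally: both sides need the same number of
// partitions and the same number of elements per partition, so every file
// would have to contain exactly the same timestamps, once each, sorted
// identically - which is what makes this feel so artificial.
val zipped =
  parse("orders.2017.08.01.log").sortByKey(numPartitions = 1)
    .zip(parse("access.2017.08.01.log").sortByKey(numPartitions = 1))
    .map { case ((ts, v1), (_, v2)) => (ts, v1, v2) }
```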

Ideally, the process should scale linearly with the number of files. Is Spark even the right tool for this kind of job?

Example

orders.2017.08.01.log
timestamp, value1
12345, a
12346, b

access.2017.08.01.log
timestamp, value2
12345, c
12346, d

we want to get a dataframe like

timestamp, value1, value2
12345, a, c
12346, b, d

Data is partitioned by date - log files are usually joined with each other within the same date, that is, orders.2017.08.01.log is joined with access.2017.08.01.log and the other *.2017.08.01.log files.

One common issue is that lines which occur around midnight sometimes end up in the wrong file. Otherwise, most of the lines for a given date are in the file with that date.

It might be possible to process those files day by day and then join them together, but this seems very cumbersome for such a common problem. What would be an alternative tool for joining log files?
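
To make the day-by-day idea concrete, this is roughly what I have in mind (a sketch; `logNames` and `dates` stand in for the real ~20 file types and the full date range):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val logNames = Seq("orders", "access", "other")   // ~20 types in reality
val dates    = Seq("2017.08.01", "2017.08.02")    // full date range in reality

// For each date, chain full outer joins across all log types for that date,
// then union the per-day results. The per-day join chain is still the slow
// part, and lines that spilled into the neighbouring day's file are missed.
val perDay = dates.map { date =>
  logNames
    .map(name => spark.read
      .option("header", "true")
      .option("ignoreLeadingWhiteSpace", "true")
      .csv(s"$name.$date.log"))
    .reduce(_.join(_, Seq("timestamp"), "full_outer"))
}
val allDays = perDay.reduce(_ union _)
```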

avloss
  • Can you provide a sample format of the contents? When you say `partitioned by date` what do you mean? Partitioned in HIVE/HDFS, by year then by month then by date? or partitions are reflected in the file names? – Bala Dec 22 '17 at 13:46
  • In general - [joins in distributed systems should be used as a last resort](https://stackoverflow.com/q/46567578/6910411). _the process should scale linearly to the number of files_ - you might be able to get this on average depending on how much control you have over data collection process, who are the producers and consumers, acceptable error / data loss margin, latency bounds, available infrastructure and know-how. _Is the spark even the right tool for this kind of a job?_ - impossible to say given the information, and probably to broad for simple QA. Likely not. – zero323 Dec 22 '17 at 16:13
  • Well `joins` are really bad and slow, obviously, but what is an alternative? Example of the data appended to the question. – avloss Dec 22 '17 at 18:20

0 Answers