I'm working on a project that will determine whether or not I score an internship. The project focuses on stream processing and is due in 2 weeks. It's pretty simple: just deriving some statistics from a CSV file and displaying them in a GUI. The project looks something like this:

A provided CSV is formatted as

ID: int, OperatingSystem: str, Date: str, Score: int

I'm supposed to track the lowest, highest, and median scores

  • per OS,
  • per date, and
  • across the entire dataset

Then I'm supposed to define a data structure for creating a histogram, also per date, OS, and entire dataset. I can use any language that I want, but I'd prefer Python if possible.

The problem is that I've never done any stream processing work before, and I'm having trouble finding resources on how to actually put it into code. I've watched videos explaining Kafka and looked into the docs and code samples for the Faust and Maki Nage frameworks, but I've only gotten as far as crashing the program right off the bat and staring at doc pages scratching my head.

Are there any simple, well-documented stream processing libraries that I should look into? Additionally, are there any resources that demonstrate how to actually write code for these libraries? YouTube seems to focus only on architectures and UML diagrams without any practical demonstrations, and I'm beginning to worry that I'll never understand how to build this project.

Thanks, Geisha

1 Answer

This is just a pointer in the right direction. It doesn't have to be a class; you could also write a function with inner functions. You just need to persist some state.

This class runs the calculations for each line it reads.

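import csv

class Streamer:
    def __init__(self, path='file.csv'):
        # Keep state on the instance; class-level lists and dicts would be
        # shared by every Streamer.
        self.data = []

        # OS name -> {'high': ..., 'low': ..., 'median': ...}
        self.os = {}
        # Date string -> {'high': ..., 'low': ..., 'median': ...}
        self.date = {}
        # Whole dataset; values stay None until the first row arrives
        self.score = {
            'high': None,
            'low': None,
            'median': None,
        }

        with open(path, newline='') as f:
            reader = csv.reader(f)
            next(reader, None)  # Remember to strip the header first

            for line in reader:
                if not line:
                    continue  # skip blank lines

                es = [x.strip() for x in line]

                row = {
                    'id'    : int(es[0]),
                    'os'    : es[1],
                    'date'  : es[2],
                    'score' : int(es[3]),
                }

                self.calculate_os_median_high_low(row)
                self.calculate_date_median_high_low(row)
                self.calculate_score_median_high_low(row)

                self.data.append(row)

    def calculate_os_median_high_low(self, row):
        pass  # update self.os[row['os']] from row['score']

    def calculate_date_median_high_low(self, row):
        pass  # update self.date[row['date']] from row['score']

    def calculate_score_median_high_low(self, row):
        pass  # update self.score from row['score']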
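The `pass` stubs are where the real work goes. As a rough sketch of one way to fill them in (not the only way), the class below keeps a running low, high, and median using two heaps, plus a bucketed histogram in a `Counter`. The names `RunningStats`, `update`, and the `bucket_width` of 10 are my own assumptions, not something from the assignment; pick whatever bucket size suits your score range.

```python
import heapq
from collections import Counter, defaultdict


class RunningStats:
    """Tracks low, high, median and a histogram for a stream of integer scores."""

    def __init__(self, bucket_width=10):  # bucket_width is an assumed default
        self.low = None
        self.high = None
        self._lower = []  # max-heap (values stored negated): the smaller half
        self._upper = []  # min-heap: the larger half
        self.bucket_width = bucket_width
        self.histogram = Counter()  # bucket start -> number of scores

    def add(self, score):
        self.low = score if self.low is None else min(self.low, score)
        self.high = score if self.high is None else max(self.high, score)

        # Push onto the lower half, then move its largest value up, so every
        # value in _lower is <= every value in _upper.
        heapq.heappush(self._lower, -score)
        heapq.heappush(self._upper, -heapq.heappop(self._lower))
        # Rebalance so _lower never holds fewer items than _upper.
        if len(self._upper) > len(self._lower):
            heapq.heappush(self._lower, -heapq.heappop(self._upper))

        bucket = (score // self.bucket_width) * self.bucket_width
        self.histogram[bucket] += 1

    @property
    def median(self):
        if not self._lower:
            return None  # no data yet
        if len(self._lower) > len(self._upper):
            return -self._lower[0]  # odd count: middle value
        return (-self._lower[0] + self._upper[0]) / 2  # even count: average


overall = RunningStats()
per_os = defaultdict(RunningStats)    # OS name -> its own stats
per_date = defaultdict(RunningStats)  # date string -> its own stats


def update(row):
    """Feed one parsed row (a dict with 'os', 'date', 'score') into all groups."""
    score = int(row['score'])
    overall.add(score)
    per_os[row['os']].add(score)
    per_date[row['date']].add(score)
```

Each `add` costs O(log n) and the median is available at any point without re-sorting, which is the main reason to prefer this over appending to a list and calling `sorted` on every row. Note that the heaps still hold every score seen so far, so memory grows with the stream.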

If you want to be really clever, you could hand over the parsed list for each line and run the reading concurrently, so that the calculation functions are called from outside the reader and you save a lot of computation time. (In this case I would use Go instead, since concurrency is much easier and safer in Go than in Python.)
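If you do want to stay in Python, here is a minimal sketch of that split using only the standard library: one thread reads and parses rows, and the main thread consumes them from a bounded `queue.Queue`. The function names (`read_rows`, `consume`, `run`), the `SENTINEL` marker, and the queue size are assumptions made up for this example. Keep in mind that in CPython threads mostly help overlap file I/O with computation; for CPU-heavy calculations they are unlikely to give a speedup.

```python
import csv
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def read_rows(lines, out_queue):
    """Producer: parse CSV lines and hand each row to the queue."""
    reader = csv.DictReader(
        lines,
        fieldnames=['id', 'os', 'date', 'score'],
        skipinitialspace=True,  # tolerate "a, b, c" style rows
    )
    next(reader, None)  # skip the header row
    for row in reader:
        out_queue.put(row)
    out_queue.put(SENTINEL)  # tell the consumer there is nothing left


def consume(in_queue, handle_row):
    """Consumer: pull rows until the sentinel arrives."""
    while True:
        row = in_queue.get()
        if row is SENTINEL:
            break
        handle_row(row)


def run(lines, handle_row):
    """Read in a background thread while the calling thread does the math."""
    q = queue.Queue(maxsize=1000)  # bounded, so the reader cannot race far ahead
    producer = threading.Thread(target=read_rows, args=(lines, q))
    producer.start()
    consume(q, handle_row)
    producer.join()
```

`lines` can be any iterable of strings, so an open file works: `with open('file.csv', newline='') as f: run(f, handle_row)`, where `handle_row` is whatever callable updates your statistics for one row.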

mama