
I have set up Kafka Connect between my source and destination. For example:

I have a table in MySQL which I want to send to MongoDB. I have set up MySQL as the source and MongoDB as the sink, and it's working fine.

My MySQL table has a column called 'download_link' that holds an S3 download link to a PDF. With the current setup, Kafka Connect sends this link to MongoDB as-is. What I need is: after a message arrives from the MySQL source, I want to execute Python code that downloads the PDF file and extracts the text from it, so that what lands in MongoDB is the extracted text rather than the link. Is it possible to do something like this?

Can someone provide some resources on how I can achieve this?


1 Answer


I want to execute a python code ...

Kafka Connect cannot do this.

Since you're using Python, refer to this post: Does Kafka python API support stream processing?

You would run your Python stream processor after the source connector, send the data to new topic(s), then use a Connect sink on those topics.
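A minimal sketch of such a processor, assuming kafka-python, requests, and pypdf are available, and that the source connector writes flat JSON values; the topic and field names here are hypothetical:

```python
import io
import json

import requests
from kafka import KafkaConsumer, KafkaProducer
from pypdf import PdfReader

# Hypothetical topic names; the value is assumed to be a flat JSON object
# containing the row's 'download_link' column
consumer = KafkaConsumer(
    "mysql.mydb.documents",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for message in consumer:
    record = message.value
    # Download the PDF the row points at and extract its text
    pdf_bytes = requests.get(record["download_link"], timeout=30).content
    reader = PdfReader(io.BytesIO(pdf_bytes))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    # Replace the link with the extracted text and publish to a new topic,
    # which the MongoDB sink connector is then configured to read
    record["download_link"] = text
    producer.send("documents.enriched", record)
```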


Keep in mind that Kafka messages have a maximum size, so extracting large PDF text blobs and persisting the data in the topic(s) might not be the best idea. Instead, you could have the MongoDB writer application download the PDF before writing to the database, but as stated, you'd need to write Java to use Kafka Connect for that. Otherwise, you're left with other Python processes that consume from Kafka and write to Mongo.
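A rough sketch of that last option (a plain consumer writing straight to Mongo), assuming kafka-python and pymongo, with hypothetical topic, database, and collection names:

```python
import json

from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer(
    "mysql.mydb.documents",  # hypothetical source-connector topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
collection = MongoClient("mongodb://localhost:27017")["mydb"]["documents"]

for message in consumer:
    doc = message.value
    # ...download the PDF and swap the link for the extracted text here,
    # as in the stream-processor sketch above...
    collection.insert_one(doc)
```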

– OneCricketeer
  • Thank you for the information. Regarding "you're left with other Python processes that consume from Kafka and write to Mongo", can you explain this a bit or provide any resources for me to understand it better? – user_12 Jun 23 '21 at 00:29
  • I really liked Faust; it's really good and easy. I was able to take the message from the source connector, transform the data, push the modified message to another topic, and use that topic for the sink connector. – user_12 Jun 23 '21 at 04:36
  • I meant you can use a regular `KafkaConsumer` from any Python Kafka library, then use a Mongo client like you would in any other application. – OneCricketeer Jun 23 '21 at 15:29
  • Is `Spark Streaming` a solution to these problems? Since we have PySpark, we can write the code in Python too. I read about it and see that it was built for this purpose: you get the data from a Kafka topic, modify it, and push it to another source. Is that true, or should I stick with Faust? – user_12 Jun 26 '21 at 18:24
  • Spark _Structured_ Streaming can work, sure (the Streaming library is deprecated), if you already have a Spark cluster set up. Faust doesn't require a scheduler, as I wrote in the other answer. – OneCricketeer Jun 26 '21 at 21:46
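
For reference, the Faust pattern user_12 describes in the comments might look roughly like this (a minimal sketch with hypothetical topic names; the PDF download and text extraction are elided):

```python
import faust

app = faust.App("pdf-extractor", broker="kafka://localhost:9092")

# Hypothetical topics: raw rows from the source connector in, enriched
# records out for the MongoDB sink connector to pick up
source = app.topic("mysql.mydb.documents", value_type=bytes)
enriched = app.topic("documents.enriched", value_type=bytes)

@app.agent(source)
async def extract(stream):
    async for raw in stream:
        # ...parse the record, download the PDF, extract the text...
        await enriched.send(value=raw)

if __name__ == "__main__":
    app.main()
```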