Questions tagged [data-engineering]
69 questions
3
votes
2 answers
specify mysqlclient cflags and mysqlclient ldflags env vars manually while pip install apache-airflow-providers-mysql | Windows | no docker
I am trying to install an Airflow provider in my virtual environment.
pip install apache-airflow-providers-mysql
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [24 lines…

masterShifu
- 31
- 1
2
votes
0 answers
Exclude dbt packages from documentation
How can I remove the dbt package names from the projects shown in the dbt documentation? I tried to set this in dbt_project.yml, but with no luck.
dwh_airbnb_analytics:
  +tags: hellodatabricks
  +docs:
    show: true
dbt_utils:
  +docs:
    show: false
…

Derik Roby
- 37
- 4
1
vote
3 answers
How to pass only necessary features to pipeline after SelectKBest
I have a regular tabular dataset to which 100 features from the database are added.
I want to feed it into a regular sklearn.pipeline that includes preprocessing, encoding, some custom transformers, etc.
The penultimate estimator would be…

Nikitosiwe
- 33
- 6
1
vote
1 answer
I'm not receiving the keys in the Kafka content - Python/confluent
I'm trying to receive the message from the topic using a Python script.
from confluent_kafka import Consumer, KafkaError
import uuid
# Kafka broker details
broker = "something"
topic = "something"
group = str(uuid.uuid4())
# Kafka consumer…
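One thing that may be worth checking for a question like this: in confluent_kafka, a consumed message exposes its key through `msg.key()`, separately from `msg.value()`, and both return raw bytes or `None`. The sketch below only illustrates reading both fields under that assumption. `decode`, `describe_message`, and the stand-in `FakeMessage` class are hypothetical names for this demo, not part of the library or of the asker's script.

```python
from typing import Optional


def decode(raw: Optional[bytes]) -> Optional[str]:
    """Decode UTF-8 bytes, passing None through unchanged."""
    return raw.decode("utf-8") if raw is not None else None


def describe_message(msg) -> dict:
    """Return the key and value of any object exposing key() and value()."""
    return {"key": decode(msg.key()), "value": decode(msg.value())}


class FakeMessage:
    """Hypothetical stand-in for a consumed message, used only in this demo."""

    def __init__(self, key: Optional[bytes], value: Optional[bytes]):
        self._key, self._value = key, value

    def key(self) -> Optional[bytes]:
        return self._key

    def value(self) -> Optional[bytes]:
        return self._value


if __name__ == "__main__":
    print(describe_message(FakeMessage(b"user-1", b'{"id": 1}')))
    print(describe_message(FakeMessage(None, b"payload without a key")))
```

If the key comes back as `None` for every message, one possible explanation is that the producer never set a key, which would need to be checked on the producing side.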

Aleksandar
- 84
- 1
- 4
1
vote
0 answers
TIKTOK API - Generate Access Token error message: 'Timestamp has expired.'
I am trying to pull TikTok data from the TikTok API, and I am currently experiencing two problems:
1. I am getting the error message 'Timestamp has expired' even though I just renewed the authentication code. I tried generating a timestamp…

Shi Yun
- 11
- 1
1
vote
1 answer
SQL Left Join in a many to many relationship situation
I am trying to collect some average survey scores on employees by location over time. These people can transfer locations over time for various reasons. I have 2 tables.
Survey_Scores: this table houses individual survey scores. It contains 3…

Logan Nielsen
- 13
- 2
1
vote
1 answer
Create bar graph of unpaid transactions having transaction created and payment date
I need to meet the following requirement and I am not sure what the best way to do it is.
At work, we have a table with transactions that may or may not have been paid. The date the transaction was made is indicated in the field created_date and the…
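The excerpt is truncated, so the exact requirement is not known. Under one possible reading (for each day, count the transactions that had been created but not yet paid), the counts behind such a bar graph could be sketched as follows. The `payment_date` field name is an assumption taken from the question title, `unpaid_per_day` is a hypothetical helper, and a transaction paid on a given day is treated as paid for that day.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

# Each transaction is (created_date, payment_date); payment_date is None if unpaid.
Transaction = Tuple[date, Optional[date]]


def unpaid_per_day(
    transactions: List[Transaction], start: date, end: date
) -> Dict[date, int]:
    """Count, for each day in [start, end], transactions created but not yet paid."""
    counts: Dict[date, int] = {}
    day = start
    while day <= end:
        counts[day] = sum(
            1
            for created, paid in transactions
            if created <= day and (paid is None or paid > day)
        )
        day += timedelta(days=1)
    return counts


if __name__ == "__main__":
    sample = [
        (date(2024, 1, 1), date(2024, 1, 3)),  # paid two days after creation
        (date(2024, 1, 2), None),              # still unpaid
    ]
    for day, count in unpaid_per_day(sample, date(2024, 1, 1), date(2024, 1, 3)).items():
        print(day.isoformat(), count)
```

The resulting day-to-count mapping can be handed to whatever charting tool is in use; the same logic can also be expressed in SQL by joining the transactions against a calendar of dates.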

Ladislao Csulak
- 41
- 4
1
vote
1 answer
Ensuring Unique Dag ID on Apache Airflow
I'm setting up an Airflow cluster to be used by multiple teams. The teams work independently, and the DAGs are built according to the needs of the respective team.
I'm trying to ensure that the DAG id of each DAG is unique. Teams may use some id…
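One way to approach a requirement like this is a check, run before deployment, that collects the DAG ids each team declares and reports any id that appears more than once. The sketch below assumes the ids have already been gathered into a mapping from team name to DAG ids; how they are collected is left open, and `find_duplicate_dag_ids` is a hypothetical helper, not an Airflow API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicate_dag_ids(
    ids_by_team: Dict[str, Iterable[str]]
) -> Dict[str, List[str]]:
    """Map each DAG id declared more than once to the teams that declared it."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for team, dag_ids in ids_by_team.items():
        for dag_id in dag_ids:
            owners[dag_id].append(team)
    # Keep only ids with more than one declaration (across or within teams).
    return {dag_id: teams for dag_id, teams in owners.items() if len(teams) > 1}


if __name__ == "__main__":
    declared = {
        "team_a": ["etl_daily", "report_weekly"],
        "team_b": ["etl_daily", "cleanup"],
    }
    print(find_duplicate_dag_ids(declared))
```

An alternative design is to avoid collisions by convention, for example by requiring every DAG id to start with the owning team's name, and to use a check like this only to enforce the convention.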

ketankk
- 2,578
- 1
- 29
- 27
0
votes
0 answers
Vehicle sensor data/Telemetry AWS storage/streaming architecture setup
Dear community, I have recently been tasked with building a cloud architecture for a vehicle telemetry system, preferably in AWS. There are going to be around 50-100 vehicles sending JSON messages via MQTT to an IoT Core broker every 10 seconds, each message…

colmo007
- 38
- 6
0
votes
0 answers
Nifi: DeleteAzureDataLakeStorage blocked
I have a DeleteAzureDataLakeStorage processor, but in some executions it crashes.
In data provenance I only see:
Sometimes it runs in seconds and other times it hangs indefinitely.
The same thing appears in the Apache logs every time:
That could…

Miguel
- 157
- 6
0
votes
1 answer
I have an Azure Stream Analytics job. I want to make it process output every 15 minutes
I am getting some events from event hub and saving them in ADLS Gen 2 without performing any operation. Just saving the live events in ADLS Gen 2.
I am not doing any kind of sum, average, or filtering.
I want my job to update my blob every 15…

GURMEET SINGH
- 11
- 1
0
votes
0 answers
Using .whl project in Pyspark Interactive Env
I have created a PySpark project and wrapped it up in a .whl file, which I am then using as a package to start an interactive PySpark shell.
/pyspark --py-files sample-project.whl --name pyspark-test --jars…

Adidev-b
- 1
0
votes
0 answers
Debugging and troubleshooting singer-tap development for a SQL source using the Meltano SDK
Does anyone have experience developing SQL singer-taps using the Meltano SDK?
Problem
My biggest problem is troubleshooting the exceptions I am experiencing when testing my tap.
Context
I am developing in an Ubuntu environment running on a Windows…
0
votes
0 answers
How to make Azure Stream Analytics write output at a fixed interval of 15 minutes
I have an Azure Stream Analytics job.
I want to make my job write output to my database (ADLS Gen2) every 15 minutes.
Any suggestions on this?
I looked at the tumbling window, but I guess it is for counting events within a fixed time span.
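For context, a tumbling window is a series of fixed-size, non-overlapping time intervals: each event belongs to exactly one window, and what is done with the events in a window (counting, averaging, or passing them on unchanged) is a separate choice. The sketch below illustrates only the bucketing idea in plain Python; it is not Stream Analytics query syntax, and `window_start` and `batch_by_window` are hypothetical helpers. Windows are aligned to midnight here, which is an assumption of the sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

WINDOW = timedelta(minutes=15)


def window_start(ts: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor a timestamp to the start of its tumbling window."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Number of whole windows elapsed since midnight, times the window size.
    return midnight + ((ts - midnight) // size) * size


def batch_by_window(
    events: Iterable[Tuple[datetime, str]]
) -> Dict[datetime, List[str]]:
    """Group (timestamp, payload) pairs into non-overlapping 15-minute batches."""
    batches: Dict[datetime, List[str]] = defaultdict(list)
    for ts, payload in events:
        batches[window_start(ts)].append(payload)
    return dict(batches)


if __name__ == "__main__":
    events = [
        (datetime(2024, 1, 1, 10, 2), "a"),
        (datetime(2024, 1, 1, 10, 14, 59), "b"),
        (datetime(2024, 1, 1, 10, 15), "c"),
    ]
    for start, payloads in batch_by_window(events).items():
        print(start.isoformat(), payloads)
```

Each batch holds the raw events unchanged, which shows that windowing by itself does not require a count or any other aggregate.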

GURMEET SINGH
- 11
- 1
0
votes
1 answer
Trouble Retaining Changes and User Setup in Docker Container for Apache Airflow
I'm new to Docker and recently attempted to set up Apache Airflow within a Docker container following some online tutorials. Here's a breakdown of the steps I've taken:
I ran a Docker container using the command: docker run -it --rm -p 8888:8080…

Nabaraj Ghimire
- 1
- 2