Questions tagged [data-pipeline]
168 questions
28
votes
4 answers
Feeding .npy (numpy files) into tensorflow data pipeline
Tensorflow seems to lack a reader for ".npy" files.
How can I read my data files into the new tensorflow.data.Dataset pipeline?
My data doesn't fit in memory.
Each object is saved in a separate ".npy" file. Each file contains 2 different ndarrays as…

Sluggish Crow
- 383
- 1
- 3
- 5
21
votes
3 answers
How to access the response from Airflow SimpleHttpOperator GET request
I'm learning Airflow and have a simple question. Below is my DAG called dog_retriever:
import airflow
from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.operators.sensors import HttpSensor
from…

Rachel Lanman
- 499
- 1
- 5
- 15
15
votes
1 answer
Is it possible to write a luigi wrapper task that tolerates failed sub-tasks?
I have a luigi task that performs some non-stable computations. Think of an optimization process that sometimes does not converge.
import luigi
class MyOptimizer(luigi.Task):
    input_param = luigi.Parameter()
    output_filename =…

DalyaG
- 2,979
- 2
- 16
- 19
12
votes
1 answer
Implementing luigi dynamic graph configuration
I am new to luigi and came across it while designing a pipeline for our ML efforts. Though it wasn't suited to my particular use case, it had so many extra features that I decided to make it fit.
Basically what I was looking for was a way to be able to…

Veltzer Doron
- 934
- 2
- 10
- 31
9
votes
1 answer
Truncate DynamoDb or rewrite data via Data Pipeline
It is possible to dump DynamoDb via Data Pipeline and also to import data into DynamoDb. The import works, but the data is always appended to the data that already exists in DynamoDb.
So far I have found working examples that scan DynamoDb and delete items one…

Vladimir Gilevich
- 861
- 1
- 10
- 17
6
votes
1 answer
Pipeline from AWS RDS to S3 using Glue
I was trying AWS Glue to migrate our current data pipeline from Python scripts to AWS Glue. I was able to set up a crawler to pull the schema for the different Postgres databases. However, I am facing issues in pulling data from Postgres RDS to S3…

Eshank Jain
- 169
- 3
- 14
5
votes
2 answers
Is dvc.yaml supposed to be written or generated by dvc run command?
I am trying to understand dvc. Most tutorials mention generating dvc.yaml by running the dvc run command.
But at the same time, dvc.yaml, which defines the DAG, is also well documented. Also the fact that it is a yaml format and human readable/writable…

rajeshnair
- 1,587
- 16
- 32
5
votes
2 answers
Unable to find the relevant tensor remote_handle: Op ID: 14738, Output num: 0
I am using a Colab Pro TPU instance for patch image classification.
I'm using tensorflow version 2.3.0.
When calling model.fit I get the following error: InvalidArgumentError: Unable to find the relevant tensor remote_handle: Op ID:…

Pooya448
- 63
- 4
5
votes
1 answer
Firehose datapipeline limitations
My use-case is as follows:
I have JSON data coming in which needs to be stored in S3 in parquet format. So far so good, I can create a schema in Glue and attach a "DataFormatConversionConfiguration" to my firehose stream. BUT the data is coming from…

Dexter
- 1,710
- 2
- 17
- 34
4
votes
1 answer
nested json from rest api to pyspark dataframe
I am trying to create a data pipeline where I request data from a REST API. The output is a nested JSON file, which is great. I want to read the JSON file into a pyspark dataframe. This works fine when I save the file locally and use the following…

Saifullah Babrak
- 53
- 1
- 6
4
votes
2 answers
Dataflow with python flex template - launcher timeout
I'm trying to run my Python dataflow job with a flex template. The job works fine locally when I run it with the direct runner (without a flex template); however, when I try to run it with a flex template, the job is stuck in "Queued" status for a while and then fails with…

Kazuki
- 1,462
- 14
- 34
4
votes
2 answers
Bulk add ttl column to dynamodb table
I have a use case where I need to add a ttl column to an existing table. Currently, this table has more than 2 billion records.
Is there an existing solution built around this? Or is EMR the path forward?

Vivek Goel
- 22,942
- 29
- 114
- 186
3
votes
1 answer
Data pipeline - Best approach to read data from network drive
Source: CSV files located on a shared drive (on-prem server). Access to this shared drive and folder is controlled using a security group.
Expectation: load CSV data into Google BigQuery table.
Is it possible to mount the network drive on Dataproc…

saravana ir
- 169
- 8
3
votes
2 answers
Window Functions in Apache Beam
Does anybody know how to perform a window function in Apache Beam (dataflow)?
Example:
Ex
ID  Sector   Country  Income
1   Liam     US       16133
2   Noah     BR       10184
3   Oliver   ITA      11119
4   Elijah   FRA      13256
5   William  GER      7722
6   James    AUS      9786
7…

Bruno Vitti
- 41
- 3
3
votes
1 answer
Is there a way in airflow where a Daily DAG is dependent on weekly (on weekends) DAG?
I have these DAGs: DAG_A (runs daily), DAG_B (runs Mon-Fri) and DAG_C (runs on Sat and Sun), where DAG_A is dependent on both DAG_B and DAG_C.
I tried setting the dependencies using ExternalTaskSensor, but every time my scheduler stops running and…

Lalitha
- 31
- 2