Highest Voted 'data-pipeline' Questions

28

votes

4 answers

Feeding .npy (numpy files) into tensorflow data pipeline

TensorFlow seems to lack a reader for ".npy" files. How can I read my data files into the new tensorflow.data.Dataset pipeline? My data doesn't fit in memory. Each object is saved in a separate ".npy" file. Each file contains 2 different ndarrays as…

asked Feb 20 '18 at 16:08

Sluggish Crow

383
1
3
5

21

votes

3 answers

How to access the response from Airflow SimpleHttpOperator GET request

I'm learning Airflow and have a simple question. Below is my DAG called dog_retriever: import airflow from airflow import DAG from airflow.operators.http_operator import SimpleHttpOperator from airflow.operators.sensors import HttpSensor from…

airflow data-pipeline

asked Oct 10 '17 at 21:39

Rachel Lanman

499
1
5
15

15

votes

1 answer

Is it possible to write a luigi wrapper task that tolerates failed sub-tasks?

I have a luigi task that performs some non-stable computations. Think of an optimization process that sometimes does not converge. import luigi MyOptimizer(luigi.Task): input_param: luigi.Parameter() output_filename =…

python error-handling dataflow luigi data-pipeline

asked May 04 '20 at 16:04

DalyaG

2,979
2
16
19

12

votes

1 answer

Implementing luigi dynamic graph configuration

I am new to luigi and came across it while designing a pipeline for our ML efforts. Though it wasn't suited to my particular use case, it had so many extra features that I decided to make it fit. Basically what I was looking for was a way to be able to…

python python-3.x luigi data-pipeline

asked Jun 26 '18 at 15:51

Veltzer Doron

934
2
10
31

9

votes

1 answer

Truncate DynamoDb or rewrite data via Data Pipeline

It is possible to dump DynamoDB via Data Pipeline and also to import data into DynamoDB. The import works, but the data is always appended to the data that already exists in DynamoDB. So far I have found working examples that scan DynamoDB and delete items one…

amazon-dynamodb truncate amazon-data-pipeline data-pipeline

asked Feb 17 '17 at 16:04

Vladimir Gilevich

861
1
10
17

6

votes

1 answer

Pipeline from AWS RDS to S3 using Glue

I was trying to migrate our current data pipeline from Python scripts to AWS Glue. I was able to set up a crawler to pull the schema for the different Postgres databases. However, I am facing issues in pulling data from Postgres RDS to S3…

amazon-s3 amazon-rds amazon-athena aws-glue data-pipeline

asked Dec 11 '18 at 03:54

Eshank Jain

169
3
14

5

votes

2 answers

Is dvc.yaml supposed to be written or generated by dvc run command?

I am trying to understand dvc. Most tutorials mention generating dvc.yaml by running the dvc run command. But at the same time, dvc.yaml, which defines the DAG, is also well documented. Also the fact that it is a yaml format and human readable/writable…

directed-acyclic-graphs data-pipeline dvc

asked Jun 16 '21 at 14:19

rajeshnair

1,587
16
32

5

votes

2 answers

Unable to find the relevant tensor remote_handle: Op ID: 14738, Output num: 0

I am using a Colab Pro TPU instance for patch image classification, with TensorFlow version 2.3.0. When calling model.fit I get the following error: InvalidArgumentError: Unable to find the relevant tensor remote_handle: Op ID:…

keras google-colaboratory tensorflow-datasets tpu data-pipeline

asked Nov 10 '20 at 15:30

Pooya448

63
4

5

votes

1 answer

Firehose datapipeline limitations

My use-case is as follows: I have JSON data coming in which needs to be stored in S3 in parquet format. So far so good, I can create a schema in Glue and attach a "DataFormatConversionConfiguration" to my firehose stream. BUT the data is coming from…

amazon-web-services bigdata amazon-kinesis-firehose data-pipeline

asked Apr 02 '19 at 14:29

Dexter

1,710
2
17
34

4

votes

1 answer

nested json from rest api to pyspark dataframe

I am trying to create a data pipeline where I request data from a REST API. The output is a nested JSON file, which is great. I want to read the JSON file into a PySpark dataframe. This works fine when I save the file locally and use the following…

python apache-spark pyspark apache-spark-sql data-pipeline

asked Jul 07 '21 at 14:03

Saifullah Babrak

53
1
6

4

votes

2 answers

Dataflow with python flex template - launcher timeout

I'm trying to run my Python Dataflow job with a flex template. The job works fine locally when I run it with the direct runner (without a flex template). However, when I try to run it with a flex template, the job is stuck in "Queued" status for a while and then fails with…

google-cloud-platform google-cloud-dataflow apache-beam data-pipeline

asked Nov 13 '20 at 00:14

Kazuki

1,462
14
34

4

votes

2 answers

Bulk add ttl column to dynamodb table

I have a use case where I need to add a TTL column to an existing table. Currently, this table has more than 2 billion records. Is there an existing solution built around this, or is EMR the path forward?

amazon-dynamodb emr amazon-emr amazon-data-pipeline data-pipeline

asked Feb 19 '18 at 22:15

Vivek Goel

22,942
29
114
186
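For the TTL question above: DynamoDB's TTL feature expects a Number attribute holding the expiry time in epoch seconds, so whatever tool performs the bulk update has to compute that value per item. The following is only a minimal sketch of that per-item step in plain Python. The function names, the `ttl` attribute name, and the idea of deriving expiry from a creation timestamp plus a retention period are assumptions for illustration; the sketch does not talk to DynamoDB or EMR.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch_seconds(created_at: datetime, keep_days: int) -> int:
    """Return the expiry time as whole seconds since the Unix epoch."""
    if created_at.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        created_at = created_at.replace(tzinfo=timezone.utc)
    return int((created_at + timedelta(days=keep_days)).timestamp())


def with_ttl(item: dict, created_at: datetime, keep_days: int,
             attr: str = "ttl") -> dict:
    """Return a copy of the item with the (hypothetical) TTL attribute added."""
    updated = dict(item)  # leave the caller's dict untouched
    updated[attr] = ttl_epoch_seconds(created_at, keep_days)
    return updated
```

A bulk job would apply something like `with_ttl` to each scanned item before writing it back; how the scan and write are parallelized over billions of records is the open part of the question.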

3

votes

1 answer

Data pipeline - Best approach to read data from network drive

Source: CSV files located in a shared drive (on-prem server). Access to this shared drive and folder is controlled using a security group. Expectation: load the CSV data into a Google BigQuery table. Is it possible to mount the network drive on Dataproc…

apache-spark google-cloud-platform google-cloud-storage google-cloud-dataproc data-pipeline

asked Oct 09 '22 at 07:05

saravana ir

169
8

3

votes

2 answers

Window Functions in Apache Beam

Does anybody know how to perform a window function in Apache Beam (Dataflow)? Example: Ex ID Sector Country Income 1 Liam US 16133 2 Noah BR 10184 3 Oliver ITA 11119 4 Elijah FRA 13256 5 William GER 7722 6 James AUS 9786 7…

google-cloud-platform bigdata apache-beam dataflow data-pipeline

asked Nov 09 '21 at 20:56

Bruno Vitti

41
3

3

votes

1 answer

Is there a way in airflow where a Daily DAG is dependent on weekly (on weekends) DAG?

I have these DAGs: DAG_A (runs daily), DAG_B (runs Mon-Fri) and DAG_C (runs on Sat and Sun), where DAG_A is dependent on both DAG_B and DAG_C. I tried setting the dependencies using ExternalTaskSensor, but every time my scheduler stops running and…

python-3.x airflow data-pipeline

asked May 29 '20 at 13:19

Lalitha

31
2
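For the Airflow question above, one way to read the schedules is that on any given day only one of the two upstream DAGs has a run, so the daily DAG only needs to wait on that one. The sketch below shows just that date logic in plain Python. The function name is hypothetical, and the reading that DAG_A should wait for DAG_B on weekdays and DAG_C on weekends is an assumption about the asker's intent; wiring the result into sensors or a branching step is not shown.

```python
from datetime import date


def upstream_dag_for(run_date: date) -> str:
    """Pick the upstream DAG that is scheduled on the given date."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    if run_date.weekday() < 5:
        return "DAG_B"  # Monday to Friday
    return "DAG_C"      # Saturday and Sunday


if __name__ == "__main__":
    for day in (date(2020, 5, 29), date(2020, 5, 30)):
        print(day.isoformat(), upstream_dag_for(day))
```

Keeping the weekday rule in one small function makes it easy to unit test independently of the scheduler.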

Questions tagged [data-pipeline]