Questions tagged [data-ingestion]

248 questions
8
votes
1 answer

Spark pulling data into RDD, DataFrame, or Dataset

I'm trying to put into simple terms when Spark pulls data through the driver and when it doesn't need to. I have 3 questions. Let's say you have a 20 TB flat file stored in HDFS and from a driver…
uh_big_mike_boi
  • 3,350
  • 4
  • 33
  • 64
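A minimal PySpark sketch of the distinction the question above is asking about (the path and app name are made up): reading from HDFS produces a distributed DataFrame whose partitions are loaded by the executors, while collect() pulls every row back through the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-demo").getOrCreate()

# Executors read the file's HDFS blocks in parallel; only metadata
# passes through the driver (hypothetical path).
df = spark.read.text("hdfs:///data/huge_flat_file.txt")

# Distributed operations like count() run on the executors.
row_count = df.count()

# collect() ships every row to the driver -- with a 20 TB file this
# would exhaust driver memory, which is the case to avoid.
# rows = df.collect()
```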
7
votes
1 answer

Using Snowpipe - what's the best practice for loading small files, e.g. thousands of 4K files per day?

Questions: How much more expensive is it to load small files (e.g. 4K) using Snowpipe than, say, 16K, 500K or 1-10 MB (the recommended file size)? Note: this question implies it is more expensive to load small files rather than the recommended…
7
votes
3 answers

NiFi FlowFile Repository failed to update

I'm using Apache NiFi to ingest and preprocess some CSV files, but when it runs for a long time it always fails. The error is always the same: FlowFile Repository failed to update. Searching the logs, I always see this error: 2018-07-11…
Jpf
  • 73
  • 1
  • 5
7
votes
0 answers

What is slowing down my PostgreSQL bulk import?

Because it's so easy to install on Debian stable, I decided to use PostgreSQL 9.6 to build a data warehouse for some data I need to process. The first step is to load the data into the database with minimal transformations, mostly correcting some…
Rhymoid
  • 171
  • 4
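Not an answer to what is slowing this particular import down, but a sketch of the usual fast path for PostgreSQL bulk loads, assuming a hypothetical DSN, table, and CSV file: stream the file through COPY rather than issuing per-row INSERTs.

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl")  # hypothetical DSN
with conn, conn.cursor() as cur, open("raw_data.csv") as f:
    # COPY parses and writes rows server-side in a single statement,
    # avoiding per-row round trips and per-statement overhead.
    cur.copy_expert(
        "COPY staging_table FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
```

Dropping or deferring indexes and constraints on the target table during the load is the other common lever.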
7
votes
3 answers

Configure Elasticsearch sink in Apache Flume

This is my first time here, so sorry if I don't post correctly, and sorry for my bad English. I'm trying to configure Apache Flume with an Elasticsearch sink. Everything seems OK and it runs fine, but there are 2 warnings when I start an agent; the…
Lifestorm
  • 91
  • 1
  • 6
5
votes
1 answer

Is a ClickHouse Buffer table appropriate for real-time ingestion of many small inserts?

I am writing an application that plots financial data and interacts with a realtime feed of such data. Due to the nature of the task, live market data may be received very frequently in one-trade-at-a-time fashion. I am using the database locally…
WhiteStork
  • 385
  • 1
  • 15
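For context on what a Buffer table looks like, a sketch issued through the clickhouse-driver Python client; the table names, column list, and flush thresholds are assumptions, and an underlying MergeTree table named trades is presumed to exist. Inserts land in memory and are flushed to the target table once the time/row/byte thresholds are hit.

```python
from datetime import datetime
from clickhouse_driver import Client

client = Client(host="localhost")  # assumes a local ClickHouse instance

# Buffer(database, target_table, num_layers,
#        min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
client.execute("""
    CREATE TABLE IF NOT EXISTS trades_buffer AS trades
    ENGINE = Buffer(default, trades, 16, 10, 100, 10000, 1000000, 10000000, 100000000)
""")

# Small, frequent inserts go to the buffer table; ClickHouse flushes
# them to `trades` in larger batches in the background.
client.execute(
    "INSERT INTO trades_buffer (symbol, price, ts) VALUES",
    [("ABC", 101.5, datetime.utcnow())],
)
```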
5
votes
2 answers

AWS Timestream: Unable to ingest records into AWS Timestream

As you all know, AWS Timestream was made generally available last week. Since then, I have been trying to experiment with it and understand how it models and stores the data. I am facing an issue in ingesting records into Timestream. I…
ShwetaJ
  • 462
  • 1
  • 8
  • 32
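For reference, a hedged sketch of the record shape write_records expects via boto3; the database, table, dimensions, and measure here are invented. Malformed Time values or a MeasureValueType that does not match the measure are common reasons for RejectedRecordsException.

```python
import time
import boto3

client = boto3.client("timestream-write", region_name="us-east-1")

records = [{
    "Dimensions": [{"Name": "host", "Value": "server-1"}],
    "MeasureName": "cpu_utilization",
    "MeasureValue": "42.5",
    "MeasureValueType": "DOUBLE",
    "Time": str(int(time.time() * 1000)),  # epoch time as a string
    "TimeUnit": "MILLISECONDS",
}]

try:
    client.write_records(
        DatabaseName="sampleDB", TableName="metrics", Records=records
    )
except client.exceptions.RejectedRecordsException as err:
    # The service reports a rejection reason, useful for debugging.
    print(err.response["Error"]["Message"])
```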
5
votes
2 answers

Where to store shared cache objects in Cloud Run?

I am creating a data ingestion pipeline using Cloud Run. My Cloud Run API gets called every time a file is dropped in a GCS bucket via Pub/Sub. I need to load some metadata that contains text for the data I am ingesting. This metadata changes…
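One pattern this points toward, sketched with assumed bucket and object names: load the metadata once into a module-level global so that warm Cloud Run instances reuse it instead of re-reading GCS on every request.

```python
import json
from google.cloud import storage

_metadata_cache = None  # reused across requests within one container instance


def get_metadata():
    """Return the metadata dict, loading it from GCS on first use."""
    global _metadata_cache
    if _metadata_cache is None:
        bucket = storage.Client().bucket("my-config-bucket")  # assumed bucket
        blob = bucket.blob("ingestion/metadata.json")         # assumed object
        _metadata_cache = json.loads(blob.download_as_bytes())
    return _metadata_cache
```

Note that each container instance keeps its own copy, so if the cache must be shared and refreshed consistently, an external store (e.g. Memorystore) or a TTL-based reload is needed.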
5
votes
2 answers

How do you ingest Spring Boot logs directly into Elasticsearch?

I'm investigating the feasibility of sending Spring Boot application logs directly into Elasticsearch, without using Filebeat or Logstash. I believe the Ingest plugin may help with this. My initial thoughts are to do this using Logback over TCP. …
Robbo_UK
  • 11,351
  • 25
  • 81
  • 117
4
votes
3 answers

Data from Event Hub not getting populated in ADX database

I created a sample application to send events to the Event Hub, which subsequently sends data to the Azure Data Explorer database. I can see the events appearing in the Event Hub, but the same data is not getting ingested into the Azure Data Explorer…
4
votes
1 answer

Pandas: Merge two data frames and keep non-intersecting data from a single data frame

Desire: I want a way to merge two data frames and keep the non-intersecting data from a specified data frame. Problem: I have duplicate data and I expected this line to remove that duplicate data: final_df =…
Brian Bruggeman
  • 5,008
  • 2
  • 36
  • 55
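A small sketch of one way to keep the non-intersecting rows from a chosen frame, using made-up example frames since the question's actual columns are not shown: merge with indicator=True and keep the rows marked left_only.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4]})

# indicator=True adds a _merge column recording where each row came from.
merged = left.merge(right, on="id", how="left", indicator=True)

# Keep only the rows that exist in `left` but not in `right`.
final_df = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(final_df)
```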
3
votes
2 answers

InfluxDB 2.0 Killed by OOM

I am very new to InfluxDB. Initially I installed version 1.8 but later upgraded to v2.0. I am treating this as an out-of-the-box approach for now; I was able to set up insertion into InfluxDB using…
droid 001
  • 31
  • 4
3
votes
2 answers

How can I read a table in a loosely structured text file into a data frame in R?

Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file…
ulfelder
  • 5,305
  • 1
  • 22
  • 40
3
votes
2 answers

Cannot find class 'org.apache.hadoop.hive.druid.DruidStorageHandler'

The JAR file for the Druid Hive handler is there, and the Clients table already exists in Hive with data. The filename in the Hive library folder is hive-druid-handler-3.1.2.jar. I am getting the error when I try to create a table in Hive for Druid: FAILED:…
Vishnu
  • 93
  • 1
  • 5
3
votes
2 answers

Google data fusion Execution error "INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0."

I am trying to load a simple CSV file from GCS to BQ using the Google Data Fusion free version. The pipeline is failing with an error that reads: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient…