Questions tagged [data-ingestion]

248 questions
8
votes
1 answer

Spark pulling data into RDD, DataFrame, or Dataset

I'm trying to put into simple terms when Spark pulls data through the driver and when it doesn't need to. I have 3 questions. Let's say you have a 20 TB flat file stored in HDFS and from a driver…
uh_big_mike_boi
  • 3,350
  • 4
  • 33
  • 64
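A minimal PySpark sketch of the distinction the question above is asking about (the path and app name are made up): reading from HDFS produces a distributed DataFrame whose partitions are loaded by the executors, while collect() pulls every row back through the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-demo").getOrCreate()

# Executors read the file's HDFS blocks in parallel; only metadata
# passes through the driver (hypothetical path).
df = spark.read.text("hdfs:///data/huge_flat_file.txt")

# Distributed operations like count() run on the executors.
row_count = df.count()

# collect() ships every row to the driver -- with a 20 TB file this
# would exhaust driver memory, which is the case to avoid.
# rows = df.collect()
```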
7
votes
1 answer

Using Snowpipe - what's the best practice for loading small files, e.g. thousands of 4K files per day?

Questions: How much more expensive is it to load small files (e.g. 4K) using Snowpipe than, say, 16K, 500K or 1-10 MB (the recommended file size)? Note: this question implies it is more expensive to load small files rather than the recommended…
7
votes
3 answers

NiFi FlowFile Repository failed to update

I'm using Apache NiFi to ingest and preprocess some CSV files, but when it runs for a long time it always fails. The error is always the same: FlowFile Repository failed to update. Searching the logs, I always see this error: 2018-07-11…
Jpf
  • 73
  • 1
  • 5
7
votes
0 answers

What is slowing down my PostgreSQL bulk import?

Because it's so easy to install on Debian stable, I decided to use PostgreSQL 9.6 to build a data warehouse for some data I need to process. The first step is to load the data into the database with minimal transformations, mostly correcting some…
Rhymoid
  • 171
  • 4
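Not an answer to what is slowing this particular import down, but a sketch of the usual fast path for PostgreSQL bulk loads, assuming a hypothetical DSN, table, and CSV file: stream the file through COPY rather than issuing per-row INSERTs.

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=etl")  # hypothetical DSN
with conn, conn.cursor() as cur, open("raw_data.csv") as f:
    # COPY parses and writes rows server-side in a single statement,
    # avoiding per-row round trips and per-statement overhead.
    cur.copy_expert(
        "COPY staging_table FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
```

Dropping or deferring indexes and constraints on the target table during the load is the other common lever.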
7
votes
3 answers

Configure Elasticsearch sink in Apache Flume

This is my first time here, so sorry if I don't post correctly, and sorry for my bad English. I'm trying to configure Apache Flume with an Elasticsearch sink. Everything seems OK and it runs fine, but there are 2 warnings when I start an agent; the…
Lifestorm
  • 91
  • 1
  • 6
5
votes
1 answer

Is a ClickHouse Buffer table appropriate for real-time ingestion of many small inserts?

I am writing an application that plots financial data and interacts with a realtime feed of such data. Due to the nature of the task, live market data may be received very frequently in one-trade-at-a-time fashion. I am using the database locally…
WhiteStork
  • 385
  • 1
  • 15
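For context on what a Buffer table looks like, a sketch issued through the clickhouse-driver Python client; the table names, column list, and flush thresholds are assumptions, and an underlying MergeTree table named trades is presumed to exist. Inserts land in memory and are flushed to the target table once the time/row/byte thresholds are hit.

```python
from datetime import datetime
from clickhouse_driver import Client

client = Client(host="localhost")  # assumes a local ClickHouse instance

# Buffer(database, target_table, num_layers,
#        min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
client.execute("""
    CREATE TABLE IF NOT EXISTS trades_buffer AS trades
    ENGINE = Buffer(default, trades, 16, 10, 100, 10000, 1000000, 10000000, 100000000)
""")

# Small, frequent inserts go to the buffer table; ClickHouse flushes
# them to `trades` in larger batches in the background.
client.execute(
    "INSERT INTO trades_buffer (symbol, price, ts) VALUES",
    [("ABC", 101.5, datetime.utcnow())],
)
```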
5
votes
2 answers

AWS Timestream: Unable to ingest records into AWS Timestream

As you all know, AWS Timestream was made generally available last week. Since then, I have been trying to experiment with it and understand how it models and stores the data. I am facing an issue in ingesting records into Timestream. I…
ShwetaJ
  • 462
  • 1
  • 8
  • 32
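For reference, a hedged sketch of the record shape write_records expects via boto3; the database, table, dimensions, and measure here are invented. Malformed Time values or a MeasureValueType that does not match the measure are common reasons for RejectedRecordsException.

```python
import time
import boto3

client = boto3.client("timestream-write", region_name="us-east-1")

records = [{
    "Dimensions": [{"Name": "host", "Value": "server-1"}],
    "MeasureName": "cpu_utilization",
    "MeasureValue": "42.5",
    "MeasureValueType": "DOUBLE",
    "Time": str(int(time.time() * 1000)),  # epoch time as a string
    "TimeUnit": "MILLISECONDS",
}]

try:
    client.write_records(
        DatabaseName="sampleDB", TableName="metrics", Records=records
    )
except client.exceptions.RejectedRecordsException as err:
    # The service reports a rejection reason, useful for debugging.
    print(err.response["Error"]["Message"])
```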
5
votes
2 answers

Where to store shared cache objects in Cloud Run?

I am creating a data ingestion pipeline using Cloud Run. My Cloud Run API gets called every time a file is dropped in a GCS bucket via Pub/Sub. I need to load some metadata that contains text for the data I am ingesting. This metadata changes…
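One pattern this points toward, sketched with assumed bucket and object names: load the metadata once into a module-level global so that warm Cloud Run instances reuse it instead of re-reading GCS on every request.

```python
import json
from google.cloud import storage

_metadata_cache = None  # reused across requests within one container instance


def get_metadata():
    """Return the metadata dict, loading it from GCS on first use."""
    global _metadata_cache
    if _metadata_cache is None:
        bucket = storage.Client().bucket("my-config-bucket")  # assumed bucket
        blob = bucket.blob("ingestion/metadata.json")         # assumed object
        _metadata_cache = json.loads(blob.download_as_bytes())
    return _metadata_cache
```

Note that each container instance keeps its own copy, so if the cache must be shared and refreshed consistently, an external store (e.g. Memorystore) or a TTL-based reload is needed.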
5
votes
2 answers

How do you ingest Spring Boot logs directly into Elasticsearch?

I'm investigating the feasibility of sending Spring Boot application logs directly into Elasticsearch, without using Filebeat or Logstash. I believe the Ingest plugin may help with this. My initial thoughts are to do this using Logback over TCP. …
Robbo_UK
  • 11,351
  • 25
  • 81
  • 117
4
votes
3 answers

Data from Event Hub not getting populated in ADX database

I created a sample application to send events to the Event Hub, which subsequently sends data to the Azure Data Explorer database. I can see the events appearing in the Event Hub, but the same data is not getting ingested into the Azure Data Explorer…
4
votes
1 answer

Pandas: Merge two data frames and keep non-intersecting data from a single data frame

Desire: I want a way to merge two data frames and keep the non-intersecting data from a specified data frame. Problem: I have duplicate data and I expected this line to remove that duplicate data: final_df =…
Brian Bruggeman
  • 5,008
  • 2
  • 36
  • 55
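A small sketch of one way to keep the non-intersecting rows from a chosen frame, using made-up example frames since the question's actual columns are not shown: merge with indicator=True and keep the rows marked left_only.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4]})

# indicator=True adds a _merge column recording where each row came from.
merged = left.merge(right, on="id", how="left", indicator=True)

# Keep only the rows that exist in `left` but not in `right`.
final_df = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(final_df)
```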
3
votes
2 answers

InfluxDB 2.0 Killed by OOM

I am very new to InfluxDB. Initially I installed version 1.8 but later upgraded to v2.0. I am treating this as an out-of-the-box approach for now; I was able to set up insertion into InfluxDB using…
droid 001
  • 31
  • 4
3
votes
2 answers

How can I read a table in a loosely structured text file into a data frame in R?

Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file…
ulfelder
  • 5,305
  • 1
  • 22
  • 40
3
votes
2 answers

Cannot find class 'org.apache.hadoop.hive.druid.DruidStorageHandler'

The JAR file for the Druid Hive handler is there, and the Clients table already exists in Hive with data. The filename in the Hive library folder is hive-druid-handler-3.1.2.jar. I am getting the error when I try to create a table in Hive for Druid: FAILED:…
Vishnu
  • 93
  • 1
  • 5
3
votes
2 answers

Google data fusion Execution error "INVALID_ARGUMENT: Insufficient 'DISKS_TOTAL_GB' quota. Requested 3000.0, available 2048.0."

I am trying to load a simple CSV file from GCS to BQ using the Google Data Fusion free version. The pipeline is failing with an error that reads: com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Insufficient…