Questions tagged [partitioning]

Partitioning is a performance strategy whereby you divide possibly very large groups of data into some number of smaller groups of data.

The expectation is that, for algorithms whose cost grows faster than linearly in N, the total time it takes to process the smaller groups and combine the results is still less than the time it would take to process the one larger set of data.
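As a rough illustration of that expectation, here is a minimal cost-model sketch. It assumes a quadratic algorithm and a combine step that is linear in N; both are assumptions made for the example, and the function names `quadratic_cost` and `partitioned_cost` are hypothetical.

```python
def quadratic_cost(n: int) -> int:
    """Abstract cost (in 'operations') of an O(N^2) algorithm on n items."""
    return n * n


def partitioned_cost(n: int, parts: int, combine_cost_per_item: int = 1) -> int:
    """Cost of running the quadratic algorithm on each partition separately,
    plus an assumed linear cost for combining the partial results."""
    # Spread n items over `parts` partitions as evenly as possible.
    sizes = [n // parts + (1 if i < n % parts else 0) for i in range(parts)]
    return sum(quadratic_cost(size) for size in sizes) + combine_cost_per_item * n


n = 1_000_000
whole = quadratic_cost(n)         # one large group
split = partitioned_cost(n, 100)  # 100 groups of 10,000 items each

print(f"unpartitioned: {whole:,} operations")
print(f"partitioned:   {split:,} operations")
```

Under these assumptions the partitioned run costs roughly 1% of the unpartitioned one. The gain shrinks or disappears if the algorithm is already linear or if combining the results is expensive.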

List partitioning is similar to range partitioning in many ways. As in partitioning by RANGE, each partition must be explicitly defined.

3138 questions
171
votes
13 answers

Is Zookeeper a must for Kafka?

In Kafka, I would like to use only a single broker, single topic and a single partition having one producer and multiple consumers (each consumer getting its own copy of data from the broker). Given this, I do not want the overhead of using…
Paaji
  • 2,139
  • 4
  • 14
  • 11
145
votes
5 answers

How to define partitioning of DataFrame?

I've started using Spark SQL and DataFrames in Spark 1.4.0. I'm wanting to define a custom partitioner on DataFrames, in Scala, but not seeing how to do this. One of the data tables I'm working with contains a list of transactions, by account,…
rake
  • 2,348
  • 3
  • 15
  • 11
86
votes
3 answers

How does HashPartitioner work?

I read up on the documentation of HashPartitioner. Unfortunately nothing much was explained except for the API calls. I am under the assumption that HashPartitioner partitions the distributed set based on the hash of the keys. For example if my data…
Sohaib
  • 4,556
  • 8
  • 40
  • 68
71
votes
17 answers

Efficient way to divide a list into lists of n size

I have an ArrayList, which I want to divide into smaller List objects of n size, and perform an operation on each. My current method of doing this is implemented with ArrayList objects in Java. Any pseudocode will do. for (int i = 1; i <=…
Rowhawn
  • 1,409
  • 1
  • 16
  • 25
69
votes
5 answers

Pandas: Sampling a DataFrame

I'm trying to read a fairly large CSV file with Pandas and split it up into two random chunks, one of which being 10% of the data and the other being 90%. Here's my current attempt: rows = data.index row_count =…
Blender
  • 289,723
  • 53
  • 439
  • 496
68
votes
3 answers

What is MYSQL Partitioning?

I have read the documentation (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), but I would like, in your own words, what it is and why it is used. Is it mainly used for multiple servers so it doesn't drag down one server? So, part of…
TIMEX
  • 259,804
  • 351
  • 777
  • 1,080
58
votes
3 answers

Handling very large data with mysql

Sorry for the long post! I have a database containing ~30 tables (InnoDB engine). Only two of these tables, namely, "transaction" and "shift" are quite large (the first one have 1.5 million rows and shift has 23k rows). Now everything works fine and…
mOna
  • 2,341
  • 9
  • 36
  • 60
50
votes
8 answers

MySQL Partitioning / Sharding / Splitting - which way to go?

We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60 % of the data belong to a single table. Currently the database is working quite well as we have a server with 64 GB of…
sme
  • 5,673
  • 7
  • 32
  • 30
46
votes
7 answers

LINQ Partition List into Lists of 8 members

How would one take a List (using LINQ) and break it into a List of Lists partitioning the original list on every 8th entry? I imagine something like this would involve Skip and/or Take, but I'm still pretty new to LINQ. Edit: Using C# / .Net…
Pretzel
  • 8,141
  • 16
  • 59
  • 84
43
votes
3 answers

How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this: dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path) As mentioned in…
jaywilson
  • 431
  • 1
  • 4
  • 5
39
votes
1 answer

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is: The Spark Driver node (sparkDriverCount) The number of worker nodes available…
smeeb
  • 27,777
  • 57
  • 250
  • 447
35
votes
5 answers

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

A short recap of what happened. I am working with 71 million records (not much compared to billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my need. My table…
Legend
  • 113,822
  • 119
  • 272
  • 400
33
votes
3 answers

How to understand the dynamic programming solution in linear partitioning?

I'm struggling to understand the dynamic programming solution to linear partitioning problem. I am reading the The Algorithm Design Manual and the problem is described in section 8.5. I've read the section countless times but I'm just not getting…
Benedict Cohen
  • 11,912
  • 7
  • 55
  • 67
32
votes
4 answers

How many table partitions is too many in Postgres?

I'm partitioning a very large table that contains temporal data, and considering to what granularity I should make the partitions. The Postgres partition documentation claims that "large numbers of partitions are likely to increase query planning…
DNS
  • 37,249
  • 18
  • 95
  • 132
32
votes
1 answer

Avoid performance impact of a single partition mode in Spark window functions

My question is triggered by the use case of calculating the differences between consecutive rows in a spark dataframe. For example, I have: >>> df.show() +-----+----------+ |index| col1| +-----+----------+ | 0.0|0.58734024| | …