Questions tagged [partitioning]

Partitioning is a performance strategy whereby you divide possibly very large groups of data into some number of smaller groups of data.

The expectation is that, for algorithms of order greater than N (that is, whose cost grows faster than linearly with input size), the total time to process the smaller groups and combine the results is still less than the time it would take to process the one larger set of data.
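A minimal sketch of that idea, assuming a quadratic insertion sort as a stand-in for any algorithm of order greater than N. The names `insertion_sort` and `partitioned_sort` are hypothetical and chosen only for illustration. Sorting k partitions of about N/k items each costs roughly N²/k comparisons in total, and the sorted partitions are then combined with a merge.

```python
import heapq


def insertion_sort(items):
    """Quadratic-time sort, standing in for any algorithm of order greater than N."""
    result = list(items)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # Shift larger elements one slot to the right.
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result


def partitioned_sort(items, partition_count):
    """Split items into partitions, sort each one, then merge the sorted results."""
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    items = list(items)
    size = max(1, -(-len(items) // partition_count))  # ceiling division
    partitions = [items[i:i + size] for i in range(0, len(items), size)]
    sorted_parts = [insertion_sort(part) for part in partitions]
    # heapq.merge combines already-sorted inputs into one sorted stream.
    return list(heapq.merge(*sorted_parts))


print(partitioned_sort([9, 4, 7, 1, 8, 2, 6, 3, 5, 0], 3))
```

Whether this pays off in practice depends on the cost of the combine step and on how evenly the data splits.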

List partitioning is similar to range partitioning in many ways. As in partitioning by RANGE, each partition must be explicitly defined.

3138 questions

171

votes

13 answers

Is Zookeeper a must for Kafka?

In Kafka, I would like to use only a single broker, single topic and a single partition having one producer and multiple consumers (each consumer getting its own copy of data from the broker). Given this, I do not want the overhead of using…

asked May 20 '14 at 05:31

Paaji

2,139
4
14
11

145

votes

5 answers

How to define partitioning of DataFrame?

I've started using Spark SQL and DataFrames in Spark 1.4.0. I'm wanting to define a custom partitioner on DataFrames, in Scala, but not seeing how to do this. One of the data tables I'm working with contains a list of transactions, by account,…

scala apache-spark dataframe apache-spark-sql partitioning

asked Jun 23 '15 at 06:48

rake

2,348
3
15
11

votes

3 answers

How does HashPartitioner work?

I read up on the documentation of HashPartitioner. Unfortunately nothing much was explained except for the API calls. I am under the assumption that HashPartitioner partitions the distributed set based on the hash of the keys. For example if my data…
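The assumption stated in this question (partitioning on the hash of the keys) can be sketched in plain Python. This is an illustration of the general technique only, not Spark's HashPartitioner code; `hash_partition` and `partition_pairs` are hypothetical names, and Python's built-in `hash` stands in for whatever hash function a real system uses.

```python
from collections import defaultdict


def hash_partition(key, num_partitions):
    """Map a key to a partition index in the range [0, num_partitions)."""
    # Python's % returns a non-negative result for a positive modulus,
    # so negative hash values still map to a valid partition index.
    return hash(key) % num_partitions


def partition_pairs(pairs, num_partitions):
    """Group (key, value) pairs by the partition their key hashes to."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[hash_partition(key, num_partitions)].append((key, value))
    return dict(buckets)


pairs = [(1, "a"), (5, "b"), (2, "c")]
print(partition_pairs(pairs, 4))
```

Equal keys always land in the same partition, which is the property that key-based operations rely on. The distribution is only as even as the hash of the keys.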

scala apache-spark rdd partitioning

asked Jul 15 '15 at 07:46

Sohaib

4,556
8
40
68

votes

17 answers

Efficient way to divide a list into lists of n size

I have an ArrayList, which I want to divide into smaller List objects of n size, and perform an operation on each. My current method of doing this is implemented with ArrayList objects in Java. Any pseudocode will do. for (int i = 1; i <=…
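Since the question accepts pseudocode, one possible shape of the answer is sketched here in Python rather than Java. The function name `chunks` is hypothetical; the last sublist is shorter when the list length is not a multiple of n.

```python
def chunks(items, n):
    """Return consecutive sublists of at most n elements each."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Step through the list n elements at a time and slice out each piece.
    return [items[i:i + n] for i in range(0, len(items), n)]


for part in chunks([1, 2, 3, 4, 5], 2):
    print(part)
```

Each slice is a copy, so an operation applied to one sublist does not modify the original list.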

java arraylist partitioning

asked Apr 28 '11 at 20:49

Rowhawn

1,409
1
16
25

votes

5 answers

Pandas: Sampling a DataFrame

I'm trying to read a fairly large CSV file with Pandas and split it up into two random chunks, one of which being 10% of the data and the other being 90%. Here's my current attempt: rows = data.index row_count =…

python partitioning pandas

asked Aug 30 '12 at 06:12

Blender

289,723
53
439
496

votes

3 answers

What is MYSQL Partitioning?

I have read the documentation (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), but I would like, in your own words, what it is and why it is used. Is it mainly used for multiple servers so it doesn't drag down one server? So, part of…

mysql database partitioning

asked Oct 16 '09 at 19:23

TIMEX

259,804
351
777
1,080

votes

3 answers

Handling very large data with mysql

Sorry for the long post! I have a database containing ~30 tables (InnoDB engine). Only two of these tables, namely "transaction" and "shift", are quite large (the first one has 1.5 million rows and shift has 23k rows). Now everything works fine and…

mysql database performance indexing partitioning

asked Sep 26 '16 at 10:23

mOna

2,341
9
36
60

votes

8 answers

MySQL Partitioning / Sharding / Splitting - which way to go?

We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60 % of the data belong to a single table. Currently the database is working quite well as we have a server with 64 GB of…

mysql partitioning database-performance sharding

asked Sep 05 '08 at 13:59

sme

5,673
7
32
30

votes

7 answers

LINQ Partition List into Lists of 8 members

How would one take a List (using LINQ) and break it into a List of Lists partitioning the original list on every 8th entry? I imagine something like this would involve Skip and/or Take, but I'm still pretty new to LINQ. Edit: Using C# / .Net…

linq partitioning skip take

asked Sep 22 '10 at 20:34

Pretzel

8,141
16
59
84

votes

3 answers

How to partition and write DataFrame in Spark without deleting partitions with no new data?

I am trying to save a DataFrame to HDFS in Parquet format using DataFrameWriter, partitioned by three column values, like this: dataFrame.write.mode(SaveMode.Overwrite).partitionBy("eventdate", "hour", "processtime").parquet(path) As mentioned in…

apache-spark apache-spark-sql partitioning parquet

asked Feb 18 '17 at 16:32

jaywilson

votes

1 answer

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there is: The Spark Driver node (sparkDriverCount) The number of worker nodes available…

apache-spark apache-spark-sql distributed-computing partitioning bigdata

asked Sep 08 '16 at 00:57

smeeb

27,777
57
250
447

votes

5 answers

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

A short recap of what happened. I am working with 71 million records (not much compared to billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my need. My table…

mysql database database-design partitioning

asked Sep 12 '10 at 17:23

Legend

113,822
119
272
400

votes

3 answers

How to understand the dynamic programming solution in linear partitioning?

I'm struggling to understand the dynamic programming solution to the linear partitioning problem. I am reading The Algorithm Design Manual and the problem is described in section 8.5. I've read the section countless times but I'm just not getting…
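For orientation, a minimal sketch of one common formulation of the problem, assuming the goal is to split a sequence of non-negative numbers into at most k contiguous ranges so that the largest range sum is as small as possible. The function name `linear_partition_cost` and the table layout are hypothetical and not taken from the book.

```python
def linear_partition_cost(values, k):
    """Smallest possible largest-range sum when values are cut into k contiguous ranges."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(values)
    if n == 0:
        return 0
    k = min(k, n)  # more ranges than elements cannot help

    # prefix[i] is the sum of the first i values.
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)

    inf = float("inf")
    # best[j][i]: optimal cost of splitting the first i values into j ranges.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    for i in range(1, n + 1):
        best[1][i] = prefix[i]

    for j in range(2, k + 1):
        for i in range(j, n + 1):
            # The last range covers values[split:i]; the rest uses j - 1 ranges.
            for split in range(j - 1, i):
                cost = max(best[j - 1][split], prefix[i] - prefix[split])
                if cost < best[j][i]:
                    best[j][i] = cost
    return best[k][n]


print(linear_partition_cost([1, 2, 3, 4, 5, 6, 7, 8, 9], 3))
```

The table has k·n entries and each takes up to n candidate split points, so this sketch runs in O(k·n²) time.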

algorithm partitioning dynamic-programming

asked Oct 29 '11 at 12:08

Benedict Cohen

11,912
7
55
67

votes

4 answers

How many table partitions is too many in Postgres?

I'm partitioning a very large table that contains temporal data, and considering to what granularity I should make the partitions. The Postgres partition documentation claims that "large numbers of partitions are likely to increase query planning…

performance postgresql partitioning

asked May 24 '11 at 01:10

DNS

37,249
18
95
132

votes

1 answer

Avoid performance impact of a single partition mode in Spark window functions

My question is triggered by the use case of calculating the differences between consecutive rows in a spark dataframe. For example, I have: >>> df.show() +-----+----------+ |index| col1| +-----+----------+ | 0.0|0.58734024| | …

apache-spark pyspark apache-spark-sql partitioning window-functions

asked Dec 24 '16 at 13:00

Ytsen de Boer

2,797
2
25
36
