Data partitioning deals with the dividing of a collection of data into smaller collections of data for the purpose of faster processing, easier statistics gathering and smaller memory/persistence footprint.
Questions tagged [data-partitioning]
337 questions
80
votes
14 answers
python equivalent of filter() getting two output lists (i.e. partition of a list)
Let's say I have a list, and a filtering function. Using something like
>>> filter(lambda x: x > 10, [1,4,12,7,42])
[12, 42]
I can get the elements matching the criterion. Is there a function I could use that would output two lists, one of elements…

F'x
- 12,105
- 7
- 71
- 123
71
votes
3 answers
Difference between df.repartition and DataFrameWriter partitionBy?
What is the difference between DataFrame repartition() and DataFrameWriter partitionBy() methods?
I hope both are used to "partition data based on dataframe column"? Or is there any difference?

Shankar
- 8,529
- 26
- 90
- 159
49
votes
11 answers
C# - elegant way of partitioning a list?
I'd like to partition a list into a list of lists, by specifying the number of elements in each partition.
For instance, suppose I have the list {1, 2, ... 11}, and would like to partition it such that each set has 4 elements, with the last set…

David Hodgson
- 10,104
- 17
- 56
- 77
35
votes
6 answers
What is the best way to divide a collection into 2 different collections?
I have a Set of numbers :
Set mySet = [ 1,2,3,4,5,6,7,8,9]
I want to divide it into 2 sets of odds and evens.
My way was to use filter twice :
Set set1 = mySet.stream().filter(y -> y % 2 ==…

user1386966
- 3,302
- 13
- 43
- 72
21
votes
5 answers
Create grouping variable for consecutive sequences and split vector
I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30) and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence, i.e. an increase by 1, in a ragged vector resulting in:
L1: 1
L2: 3,4,5
L3: 9,10
L4:…

letsrock
- 211
- 2
- 3
21
votes
2 answers
Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?
I have a large JSON file with I'm guessing 4 million objects. Each top level has a few levels nested inside. I want to split that into multiple files of 10000 top level objects each (retaining the structure inside each). jq should be able to do…

Chaz
- 787
- 2
- 9
- 16
18
votes
7 answers
QuickSort and Hoare Partition
I have a hard time translating QuickSort with Hoare partitioning into C code, and can't find out why. The code I'm using is shown below:
void QuickSort(int a[],int start,int end) {
int q=HoarePartition(a,start,end);
if (end<=start) return;
…

Ofek Ron
- 8,354
- 13
- 55
- 103
17
votes
2 answers
Querying Windows Azure Table Storage with multiple query criteria
I'm trying to query a table in Windows Azure storage and was initially using the TableQuery.CombineFilters in the TableQuery().Where function as follows:
TableQuery.CombineFilters(
TableQuery.GenerateFilterCondition("PartitionKey",…

Captain John
- 1,859
- 2
- 16
- 30
13
votes
5 answers
How to sort an integer array into negative, zero, positive part without changing relative position?
Give an O(n) algorithm which takes as input an array S, then divides S into three sets: negatives, zeros, and positives. Show how to implement this in place, that is, without allocating new memory. And you have to keep the number's relative…

Gin
- 1,763
- 3
- 12
- 17
11
votes
1 answer
What is the difference between partitioning and bucketing in Spark?
I try to optimize a join query between two spark dataframes, let's call them df1, df2 (join on common column "SaleId").
df1 is very small (5M) so I broadcast it among the nodes of the spark cluster.
df2 is very large (200M rows) so I tried to…

nofar mishraki
- 526
- 1
- 4
- 15
11
votes
4 answers
How to write SQL query that selects distinct pair values for specific criteria?
I'm having trouble formulating a query for the following problem:
For pair values that have a certain score, how do you group them in way that will only return distinct pair values with the best respective scores?
For example, lets say I have a…

Stephen Tableau
- 113
- 5
10
votes
5 answers
3D clustering Algorithm
Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points which has largest number of neighbors within given distance R. Another condition is that the distance between any…

Teng Lin
- 129
- 1
- 1
- 6
10
votes
2 answers
Hashing VS Indexing
Both hashing and indexing are use to partition data on some pre- defined formula. But I am unable to understand the key difference between the two.
As in hashing we are dividing the data on the basis of some key value pair, similarly in Indexing…

coolDude
- 407
- 1
- 7
- 17
10
votes
2 answers
partitioning an float array into similar segments (clustering)
I have an array of floats like this:
[1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200]
Now, I want to partition the array like this:
[[1.91, 2.87, 3.61] , [10.91, 11.91, 12.82] , [100.73, 100.71, 101.89] , [200]]
// [200] will…

alessandro
- 1,681
- 10
- 33
- 54
9
votes
4 answers
python: Generating integer partitions
I need to generate all the partitions of a given integer.
I found this algorithm by Jerome Kelleher for which it is stated to be the most efficient one:
def accelAsc(n):
a = [0 for i in range(n + 1)]
k = 1
a[0] = 0
y = n - 1
…

etuardu
- 5,066
- 3
- 46
- 58