Questions tagged [data-partitioning]

Data partitioning is the division of a collection of data into smaller collections to enable faster processing, easier statistics gathering, and a smaller memory or persistence footprint.

337 questions
80
votes
14 answers

python equivalent of filter() getting two output lists (i.e. partition of a list)

Let's say I have a list, and a filtering function. Using something like >>> filter(lambda x: x > 10, [1,4,12,7,42]) [12, 42] I can get the elements matching the criterion. Is there a function I could use that would output two lists, one of elements…
F'x
  • 12,105
  • 7
  • 71
  • 123
71
votes
3 answers

Difference between df.repartition and DataFrameWriter partitionBy?

What is the difference between the DataFrame repartition() and DataFrameWriter partitionBy() methods? I assume both are used to "partition data based on dataframe column". Or is there any difference?
Shankar
  • 8,529
  • 26
  • 90
  • 159
49
votes
11 answers

C# - elegant way of partitioning a list?

I'd like to partition a list into a list of lists, by specifying the number of elements in each partition. For instance, suppose I have the list {1, 2, ... 11}, and would like to partition it such that each set has 4 elements, with the last set…
David Hodgson
  • 10,104
  • 17
  • 56
  • 77
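The fixed-size chunking described in the question above is asked about C#; the same idea can be sketched in Python with slicing. `chunks` is a hypothetical helper name, and the last chunk is simply shorter when the length is not a multiple of the size.

```python
def chunks(items, size):
    """Split a list into consecutive sublists of at most `size` elements.

    `chunks` is a hypothetical helper used only for illustration.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    # Step through the list in strides of `size`; slicing clamps at the end.
    return [items[i:i + size] for i in range(0, len(items), size)]


parts = chunks(list(range(1, 12)), 4)
```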
35
votes
6 answers

What is the best way to divide a collection into 2 different collections?

I have a Set of numbers: Set mySet = [1, 2, 3, 4, 5, 6, 7, 8, 9]. I want to divide it into 2 sets of odds and evens. My way was to use filter twice: Set set1 = mySet.stream().filter(y -> y % 2 ==…
user1386966
  • 3,302
  • 13
  • 43
  • 72
21
votes
5 answers

Create grouping variable for consecutive sequences and split vector

I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30) and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence, i.e. an increase by 1, in a ragged vector resulting in: L1: 1 L2: 3,4,5 L3: 9,10 L4:…
letsrock
  • 211
  • 2
  • 3
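The question above is about R, but the grouping rule it describes (start a new group whenever the next value is not exactly one more than the previous) can be sketched in Python. `consecutive_runs` is a hypothetical name chosen for illustration.

```python
def consecutive_runs(values):
    """Group values into runs where each element is the previous one plus 1.

    `consecutive_runs` is a hypothetical helper, not a library function.
    """
    runs = []
    for v in values:
        if runs and v == runs[-1][-1] + 1:
            runs[-1].append(v)  # extends the current run
        else:
            runs.append([v])    # gap found: start a new run
    return runs


groups = consecutive_runs([1, 3, 4, 5, 9, 10, 17, 29, 30])
```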
21
votes
2 answers

Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?

I have a large JSON file with I'm guessing 4 million objects. Each top level has a few levels nested inside. I want to split that into multiple files of 10000 top level objects each (retaining the structure inside each). jq should be able to do…
Chaz
  • 787
  • 2
  • 9
  • 16
18
votes
7 answers

QuickSort and Hoare Partition

I have a hard time translating QuickSort with Hoare partitioning into C code, and can't find out why. The code I'm using is shown below: void QuickSort(int a[],int start,int end) { int q=HoarePartition(a,start,end); if (end<=start) return; …
Ofek Ron
  • 8,354
  • 13
  • 55
  • 103
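As a point of comparison for the question above, here is a minimal Python sketch of quicksort with a Hoare-style partition. It is not the asker's C code: it assumes a middle-element pivot, checks the base case before partitioning, and recurses on `[lo, p]` and `[p + 1, hi]`, which is the split that matches a partition returning `j`.

```python
def hoare_partition(a, lo, hi):
    """Hoare-style partition of a[lo..hi]; returns the split index j."""
    pivot = a[(lo + hi) // 2]  # assumption: middle element as pivot
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:
            i += 1
        j -= 1
        while a[j] > pivot:
            j -= 1
        if i >= j:
            return j
        a[i], a[j] = a[j], a[i]


def quicksort(a, lo=0, hi=None):
    """Sort list `a` in place."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:  # base case is checked before partitioning
        return
    p = hoare_partition(a, lo, hi)
    quicksort(a, lo, p)      # note: p, not p - 1, with this scheme
    quicksort(a, p + 1, hi)


data = [5, 2, 9, 1, 5, 6]
quicksort(data)
```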
17
votes
2 answers

Querying Windows Azure Table Storage with multiple query criteria

I'm trying to query a table in Windows Azure storage and was initially using the TableQuery.CombineFilters in the TableQuery().Where function as follows: TableQuery.CombineFilters( TableQuery.GenerateFilterCondition("PartitionKey",…
Captain John
  • 1,859
  • 2
  • 16
  • 30
13
votes
5 answers

How to sort an integer array into negative, zero, positive part without changing relative position?

Give an O(n) algorithm which takes as input an array S, then divides S into three sets: negatives, zeros, and positives. Show how to implement this in place, that is, without allocating new memory. And you have to keep the numbers' relative…
Gin
  • 1,763
  • 3
  • 12
  • 17
11
votes
1 answer

What is the difference between partitioning and bucketing in Spark?

I am trying to optimize a join query between two Spark dataframes, let's call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it among the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to…
nofar mishraki
  • 526
  • 1
  • 4
  • 15
11
votes
4 answers

How to write SQL query that selects distinct pair values for specific criteria?

I'm having trouble formulating a query for the following problem: for pair values that have a certain score, how do you group them in a way that will only return distinct pair values with the best respective scores? For example, let's say I have a…
10
votes
5 answers

3D clustering Algorithm

Problem Statement: I have the following problem: There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any…
Teng Lin
  • 129
  • 1
  • 1
  • 6
10
votes
2 answers

Hashing VS Indexing

Both hashing and indexing are used to partition data based on some predefined formula, but I am unable to understand the key difference between the two. In hashing we divide the data on the basis of some key-value pair; similarly, in indexing…
coolDude
  • 407
  • 1
  • 7
  • 17
10
votes
2 answers

partitioning a float array into similar segments (clustering)

I have an array of floats like this: [1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200] Now, I want to partition the array like this: [[1.91, 2.87, 3.61] , [10.91, 11.91, 12.82] , [100.73, 100.71, 101.89] , [200]] // [200] will…
alessandro
  • 1,681
  • 10
  • 33
  • 54
9
votes
4 answers

python: Generating integer partitions

I need to generate all the partitions of a given integer. I found this algorithm by Jerome Kelleher, which is stated to be the most efficient one: def accelAsc(n): a = [0 for i in range(n + 1)] k = 1 a[0] = 0 y = n - 1 …
etuardu
  • 5,066
  • 3
  • 46
  • 58
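For readers unfamiliar with integer partitions, the task in the question above can be illustrated with a simple recursive generator. This is not the accelAsc algorithm quoted in the question, and it makes no claim to efficiency; `integer_partitions` and `max_part` are hypothetical names.

```python
def integer_partitions(n, max_part=None):
    """Yield each partition of n as a non-increasing list of positive ints.

    A plain recursive sketch for illustration, not the accelAsc algorithm.
    """
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        yield []  # the single (empty) partition of zero
        return
    for first in range(max_part, 0, -1):
        # Remaining parts may not exceed `first`, which keeps lists sorted
        # and avoids emitting the same partition in a different order.
        for rest in integer_partitions(n - first, first):
            yield [first] + rest


result = list(integer_partitions(4))
```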