Questions tagged [large-data-volumes]
302 questions
75 votes · 8 answers
Designing a web crawler
I have come across an interview question: "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it.
How does it all begin, from the very beginning?
Say Google started with some hub pages, say…
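A typical answer: canonicalize every URL and keep a set of visited ones, so a page reachable under many spellings is fetched exactly once. A minimal Java sketch, assuming a hypothetical fetchLinks() in place of the real page-download logic:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Loop avoidance sketch: canonicalize URLs, skip ones already seen.
public class Crawler {
    private final Set<String> visited = new HashSet<>();
    private final Queue<String> frontier = new ArrayDeque<>();

    void crawl(String seed) throws Exception {
        frontier.add(canonicalize(seed));
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;          // already crawled: breaks cycles
            for (String link : fetchLinks(url)) {
                String c = canonicalize(link);
                if (!visited.contains(c)) frontier.add(c);
            }
        }
    }

    // Normalization collapses trivially different spellings of the same page.
    static String canonicalize(String url) throws Exception {
        URI u = new URI(url).normalize();
        String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
        String path = u.getPath() == null || u.getPath().isEmpty() ? "/" : u.getPath();
        return u.getScheme() + "://" + host + path;
    }

    // Hypothetical placeholder for fetching a page and extracting its links.
    static Iterable<String> fetchLinks(String url) { return java.util.List.of(); }
}
```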

asked by xyz (8,607 rep; 16 gold, 66 silver, 90 bronze badges)
59 votes · 12 answers
Using Hibernate's ScrollableResults to slowly read 90 million records
I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:
ScrollableResults results =…
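The pattern most answers converge on, sketched here against the Hibernate 5-era API: a read-only, forward-only scroll, a MySQL fetch-size hint so the driver streams rows instead of buffering all 90 million, and a periodic session.clear() so first-level-cache entries do not pile up. "Row", the session, and the writer stand in for the asker's own entity and output:

```java
import java.io.IOException;
import java.io.Writer;
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;

// Sketch only: "Row" is a placeholder entity with a toLine() method.
static void export(Session session, Writer writer) throws IOException {
    ScrollableResults results = session.createQuery("from Row")
            .setReadOnly(true)
            .setFetchSize(Integer.MIN_VALUE)   // MySQL Connector/J: stream rows
            .scroll(ScrollMode.FORWARD_ONLY);
    int count = 0;
    while (results.next()) {
        Row row = (Row) results.get(0);
        writer.write(row.toLine());
        if (++count % 1_000 == 0) {
            session.clear();                   // evict processed entities from the session
        }
    }
    results.close();
}
```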

asked by at. (50,922 rep; 104 gold, 292 silver, 461 bronze badges)
36 votes · 8 answers
Is it possible to change argv or do I need to create an adjusted copy of it?
My application potentially has a huge number of arguments passed in, and I want to avoid the memory hit of duplicating the arguments into a filtered list. I would like to filter them in place, but I am pretty sure that messing with the argv array itself,…
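For reference, the C standard does allow this: argc, argv, and the strings argv points to are modifiable by the program, so a two-index in-place compaction of the pointer array is legal. The same compaction idea, shown in Java on main's argument array for consistency with the other examples on this page (keep() is a hypothetical predicate):

```java
// In-place filtering sketch: survivors are shifted left; no second array is allocated.
public class Args {
    public static void main(String[] argv) {
        int kept = 0;
        for (int i = 0; i < argv.length; i++) {
            if (keep(argv[i])) {
                argv[kept++] = argv[i];   // compact in place, exactly as with C's argv
            }
        }
        // entries [0, kept) now hold the filtered arguments
        for (int i = 0; i < kept; i++) System.out.println(argv[i]);
    }

    // Hypothetical filter; replace with the real rule.
    static boolean keep(String a) { return !a.startsWith("--ignore"); }
}
```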

asked by ojblass (21,146 rep; 22 gold, 83 silver, 132 bronze badges)
33 votes · 8 answers
large amount of data in many text files - how to process?
I have large amounts of data (a few terabytes), with more accumulating... They are contained in many tab-delimited flat text files (each about 30 MB). Most of the task involves reading the data and aggregating (summing/averaging + additional…
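The shape of most answers: stream each file line by line and keep only running aggregates, so memory use stays constant no matter how many terabytes sit on disk. A minimal Java sketch; the column positions (key in field 0, value in field 2) and the *.txt glob are assumptions for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Streaming aggregation sketch: one pass, per-key running sum and count.
public class Aggregate {
    public static void main(String[] args) throws IOException {
        Map<String, double[]> stats = new HashMap<>(); // key -> {sum, count}
        try (DirectoryStream<Path> files = Files.newDirectoryStream(Paths.get(args[0]), "*.txt")) {
            for (Path file : files) {
                try (BufferedReader in = Files.newBufferedReader(file)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] f = line.split("\t");
                        double[] s = stats.computeIfAbsent(f[0], k -> new double[2]);
                        s[0] += Double.parseDouble(f[2]);  // running sum
                        s[1] += 1;                         // running count
                    }
                }
            }
        }
        // key, sum, and average per key
        stats.forEach((k, s) -> System.out.println(k + "\t" + s[0] + "\t" + (s[0] / s[1])));
    }
}
```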

asked by hatmatrix (42,883 rep; 45 gold, 137 silver, 231 bronze badges)
29 votes · 9 answers
Plotting of very large data sets in R
How can I plot a very large data set in R?
I'd like to use a boxplot, violin plot, or similar. All the data cannot fit in memory. Can I incrementally read it in and calculate the summaries needed to make these plots? If so, how?
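The summaries a boxplot draws (the quantiles) can indeed be computed in a single streaming pass using a fixed-size histogram. A sketch of the idea, in Java for consistency with the other examples on this page; the bin count and the assumed value range [lo, hi] are placeholders to tune:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// One-pass sketch: bin values into a fixed histogram, then read approximate
// quantiles off the cumulative bin counts. Assumes values lie in [lo, hi].
public class StreamingQuantiles {
    public static void main(String[] args) throws IOException {
        double lo = 0, hi = 1000;            // assumed data range
        long[] bins = new long[10_000];
        long n = 0;
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                double v = Double.parseDouble(line);
                int b = (int) ((v - lo) / (hi - lo) * (bins.length - 1));
                bins[Math.max(0, Math.min(bins.length - 1, b))]++;
                n++;
            }
        }
        // The five numbers a boxplot draws, to histogram-bin resolution.
        for (double q : new double[]{0.0, 0.25, 0.5, 0.75, 1.0}) {
            long target = (long) (q * (n - 1));
            long seen = 0;
            for (int b = 0; b < bins.length; b++) {
                seen += bins[b];
                if (seen > target) {
                    System.out.printf("q%.2f ~ %.3f%n", q, lo + (hi - lo) * b / (bins.length - 1.0));
                    break;
                }
            }
        }
    }
}
```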

asked by Daniel Arndt (2,268 rep; 2 gold, 17 silver, 22 bronze badges)
24 votes · 7 answers
Efficiently storing 7.300.000.000 rows
How would you tackle the following storage and retrieval problem?
Roughly 2.000.000 rows will be added each day (365 days/year) with the following information per row:
id (unique row identifier)
entity_id (takes on values between 1 and 2.000.000…
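One standard approach to questions of this shape is date-based sharding: each day's ~2.000.000 rows land in their own table, so indexes stay small and old days are cheap to drop or archive. A hedged JDBC sketch; the schema and naming scheme are assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.time.LocalDate;

// Sketch of per-day sharding: writes and most reads touch one small table
// (~2 million rows) rather than the full 7.3 billion.
public class DailyShards {
    static String tableFor(LocalDate day) {
        return "rows_" + day.toString().replace('-', '_');   // e.g. rows_2024_01_31
    }

    static void insertDay(Connection con, LocalDate day, long[][] rows) throws Exception {
        String table = tableFor(day);
        try (Statement ddl = con.createStatement()) {
            ddl.execute("CREATE TABLE IF NOT EXISTS " + table
                    + " (id BIGINT PRIMARY KEY, entity_id INT NOT NULL, value BIGINT)");
        }
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO " + table + " VALUES (?, ?, ?)")) {
            for (long[] r : rows) {
                ps.setLong(1, r[0]);
                ps.setInt(2, (int) r[1]);
                ps.setLong(3, r[2]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```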

asked by knorv (49,059 rep; 74 gold, 210 silver, 294 bronze badges)
24 votes · 2 answers
JDBC Batch Insert OutOfMemoryError
I have written a method insert() in which I am trying to use JDBC Batch for inserting half a million records into a MySQL database:
public void insert(int nameListId, String[] names) {
String sql = "INSERT INTO name_list_subscribers…
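The usual fix: execute the batch every few thousand rows instead of letting the driver buffer all 500,000 statements at once. A sketch, adapted to take the Connection as a parameter; the batch size is an assumption to tune:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Periodic-flush sketch: memory use is bounded by BATCH, not by names.length.
public class Subscribers {
    public void insert(Connection con, int nameListId, String[] names) throws SQLException {
        final int BATCH = 5_000;   // assumed flush interval
        String sql = "INSERT INTO name_list_subscribers (name_list_id, name) VALUES (?, ?)";
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (int i = 0; i < names.length; i++) {
                ps.setInt(1, nameListId);
                ps.setString(2, names[i]);
                ps.addBatch();
                if ((i + 1) % BATCH == 0) {
                    ps.executeBatch();   // send and discard the buffered statements
                }
            }
            ps.executeBatch();           // flush the remainder
            con.commit();
        }
    }
}
```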

asked by craftsman (15,133 rep; 17 gold, 70 silver, 86 bronze badges)
21 votes · 2 answers
Docker Data Volume Container - Can I share across swarm
I know how to create and mount a data volume container to multiple other containers using --volumes-from, but I do have a few questions regarding its usage and limitations:
Situation: I am looking to use a data volume container to store user…

asked by deankarn (462 rep; 2 gold, 6 silver, 17 bronze badges)
21 votes · 4 answers
what changes when your input is giga/terabyte sized?
I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write…

asked by Wang (3,247 rep; 1 gold, 21 silver, 33 bronze badges)
20 votes · 4 answers
How to do page navigation for many, many pages? Logarithmic page navigation
What's the best way of displaying page navigation for many, many pages?
(Initially this was posted as a how-to tip with my answer included in the question. I've now split my answer off into the "answers" section below).
To be more…
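The idea in code: link every page near the current one, plus pages at exponentially growing distances, so even a million pages produce only a couple dozen links. A minimal Java sketch; the neighbour radius and the power-of-ten spacing are assumptions:

```java
import java.util.TreeSet;

// Logarithmic pagination sketch: O(log n) links for n pages.
public class LogPager {
    static TreeSet<Integer> pageLinks(int current, int last) {
        TreeSet<Integer> pages = new TreeSet<>();
        pages.add(1);
        pages.add(last);
        pages.add(current);
        for (int d = 1; d <= 2; d++) {                  // immediate neighbours
            pages.add(current - d);
            pages.add(current + d);
        }
        for (int step = 10; step < last; step *= 10) {  // 10, 100, 1000, ...
            pages.add(current - step);
            pages.add(current + step);
        }
        pages.removeIf(p -> p < 1 || p > last);         // clamp to valid range
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(pageLinks(5000, 1_000_000));
        // [1, 4000, 4900, 4990, 4998, 4999, 5000, 5001, 5002, 5010, 5100, 6000, 15000, 105000, 1000000]
    }
}
```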

asked by Doin (7,545 rep; 4 gold, 35 silver, 37 bronze badges)
20 votes · 2 answers
Bad idea to transfer large payload using web services?
I gather that there basically isn't a limit to the amount of data that can be sent when using REST via a POST or GET. While I haven't used REST or web services, it seems that most services involve transferring limited amounts of data. If you want…
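In practice the limit is buffering, not the protocol: with chunked transfer encoding an HTTP POST body can be arbitrarily large while the client uses constant memory. A Java sketch; the endpoint URL is a placeholder:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Chunked-streaming sketch: the file is copied to the wire in 64 KB chunks,
// so the payload size never affects heap usage.
public class StreamUpload {
    public static void main(String[] args) throws Exception {
        HttpURLConnection con = (HttpURLConnection)
                new URL("https://example.com/upload").openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setChunkedStreamingMode(64 * 1024);   // never buffer the whole body
        try (OutputStream out = con.getOutputStream();
             InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            in.transferTo(out);
        }
        System.out.println("HTTP " + con.getResponseCode());
    }
}
```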

asked by Marcus Leon (55,199 rep; 118 gold, 297 silver, 429 bronze badges)
18 votes · 6 answers
How to avoid OOM (Out of memory) error when retrieving all records from huge table?
I have been given a task to convert a huge table to a custom XML file. I will be using Java for this job.
If I simply issue a "SELECT * FROM customer", it may return a huge amount of data, eventually causing an OOM error. I wonder, is there a way I can process the…
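The row-at-a-time pattern most answers suggest: a streaming ResultSet feeding a StAX writer, so neither the rows nor the XML document are ever fully in memory. A sketch assuming MySQL Connector/J, whose streaming idiom is a fetch size of Integer.MIN_VALUE; the column names are assumptions:

```java
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Streaming export sketch: forward-only ResultSet in, StAX events out.
public class ExportXml {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(args[0]);
             Statement st = con.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             FileWriter out = new FileWriter("customers.xml")) {
            st.setFetchSize(Integer.MIN_VALUE);   // MySQL: stream, don't buffer
            XMLStreamWriter xml = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("customers");
            try (ResultSet rs = st.executeQuery("SELECT id, name FROM customer")) {
                while (rs.next()) {
                    xml.writeStartElement("customer");
                    xml.writeAttribute("id", rs.getString("id"));
                    xml.writeCharacters(rs.getString("name"));
                    xml.writeEndElement();
                }
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        }
    }
}
```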

asked by janetsmith (8,562 rep; 11 gold, 58 silver, 76 bronze badges)
16 votes · 5 answers
Transferring large payloads of data (Serialized Objects) using wsHttp in WCF with message security
I have a case where I need to transfer large amounts of serialized object graphs (via NetDataContractSerializer) using WCF using wsHttp. I'm using message security and would like to continue to do so. Using this setup I would like to transfer…

asked by jpierson (16,435 rep; 14 gold, 105 silver, 149 bronze badges)
12 votes · 11 answers
Fastest way to search a 1 GB+ string of data for the first occurrence of a pattern
There's a 1 gigabyte string of arbitrary data which you can assume to be equivalent to something like:
1_gb_string=os.urandom(1*gigabyte)
We will be searching this string, 1_gb_string, for an infinite number of fixed-width, 1-kilobyte patterns,…
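One building block that comes up in answers to this kind of question: a Rabin-Karp rolling hash, which scans the haystack once per pattern and compares bytes only on hash hits. For an unbounded stream of patterns you would precompute an index of window hashes instead, but the hash machinery is the same. A minimal Java sketch over byte arrays; the base constant is an arbitrary choice:

```java
// Rabin-Karp sketch: hash arithmetic wraps mod 2^64 via long overflow,
// which is consistent for both the target and the rolling window.
public class RollingSearch {
    static final long B = 1_000_003L;   // hash base (assumption)

    static int indexOf(byte[] hay, byte[] needle) {
        int m = needle.length;
        if (m > hay.length) return -1;
        long target = 0, h = 0, pow = 1;
        for (int i = 0; i < m; i++) {
            target = target * B + needle[i];
            h = h * B + hay[i];
            if (i > 0) pow *= B;        // pow = B^(m-1), used to roll off a byte
        }
        for (int i = 0; ; i++) {
            if (h == target && matches(hay, i, needle)) return i;
            if (i + m >= hay.length) return -1;
            h = (h - hay[i] * pow) * B + hay[i + m];   // slide the window by one
        }
    }

    static boolean matches(byte[] hay, int at, byte[] needle) {
        for (int j = 0; j < needle.length; j++)
            if (hay[at + j] != needle[j]) return false;
        return true;
    }
}
```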

asked by user213060 (1,249 rep; 3 gold, 19 silver, 25 bronze badges)
11 votes · 7 answers
Fastest way for inserting very large number of records into a Table in SQL
The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code; it's not a move from another table, so INSERT/SELECT won't help.
Currently,…
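For MySQL specifically, a commonly cited fastest route is to write the generated records to a temporary file and bulk-load it with LOAD DATA LOCAL INFILE, which tends to outperform even batched INSERTs. A hedged sketch; the table and column names are illustrative:

```java
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Bulk-load sketch: records generated in Java, staged as tab-separated
// lines, then loaded in one server-side pass.
public class BulkLoad {
    public static void main(String[] args) throws Exception {
        Path csv = Files.createTempFile("rows", ".tsv");
        try (BufferedWriter out = Files.newBufferedWriter(csv)) {
            for (int i = 0; i < 1_000_000; i++) {
                out.write(i + "\tname" + i + "\n");   // records created by the Java code
            }
        }
        // allowLoadLocalInfile=true is required by recent Connector/J versions
        try (Connection con = DriverManager.getConnection(
                args[0] + "?allowLoadLocalInfile=true");
             Statement st = con.createStatement()) {
            st.execute("LOAD DATA LOCAL INFILE '"
                    + csv.toString().replace("\\", "/")
                    + "' INTO TABLE my_table (id, name)");
        }
        Files.delete(csv);
    }
}
```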

asked by Iravanchi (5,139 rep; 9 gold, 40 silver, 56 bronze badges)