Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large", because "large" depends on the situation: on the web, 1 MB or 2 MB might be large, while for an application meant to clone hard drives, 5 TB might be. A specific number is unnecessary, since this tag is for questions about problems caused by too much data, whatever that amount is.

2088 questions
1178 votes • 16 answers

"Large data" workflows using pandas

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other…
Zelazny7 • 39,946 • 18 • 70 • 84
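One commonly suggested pattern for this kind of workflow is out-of-core storage: ingest the file in chunks into an on-disk, queryable store, then pull back only the rows needed. A minimal sketch using pandas' HDFStore (requires the PyTables package; the file names and the column name field_1 are hypothetical):

```python
import pandas as pd

# Ingest a file too big for memory in chunks, appending each chunk
# to an on-disk HDF5 store. "big_file.csv" and "field_1" are
# hypothetical placeholders.
with pd.HDFStore("store.h5", mode="w") as store:
    for chunk in pd.read_csv("big_file.csv", chunksize=500_000):
        store.append("df", chunk, data_columns=["field_1"])

# Later, query only the rows you need back into memory.
with pd.HDFStore("store.h5") as store:
    subset = store.select("df", where="field_1 > 0")
```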
117 votes • 8 answers

What causes a Python segmentation fault?

I am implementing Kosaraju's strongly connected components (SCC) graph algorithm in Python. The program runs great on small data sets, but when I run it on a super-large graph (more than 800,000 nodes), it says "Segmentation Fault". What might be…
xiaolong • 3,396 • 4 • 31 • 46
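A frequent cause of this symptom is deep recursion overflowing the C stack: CPython's recursion limit normally raises RecursionError first, but raising that limit without also enlarging the stack can segfault. Whether that is the asker's issue is an assumption; a minimal sketch of the usual workaround:

```python
import sys
import threading

sys.setrecursionlimit(2 ** 20)  # permit very deep recursion

def main():
    # deep recursive work (e.g. a DFS over a ~800k-node graph) goes here
    pass

# Run the work in a thread with a 64 MB stack, so the deep recursion
# has room instead of overflowing the default stack and segfaulting.
threading.stack_size(64 * 1024 * 1024)
t = threading.Thread(target=main)
t.start()
t.join()
```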
104 votes • 5 answers

Shared memory in multiprocessing

I have three large lists. The first contains bitarrays (module bitarray 0.8.0) and the other two contain arrays of integers: l1 = [bitarray 1, bitarray 2, ..., bitarray n], l2 = [array 1, array 2, ..., array n], l3 = [array 1, array 2, ..., array n]. These…
FableBlaze • 1,785 • 3 • 16 • 21
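For plain integer arrays (the bitarray case is more involved), Python 3.8+ offers multiprocessing.shared_memory, so worker processes can read one copy of the data instead of receiving a pickled copy each. A minimal sketch, assuming NumPy int64 data rather than the asker's exact types:

```python
from multiprocessing import Process
from multiprocessing.shared_memory import SharedMemory
import numpy as np

def worker(shm_name, shape, dtype):
    # Attach to the existing shared block; no data is copied.
    shm = SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    print(arr[:5].sum())
    shm.close()

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=np.int64)  # stand-in for a large list
    shm = SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data  # the single copy into shared memory

    p = Process(target=worker, args=(shm.name, data.shape, data.dtype))
    p.start()
    p.join()

    shm.close()
    shm.unlink()  # free the shared block
```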
75 votes • 4 answers

Append lines to a file

I'm new to R. I'm trying to add (append) new lines to a file with my existing data in R. The problem is that my data has about 30000 rows and 13000 cols. I already tried to add a line with the writeLines function, but the resulting file contains…
Sergio Vela • 751 • 1 • 5 • 3
73 votes • 4 answers

Parallel.ForEach can cause an "Out Of Memory" exception if working with an enumerable with a large object

I am trying to migrate a database where images were stored in the database to records in the database pointing at files on the hard drive. I was trying to use Parallel.ForEach to speed up the process, using this method to query out the…
Scott Chamberlain • 124,994 • 33 • 282 • 431
58 votes • 3 answers

Is there any JSON viewer to open large JSON files (Windows)?

I have a very large JSON file, several GB in size. I am looking for an efficient JSON viewer in which we are also able to view the JSON in tree format. I understand such a huge file can't be loaded in one go. I wonder whether there is any software to view JSON…
Anwar Shaikh • 1,591 • 3 • 22 • 43
50 votes • 2 answers

Red-Black Tree versus B-Tree

I have a project in which I have to achieve fast search, insert, and delete operations on data ranging from megabytes to terabytes. I have been studying data structures of late and analyzing them. To be specific, I want to introduce 3 cases and ask…
swanar • 635 • 1 • 6 • 10
50 votes • 8 answers

What is the difference between the Laravel cursor and Laravel chunk methods?

I would like to know the difference between the Laravel chunk and cursor methods. Which method is more suitable to use? What would be the use cases for each of them? I know that you should use cursor to save memory, but how it actually…
Suraj • 2,181 • 2 • 17 • 25
45 votes • 3 answers

How to efficiently write large files to disk on background thread (Swift)

Update: I have resolved and removed the distracting error. Please read the entire post and feel free to leave comments if any questions remain. Background: I am attempting to write relatively large files (video) to disk on iOS using Swift 2.0, GCD,…
Tommie C. • 12,895 • 5 • 82 • 100
41 votes • 3 answers

Writing large pandas DataFrames to a CSV file in chunks

How do I write out large data files to a CSV file in chunks? I have a set of large data files (1M rows x 20 cols). However, only 5 or so columns of the data files are of interest to me. I want to make things easier by making copies of these files…
Korean_Of_the_Mountain • 1,428 • 3 • 16 • 40
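One way to approach this with pandas alone is to read only the interesting columns in chunks and append each chunk to the output file. A minimal sketch (the column names and file paths are hypothetical):

```python
import pandas as pd

usecols = ["col1", "col2", "col3", "col4", "col5"]  # hypothetical names

# Stream the big file 100k rows at a time, keeping only 5 columns.
reader = pd.read_csv("big_input.csv", usecols=usecols, chunksize=100_000)

for i, chunk in enumerate(reader):
    # Write the header with the first chunk, then append without it.
    chunk.to_csv("subset.csv",
                 mode="w" if i == 0 else "a",
                 header=(i == 0),
                 index=False)
```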
39 votes • 2 answers

How to plot with a PNG as background?

I made a plot with 3 million points and saved it as a PNG. It took a few hours, and I would like to avoid re-drawing all the points. How can I generate a new plot that has this PNG as a background?
Aleksandr Levchuk • 3,751 • 4 • 35 • 47
32 votes • 5 answers

How to read only lines that fulfil a condition from a CSV into R?

I am trying to read a large CSV file into R. I only want to read and work with some of the rows that fulfil a particular condition (e.g. Variable2 >= 3), which is a much smaller dataset. I want to read these lines directly into a dataframe, rather…
Hernan • 471 • 1 • 4 • 8
31 votes • 2 answers

D3: How to show a large dataset

I have a large dataset comprising 10^5 data points, and now I'm considering the following question related to large datasets: is there any efficient way to visualize a very large dataset? In my case I have a user set, and each user has 10^3 items. There…
SolessChong • 3,370 • 8 • 40 • 67
25 votes • 5 answers

Repeat NumPy array without replicating data?

I'd like to create a 1D NumPy array that would consist of 1000 back-to-back repetitions of another 1D array, without replicating the data 1000 times. Is it possible? If it helps, I intend to treat both arrays as immutable.
NPE • 486,780 • 108 • 951 • 1,012
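One zero-copy answer is a broadcast view: NumPy can present the same buffer 1000 times along a new axis, as long as the result stays read-only and is not flattened (reshaping to 1D forces a copy, so this only approximates the asker's exact request). A minimal sketch:

```python
import numpy as np

base = np.arange(4)

# Broadcast a new leading axis of length 1000 over the same buffer;
# no data is replicated and the result is a read-only view.
tiled = np.broadcast_to(base, (1000, base.size))

print(np.shares_memory(tiled, base))  # True: still one copy of the data
print(tiled.shape)                    # (1000, 4)

# Caveat: tiled.ravel() or tiled.reshape(-1) would materialize a real
# 1-D copy, so the memory savings only hold while the view stays 2-D.
```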
25 votes • 5 answers

Mean value and standard deviation of a very large data set

I am wondering if there is an algorithm that calculates the mean value and standard deviation of an unbounded data set. For example, I am monitoring a measurement value, say, electric current. I would like to have the mean value of all historical…
Alfred Zhong • 6,773 • 11 • 47 • 59
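Yes: Welford's online algorithm maintains the mean and the sum of squared deviations in a single pass with O(1) memory, which suits an unbounded stream of measurements. A minimal sketch (class and variable names are my own):

```python
class RunningStats:
    """Welford's online algorithm: one pass, O(1) memory,
    numerically stable for unbounded streams."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation; divide by (n - 1) for the
        # sample version instead.
        return (self.m2 / self.n) ** 0.5 if self.n else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean, stats.std)  # 5.0 2.0
```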