
I am using Python, and when indexing documents (for a search engine) it takes a lot of RAM. After I stop the indexing process, the memory is still full (around 8 GB of RAM). This is bad because I need my search engine to run all the time, not to have to reboot the OS whenever I finish indexing. Is there an efficient way to manage huge arrays, dictionaries, and lists, and to free them? Any ideas?

I also saw some questions about this on Stack Overflow, but they are old:

Python memory footprint vs. heap size

Profile Memory Allocation in Python (with support for Numpy arrays)

Info:

free -t
             total       used       free     shared    buffers     cached
Mem:          5839       5724        114          0         15       1011
-/+ buffers/cache:       4698       1141
Swap:         1021        186        835
Total:        6861       5910        950


top | grep python 

 3164 root      20   0 68748  31m 1404 R   17  0.5  53:43.89 python                                                                     
 6716 baddc0re  20   0 84788  30m 1692 S    0  0.5   0:06.81 python     

 ps aux | grep python

root      3164 57.1  0.4  64876 29824 pts/0    R+   May27  54:23 python SE_doc_parse.py
baddc0re  6693  0.0  0.2  53240 16224 pts/1    S+   00:46   0:00 python index.py

uptime

01:02:40 up  1:43,  3 users,  load average: 1.22, 1.46, 1.39


sysctl vm.min_free_kbytes

vm.min_free_kbytes = 67584

The real problem is that when I start the script the indexing is fast, but as memory usage increases it gets slower.

Document wikidoc_18784 added on 2012-05-28 01:03:46 "fast"
wikidoc_18784
-----------------------------------
Document wikidoc_21934 added on 2012-05-28 01:04:00 "slower"
wikidoc_21934
-----------------------------------
Document wikidoc_22903 added on 2012-05-28 01:04:01 "slower"
wikidoc_22903
-----------------------------------
Document wikidoc_20274 added on 2012-05-28 01:04:10 "slower"
wikidoc_20274
-----------------------------------
Document wikidoc_23013 added on 2012-05-28 01:04:53 "even slower"
wikidoc_23013

The documents are one or two pages of text at most. Indexing 10 pages takes about 2-3 seconds.

Thanks everyone for the help :)

badc0re
  • You forgot to say what the problem is. What happens if you don't reset the OS? Does something crash? Or run slowly? Or what? – David Schwartz May 28 '12 at 08:48
  • Well, everything is slow. The performance of the search engine decreases. – badc0re May 28 '12 at 08:50
  • You kind of need to describe the problem then. Nobody reading your question would have any clue that this is a search engine performance problem. What remains slow after the indexing is finished? Just Python or the system as a whole? And is the CPU largely idle while it's slow? What OS? What do the system memory stats look like? – David Schwartz May 28 '12 at 08:50
  • I wrote "indexing documents (for search engine)" and I said that the whole system is slow. The OS is Ubuntu 11.10. – badc0re May 28 '12 at 08:51
  • Okay, when the system is slow, what is the output of `free`? And what is the output of `uptime`? – David Schwartz May 28 '12 at 08:54
  • I have added the information. – badc0re May 28 '12 at 09:04
  • It looks like SE_doc_parse is slamming the CPU. Free memory is above what the system needs but not abnormal. The page cache doesn't appear to be squeezed. It just looks like the CPU doing lots of work. – David Schwartz May 28 '12 at 09:22
  • Hmm, so you think the main reason is processor performance? I am using an AMD 1055T. – badc0re May 28 '12 at 09:26
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/11815/discussion-between-david-schwartz-and-badc0re) – David Schwartz May 28 '12 at 09:27
  • Just as a side note, you're trading time for memory, right? Sure, a search engine is going to punch RAM really damn hard anyway... – Jakob Bowyer May 28 '12 at 09:42
  • I don't know if you are discarding data; if you are, try multiprocessing, which can spread the work across different cores. – Sam Moldenha Sep 09 '20 at 07:39

3 Answers


Your issue can't possibly be related to too much memory use. The more memory the system uses, the faster it runs. That's why we add memory to a system to improve its performance. If you think that using less memory will somehow make the system faster, take some memory out. That will force it to use less memory. But, not surprisingly, it will be slower if you do that.

The system keeps memory in use because it takes effort to make memory free. And there is no benefit, since free memory doesn't do anything. It's not like if you use half as much today, you can use twice as much tomorrow. If the system needs memory for something, it can easily just move memory directly from one use to another -- it doesn't need a lot of memory sitting around free.

Modern operating systems only keep a small amount of memory free to cope with certain types of unusual cases where they can't transition memory from one use to another. On Linux, you can find out how much free memory the system needs with this command: `sysctl vm.min_free_kbytes`. You'll probably find that's roughly how much free memory you have -- and that's good, because that's what the system needs.

So you don't need or want to free memory. You want to figure out why your system is slow.

Update: From your new information, it looks like SE_doc_parse.py is slamming the CPU hard. I would look at optimizing that code, if possible.

Update: It turned out to be an inefficient dictionary algorithm being used beyond the sizes it was intended to scale to, which was hogging the CPU.
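
For illustration, here is a hypothetical sketch (not the OP's actual code) of the kind of scaling problem meant above: an index that scans a list on every insertion does O(n) work per term, so total indexing time grows quadratically, while a plain dict keeps each insertion O(1) on average.

    # Hypothetical example (not the OP's code): an index that scans a
    # list of (term, postings) pairs on every insertion scales badly.
    index_slow = []

    def add_term_slow(term, doc_id):
        for entry_term, postings in index_slow:  # O(n) scan each time
            if entry_term == term:
                postings.append(doc_id)
                return
        index_slow.append((term, [doc_id]))

    # The same index as a dict: each insertion is O(1) on average, so
    # indexing speed stays roughly constant as the index grows.
    index_fast = {}

    def add_term_fast(term, doc_id):
        index_fast.setdefault(term, []).append(doc_id)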

David Schwartz
  • This answer should be qualified: swapping should be included in the picture. When memory swaps from RAM to disk, programs do run slower. It is important to qualify "The more memory the system uses, the faster it runs." – Eric O. Lebigot May 28 '12 at 09:28
  • @EOL: Swapping is not an exception. The more physical memory the system uses, the less it swaps. Even when the system is swapping, the more physical memory it uses, the faster it runs. – David Schwartz May 28 '12 at 09:32
  • True, but the original poster was arguably thinking about the memory used by his *program*, not the memory available on his computer, so your remark can be understood as "the more physical memory the program uses, the less it swaps". Granted, your remark is technically correct, but addressing the original poster's concerns directly instead of making a side remark about a computer's RAM would be less confusing. – Eric O. Lebigot May 28 '12 at 09:34
  • @EOL: He says, "after i stop the indexing process the memory is still full". How can that be about memory used by his program? And "the more physical memory the program uses, the less it swaps" is *correct*. All other things being equal, if you force a program to use less physical memory, it will swap more. (Test it. Take physical memory out, forcing the program to use less, and see what happens to performance.) Physical memory use is *good*. If you have the memory sitting in the system (and he does), using it is *free*. – David Schwartz May 28 '12 at 09:36
  • While I agree with you on the technical level, you certainly understand that "taking memory out of a computer" is way less common than "writing a program so that it takes less memory". I am arguing that most readers will want you to discuss the second meaning instead of the first one (including the original poster, who needs a solution to his problem instead of a theoretical discussion about something that he will never do—remove physical memory from his computer). – Eric O. Lebigot May 28 '12 at 09:38
  • @EOL: The first step to solving a problem is understanding it. It seems, from what we know right now, that the problem is most likely CPU-related. If it's memory related, it's most likely only due to cache effects. (The system does not seem to be swapping and memory usage seems normal, at least as far as we know.) And, by the way, people really do bone-headed things like command their computers to drop or shrink caches to make more free memory. It's not like people don't do very bad things due to the misunderstanding that free memory is generally bad. – David Schwartz May 28 '12 at 10:06

From the discussion it seems you are storing the data in nothing but one giant dict (it's not often I get to say that with a straight face ;) ). Offloading the data onto a proper database like Redis might reduce Python's memory usage. It might also make your data more efficient and faster to work with.
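
A minimal sketch of the idea, assuming a local Redis server and the redis-py package (`pip install redis`); the `term:<word>` key scheme here is made up for illustration:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def index_document(doc_id, text):
        # Keep the inverted index in Redis sets rather than in one
        # giant in-process dict, so the Python heap stays small.
        for word in text.lower().split():
            r.sadd("term:" + word, doc_id)

    def lookup(word):
        # Return the set of document ids containing the word.
        return r.smembers("term:" + word.lower())

Since Redis holds the index in its own process (and can persist it to disk), the indexer's memory use stays flat no matter how large the index grows.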

Jakob Bowyer

I would guess that your program slows down for at least one of the following reasons:

  • Your memory starts swapping, with data going from RAM to disk and vice versa. The solution would indeed be for your program to use less memory.
  • The algorithm that you use scales badly with the data size. In this case, finding a better algorithm is obviously the solution.

In both cases, we would need to see some of your code (what it essentially amounts to) in order to give a more specific solution.

Common solutions include (a short sketch of both appears after this list):

  • Using Python's `del` to indicate that a variable is no longer needed.
  • Using iterators instead of lists (iterators do not hold all their items in memory at once).
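
A minimal sketch of both ideas (the file names and tokenization are made up for illustration):

    def read_documents(paths):
        # Generator: yields one document at a time instead of
        # building a list of all documents in memory.
        for path in paths:
            with open(path) as f:
                yield f.read()

    for doc in read_documents(["doc1.txt", "doc2.txt"]):
        tokens = doc.split()  # temporary working data
        # ... index the tokens here ...
        del tokens  # drop the reference so the object can be freed

Note that `del` only removes a name binding; CPython reclaims the memory once no other references to the object remain.
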
Eric O. Lebigot
  • He posted his stats. His cache size vastly exceeds his used swap space. His system isn't swapping. (Likely it just once swapped out data that was never used since the system started up.) – David Schwartz May 28 '12 at 09:34
  • You may be right about `del`. Leaking memory, even if it doesn't use up the system's memory (which it doesn't in this case), can destroy your code's memory efficiency since its working set won't fit into cache. That can cause excessive CPU usage. – David Schwartz May 28 '12 at 09:38
  • @DavidSchwartz: I see what you are saying, about the cache versus used swap space. However, it is not clear when the result of `free -t` was obtained. Maybe it is after the program was run instead of during some swapping time?? – Eric O. Lebigot May 28 '12 at 09:43
  • If it was after the program was run, then it shows that the program wasn't using a lot of memory itself because memory the program was using would likely still be free since the system wouldn't have made it cache yet. (Plus, it's not helpful to make the most unlikely hypotheses first. When you hear hoofprints, start out assuming it's horses, not zebras. At least until you see some evidence to the contrary.) – David Schwartz May 28 '12 at 10:05