
I have about 4 input text files that I want to read and then write into one separate file. I use two threads so that it runs faster.
Here are my questions and my code in Python:

1- Does each thread have its own version of variables such as "lines" inside the function "writeInFile"?

2- Since I copied some parts of the code from Tutorialspoint, I don't understand what "while 1: pass" in the last lines does. Can you explain? Link to the original code: http://www.tutorialspoint.com/python/python_multithreading.htm

3- Does it matter what delay I set for the threads?

4- If I have about 400 input text files and want to do some operations on them before writing all of them into a separate file, how many threads can I use?

5- Assuming I use 10 threads, is it better to put the inputs in different folders (10 folders with 40 input text files each) and give each thread one folder, OR to do what I already do in the code below, where each thread reads any of the 400 input text files that has not already been read by another thread?

import glob
import thread
import time
import timeit

processedFiles=[] # files already claimed by one thread, so the other thread doesn't read them again

#Function run by the threads
def writeInFile( threadName, delay):
   for file in glob.glob("*.txt"):

      if file not in processedFiles:
         processedFiles.append(file)
         f = open(file,"r")
         lines = f.readlines()
         f.close()

         time.sleep(delay)
         #open the file to write in
         f = open('myfile','a')
         f.write("%s \n" %lines) 
         f.close()
         print "%s: %s" % ( threadName, time.ctime(time.time()) )



# Create two threads as follows
try:
   f = open('myfile', 'r+')
   f.truncate()

   start = timeit.default_timer()

   thread.start_new_thread( writeInFile, ("Thread-1", 0, ) )
   thread.start_new_thread( writeInFile, ("Thread-2", 0, ) )
   stop = timeit.default_timer()

   print stop - start

except:
   print "Error: unable to start thread"


while 1:
   pass
Mahin Ra

1 Answer

  1. Yes. Local variables such as lines live on each thread's own stack and are not shared between threads. The module-level list processedFiles, however, is shared by both threads, and the "if file not in processedFiles" check followed by append is not atomic, so two threads can occasionally pick up the same file.
  2. This loop keeps the main thread alive so that the program does not exit while the child threads are still running. It never terminates on its own, though, and it spins the CPU while it waits. The construct you should use for this is join, which is provided by the threading module rather than thread. See "What is the use of join() in Python threading?".
  3. In practice, yes, especially if the threads are writing to a common file (here both Thread-1 and Thread-2 append to myfile). A nonzero delay only pauses each thread between reading and writing, so it changes how the threads interleave and how responsive the program feels rather than making the work itself faster. Depending on the hardware, the size of the files, and the amount of data you're trying to write, different delays will behave differently. The best bet is to start with a simple value and adjust it as you see the program work in a real-world setting.
  4. While you can technically use as many threads as you want, you generally won’t get any performance benefit beyond one thread per CPU core. In CPython the global interpreter lock also means that threads help mainly with I/O-bound work such as reading files, not with CPU-bound processing.
  5. Different folders won’t matter much for only 400 files. If you’re talking about 4,000,000 files, then it might matter, for instance when you want to run ls on those directories. What will matter for performance is whether each thread is working on its own file or whether two or more threads might be operating on the same file.
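
To make points 1, 2 and 5 concrete, here is a minimal sketch of the same job written with threading, join and a queue. It targets Python 3, unlike the Python 2 code in the question, and the names worker and merge_files are hypothetical, chosen only for illustration. Each thread claims files from a shared queue (so no file is handled twice), keeps its data in local variables, and takes a lock before appending to the common output file.

```python
# Python 3 sketch; worker and merge_files are hypothetical names.
import queue
import threading


def worker(pending, out_path, write_lock):
    """Claim files from the shared queue until it is empty."""
    while True:
        try:
            # get_nowait() is thread-safe: no two threads receive the same path.
            path = pending.get_nowait()
        except queue.Empty:
            return
        with open(path, "r") as src:
            text = src.read()  # 'text' is local, so each thread has its own copy
        with write_lock:  # only one thread appends to the output at a time
            with open(out_path, "a") as dst:
                dst.write(text)


def merge_files(paths, out_path, num_threads=2):
    """Concatenate the given files into out_path (order is not guaranteed)."""
    pending = queue.Queue()
    for path in paths:
        pending.put(path)
    open(out_path, "w").close()  # create or truncate the output file
    write_lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(pending, out_path, write_lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until the thread finishes, instead of "while 1: pass"
```

A call such as merge_files(glob.glob("*.txt"), "myfile") would correspond to the code in the question. Because the threads finish in an unpredictable order, the files may appear in the output in any order.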

General thought: although it is a more advanced architecture, you may want to learn and use Celery for these types of tasks in a production environment: http://www.celeryproject.org/
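
For the 400-file case, a lighter option than a task queue is the thread pool in the Python 3 standard library. The sketch below is only an illustration under assumptions: process is a hypothetical stand-in for "some operations" on each file (here it just strips trailing whitespace), and merge_with_pool is likewise an invented name. The worker threads only read and process; the main thread does all the writing, so no lock is needed.

```python
# Python 3 sketch; process and merge_with_pool are hypothetical names.
from concurrent.futures import ThreadPoolExecutor


def process(path):
    """Placeholder per-file step: read the file and strip trailing whitespace."""
    with open(path, "r") as src:
        return "".join(line.rstrip() + "\n" for line in src)


def merge_with_pool(paths, out_path, max_workers=10):
    """Process files in a thread pool and write the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as 'paths',
        # regardless of which thread finishes first.
        results = pool.map(process, paths)
        with open(out_path, "w") as dst:
            for text in results:
                dst.write(text)
```

Leaving the with block waits for all worker threads, so neither an explicit join nor a busy loop is required. The value of max_workers is a tuning knob; 10 is taken from the question, not a recommendation.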

2ps