1

I'm trying to learn how to use threads in Python to save a list of objects. I'm starting from this code:

import threading
import urllib
from tempfile import NamedTemporaryFile

singlelock = threading.Lock() 

class download(threading.Thread):
    def __init__(self, sitecode, lista):
        threading.Thread.__init__(self)
        self.sitecode = sitecode
        self.status = -1

    def run(self):
        url = "http://waterdata.usgs.gov/nwis/monthly?referred_module=sw&site_no="
        url += self.sitecode 
        url += "&PARAmeter_cd=00060&partial_periods=on&format=rdb&submitted_form=parameter_selection_list"
        tmp = NamedTemporaryFile(delete=False)
        urllib.urlretrieve(url, tmp.name)
        print "loaded Monthly data for sitecode : ",  self.sitecode 
        lista.append(tmp.name)
        print lista

sitecodelist = ["01046500", "01018500", "01010500", "01034500", "01059000", "01066000", "01100000"]
lista = []


for k in sitecodelist:
    get_data = download(k,lista)
    get_data.start()

It just prints the list as it is built during thread execution, whereas I'm trying to return it.

Reading the documentation, I'm looking at how to use threading.Lock() and its methods acquire() and release(), which seem to be the solution to my issue, but I'm still far from understanding how to implement them in my example code.

Thanks so much for any hints!

miku
epifanio

2 Answers

3

First of all, we should all quickly review what threads are: http://en.wikipedia.org/wiki/Thread_%28computer_science%29

OK, so threads share memory, which means this should be easy! That is both the good and the bad thing about threads: sharing is easy, and it is dangerous. (Threads are also lightweight for the OS.)

Now, if you are using Python with CPython, you should familiarize yourself with the global interpreter lock:

http://docs.python.org/glossary.html#term-global-interpreter-lock

Also, from http://docs.python.org/library/threading.html:

CPython implementation detail: Due to the Global Interpreter Lock, in CPython only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

What does this mean? If your task isn't I/O-bound, threading won't gain you anything: whenever a thread runs Python code it must hold the global lock, so only one thread can make progress at a time. With I/O-bound tasks, the OS can schedule other threads, because the global lock is released while a thread waits for the I/O to complete. One caveat: you could be calling into code that releases the GIL, and in that case threading will also perform well (hence the reference to "performance-oriented libraries" above).
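To make the I/O-bound case concrete, here is a minimal sketch. It is written for Python 3 (the code in this thread is Python 2), and fake_io and run_in_threads are hypothetical names; time.sleep stands in for a blocking download because it also releases the GIL while it waits.

```python
import threading
import time

def fake_io(delay):
    # Assumption: time.sleep stands in for a blocking network call such as
    # urllib.urlretrieve. Like real blocking I/O, it releases the GIL.
    time.sleep(delay)

def run_in_threads(count, delay):
    """Start `count` threads that each wait `delay` seconds; return wall time."""
    threads = [threading.Thread(target=fake_io, args=(delay,))
               for _ in range(count)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished
    return time.time() - start

elapsed = run_in_threads(5, 0.2)
print("5 waits of 0.2s each finished in %.2fs" % elapsed)
```

Run one after another, the five waits would take about a second. In threads they overlap, so the total should be much closer to a single wait.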

Thankfully, Python makes managing the shared memory a simple task, and there is already good documentation on how to do so, though it took me a little while to find it. If you have any further questions, let us know.

In [83]: import _threading_local

In [84]: help(_threading_local)
Help on module _threading_local:

NAME
    _threading_local - Thread-local objects.

FILE
    /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/_threading_local.py

MODULE DOCS
    http://docs.python.org/library/_threading_local

DESCRIPTION
    (Note that this module provides a Python version of the threading.local
     class.  Depending on the version of Python you're using, there may be a
     faster one available.  You should always import the `local` class from
     `threading`.)

    Thread-local objects support the management of thread-local data.
    If you have data that you want to be local to a thread, simply create
    a thread-local object and use its attributes:

      >>> mydata = local()
      >>> mydata.number = 42
      >>> mydata.number
      42

    You can also access the local-object's dictionary:

      >>> mydata.__dict__
      {'number': 42}
      >>> mydata.__dict__.setdefault('widgets', [])
      []
      >>> mydata.widgets
      []

    What's important about thread-local objects is that their data are
    local to a thread. If we access the data in a different thread:

      >>> log = []
      >>> def f():
      ...     items = mydata.__dict__.items()
      ...     items.sort()
      ...     log.append(items)
      ...     mydata.number = 11
      ...     log.append(mydata.number)

      >>> import threading
      >>> thread = threading.Thread(target=f)
      >>> thread.start()
      >>> thread.join()
      >>> log
      [[], 11]

    we get different data.  Furthermore, changes made in the other thread
    don't affect data seen in this thread:

      >>> mydata.number
      42

    Of course, values you get from a local object, including a __dict__
    attribute, are for whatever thread was current at the time the
    attribute was read.  For that reason, you generally don't want to save
    these values across threads, as they apply only to the thread they
    came from.

    You can create custom local objects by subclassing the local class:

      >>> class MyLocal(local):
      ...     number = 2
      ...     initialized = False
      ...     def __init__(self, **kw):
      ...         if self.initialized:
      ...             raise SystemError('__init__ called too many times')
      ...         self.initialized = True
      ...         self.__dict__.update(kw)
      ...     def squared(self):
      ...         return self.number ** 2

    This can be useful to support default values, methods and
    initialization.  Note that if you define an __init__ method, it will be
    called each time the local object is used in a separate thread.  This
    is necessary to initialize each thread's dictionary.

    Now if we create a local object:

      >>> mydata = MyLocal(color='red')

    Now we have a default number:

      >>> mydata.number
      2

    an initial color:

      >>> mydata.color
      'red'
      >>> del mydata.color

    And a method that operates on the data:

      >>> mydata.squared()
      4

    As before, we can access the data in a separate thread:

      >>> log = []
      >>> thread = threading.Thread(target=f)
      >>> thread.start()
      >>> thread.join()
      >>> log
      [[('color', 'red'), ('initialized', True)], 11]

    without affecting this thread's data:

      >>> mydata.number
      2
      >>> mydata.color
      Traceback (most recent call last):
      ...
      AttributeError: 'MyLocal' object has no attribute 'color'

    Note that subclasses can define slots, but they are not thread
    local. They are shared across threads:

      >>> class MyLocal(local):
      ...     __slots__ = 'number'

      >>> mydata = MyLocal()
      >>> mydata.number = 42
      >>> mydata.color = 'red'

    So, the separate thread:

      >>> thread = threading.Thread(target=f)
      >>> thread.start()
      >>> thread.join()

    affects what we see:

      >>> mydata.number
      11

    >>> del mydata

And just in case... an example using your style above.

In [40]: class TestThread(threading.Thread):
    ...:     report = list() #shared across threads
    ...:     def __init__(self):
    ...:         threading.Thread.__init__(self)
    ...:         self.io_bound_variation = random.randint(1,100)
    ...:     def run(self):
    ...:         start = datetime.datetime.now()
    ...:         print '%s - io_bound_variation - %s' % (self.name, self.io_bound_variation)
    ...:         for _ in range(0, self.io_bound_variation):
    ...:             with open(self.name, 'w') as f:
    ...:                 for i in range(10000):
    ...:                     f.write(str(i) + '\n')
    ...:         print '%s - finished' % (self.name)
    ...:         end = datetime.datetime.now()
    ...:         print '%s took %s time' % (self.name, end - start)
    ...:         self.report.append(end - start)
    ...:             

And a run of three threads with output.

    In [43]: threads = list()
        ...: for i in range(3):
        ...:     t = TestThread()
        ...:     t.start()
        ...:     threads.append(t)
        ...: 
        ...: for thread in threads:
        ...:     thread.join()
        ...:     
        ...: for thread in threads:
        ...:     print thread.report
        ...:     
    Thread-28 - io_bound_variation - 76
    Thread-29 - io_bound_variation - 83
    Thread-30 - io_bound_variation - 80
    Thread-28 - finished
    Thread-28 took 0:00:08.173861 time
    Thread-30 - finished
    Thread-30 took 0:00:08.407255 time
    Thread-29 - finished
    Thread-29 took 0:00:08.491480 time
    [datetime.timedelta(0, 5, 733093), datetime.timedelta(0, 6, 253811), datetime.timedelta(0, 6, 440410), datetime.timedelta(0, 4, 342053), datetime.timedelta(0, 5, 520407), datetime.timedelta(0, 5, 948238), datetime.timedelta(0, 8, 173861), datetime.timedelta(0, 8, 407255), datetime.timedelta(0, 8, 491480)]
    [datetime.timedelta(0, 5, 733093), datetime.timedelta(0, 6, 253811), datetime.timedelta(0, 6, 440410), datetime.timedelta(0, 4, 342053), datetime.timedelta(0, 5, 520407), datetime.timedelta(0, 5, 948238), datetime.timedelta(0, 8, 173861), datetime.timedelta(0, 8, 407255), datetime.timedelta(0, 8, 491480)]
    [datetime.timedelta(0, 5, 733093), datetime.timedelta(0, 6, 253811), datetime.timedelta(0, 6, 440410), datetime.timedelta(0, 4, 342053), datetime.timedelta(0, 5, 520407), datetime.timedelta(0, 5, 948238), datetime.timedelta(0, 8, 173861), datetime.timedelta(0, 8, 407255), datetime.timedelta(0, 8, 491480)]

You may wonder why report has more than three elements. That is because I ran the above for-loop code three times in my interpreter. To fix this "bug", I need to reset the shared variable to an empty list before each run:

TestThread.report = list()

This illustrates why threads can become unwieldy.
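One way to avoid the class-level shared list, and to actually get the list back as the question asks, is to pass a results list and a lock into each thread and join all threads before reading it. This is only a sketch under assumptions: it is written for Python 3, fetch is a hypothetical stand-in for the real urllib.urlretrieve call, and Download and download_all are illustrative names.

```python
import threading

def fetch(sitecode):
    # Hypothetical placeholder for the real download; it only builds a name.
    return "monthly_%s.rdb" % sitecode

class Download(threading.Thread):
    def __init__(self, sitecode, results, lock):
        threading.Thread.__init__(self)
        self.sitecode = sitecode
        self.results = results  # list shared by all threads of one run
        self.lock = lock

    def run(self):
        name = fetch(self.sitecode)
        # "with lock" calls acquire() on entry and release() on exit,
        # even if the body raises an exception.
        with self.lock:
            self.results.append(name)

def download_all(sitecodes):
    results = []             # fresh list per call, so runs do not accumulate
    lock = threading.Lock()
    threads = [Download(code, results, lock) for code in sitecodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()             # results is complete only after every join
    return results

files = download_all(["01046500", "01018500", "01010500"])
print(sorted(files))
```

In CPython a single list.append is generally treated as atomic, so the lock here is defensive. It becomes necessary once the critical section does more than one step, for example checking a value and then updating it. The order of the results depends on which thread finishes first, hence the sorted() call.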

Derek Litz
2

This doesn't answer your question directly, but here is a workaround using the multiprocessing module instead:

from multiprocessing import Pipe, Process
import urllib
from tempfile import NamedTemporaryFile


def download(conn, sitecodelist):
    lista = []
    for k in sitecodelist:
        url = 'http://waterdata.usgs.gov/nwis/monthly?referred_module=sw&site_no='
        url += k
        url += '&PARAmeter_cd=00060&partial_periods=on&format=rdb&submitted_form=parameter_selection_list'
        tmp = NamedTemporaryFile(delete=False)
        urllib.urlretrieve(url, tmp.name)
        print 'loaded Monthly data for sitecode : ',  k
        lista.append(tmp.name)
    conn.send(lista)

sitecodelist = ['01046500', '01018500', '01010500', '01034500', '01059000', '01066000', '01100000']

parent, child = Pipe()
process = Process(target=download, args=(child, sitecodelist))
process.start()

data = parent.recv()
print 'Data: ', data
process.join()

And just in case, here is a question about choosing between multiprocessing and threading in your Python script: multiprocess or threading in python?

Hope that helps!

César
  • This will only spawn a single subprocess, right? The example from the OP spawns a thread for every sitecode in sitecodelist. – Derek Litz Dec 10 '11 at 17:31
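On the one-worker-per-sitecode point, a pool can hand each site code to its own worker and return the collected results directly. The sketch below is written for Python 3 and uses multiprocessing.pool.ThreadPool (threads, not processes). fetch is a hypothetical stand-in for the download, and the shortened site code list is only for illustration.

```python
from multiprocessing.pool import ThreadPool

def fetch(sitecode):
    # Hypothetical placeholder for urllib.urlretrieve; returns a fake file name.
    return "monthly_%s.rdb" % sitecode

sitecodelist = ["01046500", "01018500", "01010500", "01034500"]

# One worker per site code, mirroring the original one-thread-per-site design.
pool = ThreadPool(len(sitecodelist))
try:
    # map() blocks until all workers finish and keeps the input order.
    lista = pool.map(fetch, sitecodelist)
finally:
    pool.close()
    pool.join()

print(lista)
```

multiprocessing.Pool offers the same map interface with real subprocesses. In that case the worker function must be importable at module level, and on platforms that spawn rather than fork, the pool creation should sit under an if __name__ == "__main__": guard.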