
I have a large DataFrame with a column "image" that holds the file names (with extension "jpg" or "jpeg") of a large number of files. Some files exist with the recorded extension, but others do not. I therefore have to check whether each "image" value is correct. This takes about 30 seconds in a single process, so I decided to parallelize the check with multiprocessing.

I have written some code in Python (3.6.5) to do this. It runs fine when I execute it from the command line, but an error occurs when I execute it in Spyder (3.2.8). How can I avoid this?

Here is my code:

# -*- coding: utf-8 -*-
import multiprocessing
import numpy as np
import os
import pandas as pd
from multiprocessing import Pool

#some large scale DataFrame, the size is about (600, 15)
waferDf = pd.DataFrame({"image": ["aaa.jpg", "bbb.jpeg", "ccc.jpg", "ddd.jpeg", "eee.jpg", "fff.jpg", "ggg.jpeg", "hhh.jpg"]})
waferDf["imagePath"] = np.nan

#to parallelize whole process
def parallelize(func, df, uploadedDirPath):
    partitionCount = multiprocessing.cpu_count()
    partitions = np.array_split(df, partitionCount)
    paras = [(part, uploadedDirPath) for part in partitions]
    pool = Pool(partitionCount)
    df = pd.concat(pool.starmap(func, paras))
    pool.close()
    pool.join()
    return df

#check whether files exist
def checkImagePath(partialDf, uploadedDirPath):
    for index in partialDf.index.values:
        print(index)
        fileName = partialDf.loc[index, "image"]
        #candidate paths: the name as ".jpeg" first, then the ".jpg" variant
        jpegPath = os.path.join(uploadedDirPath, fileName.replace(".jpeg\n", ".jpeg"))
        jpgPath = os.path.join(uploadedDirPath, fileName.replace(".jpeg\n", ".jpg"))
        if os.path.exists(jpegPath):
            #assign with a single .loc call; chained indexing would write to a copy
            partialDf.loc[index, "imagePath"] = jpegPath
        elif os.path.exists(jpgPath):
            partialDf.loc[index, "imagePath"] = jpgPath
        print(partialDf)
    return partialDf

if __name__ == '__main__':
    waferDf = parallelize(checkImagePath, waferDf, "/eap/uploadedFiles/")
    print(waferDf)

And here is the error:

runfile('C:/Users/00048564/Desktop/Multi-Threading.py', wdir='C:/Users/00048564/Desktop')
Traceback (most recent call last):

  File "<ipython-input-24-732edc0ea3ea>", line 1, in <module>
    runfile('C:/Users/00048564/Desktop/Multi-Threading.py', wdir='C:/Users/00048564/Desktop')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/00048564/Desktop/Multi-Threading.py", line 35, in <module>
    waferDf = parallelize(checkImagePath, waferDf, "/eap/uploadedFiles/")

  File "C:/Users/00048564/Desktop/Multi-Threading.py", line 17, in parallelize
    pool = Pool(partitionCount)

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)

  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 172, in get_preparation_data
    main_mod_name = getattr(main_module.__spec__, "name", None)

AttributeError: module '__main__' has no attribute '__spec__'

1 Answer


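In most cases, when you run a Python script from the command line with python YourFile.py, the script is executed as the main program. That is why multiprocessing and the other modules shown in your traceback could start their worker processes without trouble.

However, Spyder runs the file differently (your traceback shows it going through runfile and exec in sitecustomize.py), so running the script as the main program may not work the same way there. The final frame fails while reading main_module.__spec__, which suggests that the __main__ module Spyder sets up lacks that attribute.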

Have you been able to successfully run any other script from Spyder that contains the following guard?

if __name__ == '__main__':

See the accepted answer in this thread: https://stackoverflow.com/a/419185/9968677
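For what it's worth, a workaround that is often reported for this exact AttributeError is to define __spec__ yourself inside the guard, before the pool is created. The sketch below assumes that this is enough for your Spyder version, which I have not verified. The helper names build_paths and check_paths are made up for illustration and are not from your script.

```python
import multiprocessing
import os


def build_paths(directory, names):
    """Join each file name onto the upload directory."""
    return [os.path.join(directory, name) for name in names]


def check_paths(paths, processes=2):
    """Return one True/False per path, computed in worker processes."""
    with multiprocessing.Pool(processes) as pool:
        return pool.map(os.path.exists, paths)


if __name__ == '__main__':
    # Workaround (assumption): when Spyder's runfile executes the script,
    # __main__ has no __spec__, and multiprocessing reads that attribute
    # while preparing to spawn workers on Windows. Defining it avoids the
    # AttributeError; it is already None when run from the command line.
    __spec__ = None
    paths = build_paths("/eap/uploadedFiles/", ["aaa.jpg", "bbb.jpeg"])
    print(check_paths(paths))
```

If this does not help, running the script in an external system terminal instead of the IPython console may be the simpler route.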

Muthu
  • Thanks for your reply, Muthu, but I have tried this solution and got the same error. I guess it may be caused by the IDE itself... – Albert Chang Sep 17 '18 at 02:01
  • Maybe this is the answer: https://stackoverflow.com/questions/48078722/no-multiprocessing-print-outputs-spyder – Albert Chang Sep 19 '18 at 07:51
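Since the check consists only of file-system lookups, one option that avoids spawning processes altogether is a thread pool: no child process is started, so multiprocessing never has to inspect __main__. This is a minimal sketch, not a drop-in replacement. The names resolve_image_path and resolve_all are hypothetical, and it assumes the rule is "try .jpeg, then .jpg" for every name, which is a simplification of the original function.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_image_path(directory, name):
    """Return the existing path for name, trying .jpeg then .jpg, else None."""
    # strip() drops a trailing newline such as the one in ".jpeg\n"
    stem, _ = os.path.splitext(name.strip())
    for ext in (".jpeg", ".jpg"):
        candidate = os.path.join(directory, stem + ext)
        if os.path.exists(candidate):
            return candidate
    return None


def resolve_all(directory, names, workers=8):
    """Resolve every name concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(lambda n: resolve_image_path(directory, n), names))
```

With the DataFrame from the question, the result could be stored with something like waferDf["imagePath"] = resolve_all("/eap/uploadedFiles/", waferDf["image"]). Whether threads are faster than a single process depends on where the files live, so it is worth timing both.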