
I'm populating a matrix using a conditional lookup from a file. The file is extremely large (2,500,000 records) and is saved as a dataframe ('file'). Each matrix row operation (lookup) is independent of the others. Is there any way I could parallelize this process?

I'm working in pandas and python. My current approach is rather naive:

for r in row:
    for c in column:
        # Count distinct citations for this (inventor, year) pair
        num = file[(file['Unique_Inventor_Number'] == r) & (file['AppYearStr'] == c)]['Citation'].tolist()
        num = len(list(set(num)))
        d.set_value(r, c, num)
Joseph Chotard
FlyingAura

2 Answers


For 2.5 million records you should be able to do

res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()

The matrix should be available in

res.unstack(level=1).fillna(0).values

I'm not sure if this is the fastest approach, but it should be significantly faster than your implementation.

user1827356
  • The groupby operation does indeed speed up the process. Can you briefly explain how this happens? However, `res.unstack(level=1).fillna(0).values` does not result in a matrix. – FlyingAura Feb 16 '17 at 12:32
  • I think to create a matrix, just `res.unstack(level=1).fillna(0)` would suffice. – FlyingAura Feb 16 '17 at 12:37

[EDIT] As Roland mentioned in the comments, under the standard Python implementation this post does not offer any solution to improve CPU-bound performance.

In the standard Python implementation, threads do not really improve performance on CPU-bound tasks. There is a "Global Interpreter Lock" that enforces that only one thread at a time can be executing Python bytecode. This was done to keep the complexity of memory management down.

Have you tried to use different Threads for the different functions?

Let's say you separate your dataframe into columns and create multiple threads. Then you assign each thread to apply a function to a column. If you have enough processing power, you might save a lot of time:

from threading import Thread
import pandas as pd
import numpy as np
from queue import Queue
from time import time

# Those will be used afterwards
N_THREAD = 8
q = Queue()
df2 = pd.DataFrame()  # The output of the script 

# You create the job that each thread will do
def apply(series, func):
    df2[series.name] = series.map(func)


# You define the context of the jobs
def threader():
    while True:
        worker = q.get()
        apply(*worker)
        q.task_done()

def main():

    # You import your data to a pandas dataframe
    df = pd.DataFrame(np.random.randn(100000,4), columns=['A', 'B', 'C', 'D'])

    # You create the functions you will apply to your columns
    func1 = lambda x: x<10
    func2 = lambda x: x==0
    func3 = lambda x: x>=0
    func4 = lambda x: x<0
    func_rep = [func1, func2, func3, func4]

    for x in range(N_THREAD):  # You create your threads
        t = Thread(target=threader)
        t.daemon = True  # So the program can exit once the queue is drained
        t.start()

    # Now is the tricky part: you enclose the arguments that
    # will be passed to the function into a tuple, which you
    # put into a queue. q.join() then blocks until every
    # queued job has been processed
    t0 = time()
    for i, func in enumerate(func_rep):
        worker = (df.iloc[:, i], func)
        q.put(worker)

    q.join()
    print("Entire job took: {:.3f} s.".format(time() - t0))

if __name__ == '__main__':
    main()
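Given the GIL caveat above, true CPU parallelism needs processes rather than threads. The following is a minimal sketch, not part of the original answer, using `concurrent.futures.ProcessPoolExecutor` to split the independent per-inventor lookups across workers; the toy dataframe and the `count_chunk` helper are illustrative assumptions, not the asker's real data:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def count_chunk(args):
    """Distinct-citation counts for one chunk of inventors (runs in a worker process)."""
    chunk, frame = args
    sub = frame[frame['Unique_Inventor_Number'].isin(chunk)]
    return sub.groupby(['Unique_Inventor_Number', 'AppYearStr'])['Citation'].nunique()


# Toy stand-in for the large 'file' dataframe
rng = np.random.default_rng(0)
file = pd.DataFrame({
    'Unique_Inventor_Number': rng.integers(0, 100, 10_000),
    'AppYearStr': rng.integers(2000, 2005, 10_000).astype(str),
    'Citation': rng.integers(0, 50, 10_000),
})

# Each inventor's rows are independent, so split the inventors across processes
inventors = file['Unique_Inventor_Number'].unique()
chunks = np.array_split(inventors, 4)

with ProcessPoolExecutor(max_workers=4) as ex:
    parts = list(ex.map(count_chunk, [(c, file) for c in chunks]))

# Reassemble the per-chunk results into the inventor-by-year matrix
matrix = pd.concat(parts).unstack(level=1).fillna(0)
```

Note that each worker receives a pickled copy of the dataframe, so for millions of rows you would want to ship only the relevant slice to each process; this sketch only illustrates that processes, unlike threads, run the pandas work in parallel.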
qmeeus
  • In the standard Python implementation, threads do not really improve performance on CPU-bound tasks. There is a "Global Interpreter Lock" that enforces that only *one* thread at a time can be executing Python bytecode. This was done to keep the complexity of memory management down. – Roland Smith Feb 16 '17 at 18:09
  • Thank you for this explanation, I did not know this fact! – qmeeus Feb 22 '17 at 18:40
  • In cases where functions block such as I/O (not so likely with a Pandas apply function, but still plausible), using threads in Python can still lead to very effective speedups – Tim McNamara Nov 05 '17 at 21:36