Parallel Computing in Python

Pitfall 1: Multithreading

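In C or Java we are all used to multithreading for parallel computation, and Python also offers a threading interface. But once you actually try it, you find that performance does not improve; it actually gets worse!

The reason is that multithreading in Python (CPython, to be precise) is not truly parallel: because of the global interpreter lock (GIL), only one thread executes Python bytecode at any given moment. For CPU-bound work, performance therefore cannot improve, and the overhead of switching between threads makes it drop!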

The Python threading module uses threads instead of processes. Threads run in the same memory heap, whereas processes run in separate memory heaps. This makes sharing information, such as object instances, harder between processes. Because threads share one heap, multiple threads can write to the same memory location at once, which is why the global interpreter lock (GIL) in CPython was created as a mutex to prevent that from happening.

See also the discussion in "Python's Hardest Problem".

So, if you want real parallel computation, use multiple processes rather than multiple threads!
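To see the effect for yourself, a rough timing sketch along these lines may help. The workload `cpu_bound`, the helper `run_with`, and the input sizes are illustrative assumptions, not something from the original experiments; actual timings depend on the machine and the Python version.

```python
import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def cpu_bound(n):
    """Pure-Python busy work (hypothetical workload): sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_with(pool_cls, inputs, workers=4):
    """Map cpu_bound over inputs with the given pool class; return results and seconds."""
    pool = pool_cls(workers)
    start = time.time()
    try:
        results = pool.map(cpu_bound, inputs)
    finally:
        pool.close()
        pool.join()
    return results, time.time() - start


if __name__ == "__main__":
    inputs = [2000000] * 4
    _, t_threads = run_with(ThreadPool, inputs)  # threads: serialized by the GIL
    _, t_procs = run_with(Pool, inputs)          # processes: one interpreter each
    print("threads:   {:.2f}s".format(t_threads))
    print("processes: {:.2f}s".format(t_procs))
```

On a multi-core machine the process pool would usually be expected to finish noticeably faster for this kind of CPU-bound loop, while the thread pool stays close to the single-threaded time.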

Pitfall 2: Picklability

The usual way to use multiple processes is a process pool (Pool). Demo code:

import multiprocessing

def func(msg):
    print("msg: {}".format(msg))

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=3)
    for i in range(3):
        msg = "hello {}".format(i)
        pool.apply_async(func, (msg,))

    print("Mark~ Mark~ Mark~~~~~~~~~~~~~~~~~~~~~~")
    pool.close()
    pool.join()  # join() must come after close() or terminate()
    print("Sub-process(es) done.")

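In most cases this kind of code works fine, but it requires everything passed to the pool to be picklable, and many objects from numpy and related libraries do not meet that requirement. Running such code then fails with an error like this: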

PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

Unfortunately, I still have not managed to solve this error.
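For narrowing down errors of this kind, it can help to test picklability directly before handing anything to a pool. The `__builtin__.function` part of the message suggests that the object pickle choked on was a function (for example a lambda or a nested function) rather than array data, though that is an inference from the message, not something verified against the original code. The helper `is_picklable` and the sample function `square` below are hypothetical names used only for illustration.

```python
import pickle


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def square(x):
    """A module-level function: pickled by reference to its importable name."""
    return x * x


if __name__ == "__main__":
    print(is_picklable(square))           # module-level function
    print(is_picklable(lambda x: x * x))  # lambda: no importable name to refer to
    print(is_picklable([1, 2, 3]))        # plain data
```

If the check fails for the function passed to `apply_async` or `map`, moving that function to module level is usually the first thing to try.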

I had to give up on Pool and use multiprocessing.Process directly instead. Demo code:

from multiprocessing import Process
import os
import time

def run_proc(name):
    time.sleep(3)
    print('Run child process %s (%s)...' % (name, os.getpid()))

if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    processes = list()
    for i in range(5):
        p = Process(target=run_proc, args=('test',))
        print('Process will start.')
        p.start()
        processes.append(p)

    # Wait for every child to finish before reporting completion.
    for p in processes:
        p.join()
    print('Process end.')
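Unlike Pool, a bare Process does not hand a return value back to the parent. One possible way to collect results is a multiprocessing.Queue, sketched below; the worker `square_worker` and its toy computation are assumptions made up for this illustration.

```python
import os
from multiprocessing import Process, Queue


def square_worker(n, results):
    """Compute n * n in the child and send (n, result, pid) to the parent."""
    results.put((n, n * n, os.getpid()))


if __name__ == "__main__":
    results = Queue()
    processes = [Process(target=square_worker, args=(n, results)) for n in range(5)]
    for p in processes:
        p.start()

    # Drain the queue before join(): a child still flushing queued data cannot exit.
    collected = [results.get() for _ in processes]
    for p in processes:
        p.join()

    for n, squared, pid in sorted(collected):
        print("{} squared is {} (computed in process {})".format(n, squared, pid))
```

Reading from the queue before calling `join()` follows the ordering that the multiprocessing documentation recommends for processes that put items on a queue.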

See also the detailed explanation in "A Summary of Python Multiprocessing Usage".

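Pitfall 3: Shared Variables

As you would expect, separate processes do not share variables. Python's multiprocessing module gives us several data types that can be shared: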

import multiprocessing

# 1. Numeric value
num = multiprocessing.Value("d", 10.0)  # 'd' for 'double'

# 2. Array
arr = multiprocessing.Array("i", [1, 2, 3, 4, 5])  # 'i' for 'int'

# 3. dict, list (proxies served by a single manager process)
manager = multiprocessing.Manager()
mydict = manager.dict()
mylist = manager.list(range(5))
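The declarations above only create the shared objects. A minimal sketch of several processes actually updating a shared Value and Array might look like the following; the worker functions `add_to_counter` and `double_in_place` are hypothetical, and the lock from `get_lock()` is used because `+=` on a shared value is a read-modify-write, not an atomic operation.

```python
import multiprocessing


def add_to_counter(counter, times):
    """Increment a shared Value; get_lock() guards each read-modify-write."""
    for _ in range(times):
        with counter.get_lock():
            counter.value += 1


def double_in_place(shared_array):
    """Double every element of a shared Array in place."""
    with shared_array.get_lock():
        for i in range(len(shared_array)):
            shared_array[i] *= 2


if __name__ == "__main__":
    counter = multiprocessing.Value("i", 0)
    numbers = multiprocessing.Array("i", [1, 2, 3, 4, 5])

    workers = [
        multiprocessing.Process(target=add_to_counter, args=(counter, 1000))
        for _ in range(4)
    ]
    workers.append(multiprocessing.Process(target=double_in_place, args=(numbers,)))
    for p in workers:
        p.start()
    for p in workers:
        p.join()

    print(counter.value)  # 4000
    print(numbers[:])     # [2, 4, 6, 8, 10]
```

Without the lock around `counter.value += 1`, increments from different processes can overwrite each other and the final count may come out lower than expected.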

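The first argument of Value and Array specifies the data type. The complete mapping is shown below (the Python types are those of Python 2; in Python 3, long is simply int):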

Type code   C Type            Python Type
'c'         char              character
'b'         signed char       int
'B'         unsigned char     int
'u'         Py_UNICODE        Unicode character
'h'         signed short      int
'H'         unsigned short    int
'i'         signed int        int
'I'         unsigned int      long
'l'         signed long       int
'L'         unsigned long     long
'f'         float             float
'd'         double            float
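To make the role of the type code concrete, here is a small sketch. It relies on the documented fact that Value accepts either a one-character type code or a ctypes type; the variable names are made up for illustration.

```python
import ctypes
import multiprocessing

# A one-character type code and the matching ctypes type are interchangeable.
by_code = multiprocessing.Value("d", 10.0)
by_ctype = multiprocessing.Value(ctypes.c_double, 10.0)
print(by_code.value == by_ctype.value)  # True

# 'f' stores a C float (single precision), so 0.1 is not preserved exactly.
single = multiprocessing.Value("f", 0.1)
print(single.value == 0.1)  # False

# An integer type code rejects a float initial value.
rejected = False
try:
    multiprocessing.Value("i", 1.5)
except TypeError as exc:
    rejected = True
    print("TypeError: {}".format(exc))
```

Picking 'd' rather than 'f' for floating-point data avoids the precision loss shown in the second case, at the cost of twice the storage per element.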

See also:

An example

python + hdf5 + multiprocessing