At my work we have a Python library that interfaces with all of our API microservices (which are written in Java/Scala). It is a very useful tool for debugging and working with our platform, so I spend a lot of my time in a Python REPL.
Oftentimes I find myself needing to hit multiple APIs in parallel. Since each request blocks until it completes, I was looking for an easy way to parallelize the calls without writing many lines of code (I am working in a REPL, after all).
It turns out Python ships with a multiprocessing.dummy module, described in the docs as follows:
multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.
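To make the "wrapper around the threading module" part concrete, here is a minimal sketch (mine, not from the docs) showing that the pool's workers are plain threads inside the current process, not separate processes:

import os
import threading
from multiprocessing.dummy import Pool

def whoami(i):
    # Every task reports the same PID but a different worker thread
    return "task {} ran in pid {} on {}".format(i, os.getpid(), threading.current_thread().name)

pool = Pool(3)
for line in pool.map(whoami, range(3)):
    print line
pool.close()
pool.join()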
With this module, calls are parallelized in just a few lines, as follows:

from multiprocessing.dummy import Pool

pool = Pool(10)  # number of concurrent threads
async_results = pool.map(some_sync_function, some_list_of_arguments)
pool.close()  # no more tasks will be submitted to the pool
pool.join()   # wait for all the worker threads to finish
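For instance, from a REPL this is enough to fan the same request out over several URLs in parallel (the endpoint here is just a stand-in; substitute your own services):

import requests
from multiprocessing.dummy import Pool

urls = ["https://httpbin.org/delay/1"] * 5  # placeholder endpoints that each take ~1s

pool = Pool(5)
statuses = pool.map(lambda u: requests.get(u).status_code, urls)
pool.close()
pool.join()
print statuses  # all five requests overlap, so this takes ~1s rather than ~5s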
I created the following real-world example to show how it works.
Let's say I have a list of zip codes from 94400 to 94420 and I want to check which of them are valid US zip codes. I can use the free Geocoding API from Google to query location data for each zip code; you can even try it in your browser: https://maps.googleapis.com/maps/api/geocode/json?address=94401
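As a quick sketch of what a single lookup returns, here is one synchronous request against that endpoint (these are the same response fields the full script below checks):

import requests

url = "https://maps.googleapis.com/maps/api/geocode/json?address=94401"
results = requests.get(url).json()['results']
if results:
    print results[0]['types']                    # contains 'postal_code' for a valid zip
    print str(results[0]['address_components'])  # includes "United States" for US matches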
For my range of zip codes, I run the program twice: once using a synchronous flow and once using 10 threads, timing both approaches.
import requests
from multiprocessing.dummy import Pool
import time
from datadiff import diff

def getzip(code):
    try:
        code = str(code)
        url = "https://maps.googleapis.com/maps/api/geocode/json?address={}".format(code)
        res = requests.get(url).json()['results']
        if len(res) < 1:  # Empty result, most likely rate-limited: re-try
            print "Retrying"
            return getzip(code)
        iszip = 'postal_code' in res[0]['types'] and "United States" in str(res[0]['address_components'])
    except Exception as e:
        print "In error"
        print e
        iszip = False
    return (code, iszip)

ziprange = range(94400, 94420)
print "Range is: " + str(len(ziprange))

print "Using one thread"
start = time.time()
syncres = [getzip(c) for c in ziprange]
print "took " + str(time.time() - start)

print "Using multiple threads"
start = time.time()
# Magic
pool = Pool(10)
asyncres = pool.map(getzip, ziprange)
pool.close()
pool.join()
asyncres = sorted(asyncres)
# End of Magic
print "took " + str(time.time() - start)

# Make sure both runs produced the same results
d = diff(syncres, asyncres)
if len(d.diffs) > 0:
    print "diff is"
    print d

for r in asyncres:
    print "Zip code {} is {} US code".format(r[0], "valid" if r[1] else "invalid")
My sample run resulted in the following output:
$ python getzip.py
Range is: 20
Using one thread
took 7.47538208961
Using multiple threads
took 3.59181404114
Zip code 94400 is invalid US code
Zip code 94401 is valid US code
Zip code 94402 is valid US code
Zip code 94403 is valid US code
Zip code 94404 is valid US code
Zip code 94405 is invalid US code
Zip code 94406 is invalid US code
Zip code 94407 is invalid US code
Zip code 94408 is invalid US code
Zip code 94409 is invalid US code
Zip code 94410 is invalid US code
Zip code 94411 is invalid US code
Zip code 94412 is invalid US code
Zip code 94413 is invalid US code
Zip code 94414 is invalid US code
Zip code 94415 is invalid US code
Zip code 94416 is invalid US code
Zip code 94417 is invalid US code
Zip code 94418 is invalid US code
Zip code 94419 is invalid US code
A few things to note:
- Google rate-limits this API, so my code ended up doing a lot of retries. In real-world use cases I see a much bigger speed-up.
- Pool.map actually returns results in the same order as its inputs, even though the tasks finish in an arbitrary order, so the sorted() call above is just defensive. If you use a variant such as Pool.imap_unordered, results arrive in completion order and will need to be sorted after the fact, as the sketch below shows.
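Here is a quick sketch (not part of the original script) illustrating the two orderings:

from multiprocessing.dummy import Pool
import random
import time

def work(n):
    time.sleep(random.random() / 10)  # simulate variable API latency
    return n

pool = Pool(4)
print pool.map(work, range(8))                   # always input order: [0, 1, ..., 7]
print list(pool.imap_unordered(work, range(8)))  # completion order, often shuffled
pool.close()
pool.join()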
In conclusion, I find this approach to be the most user-friendly way to parallelize Python API calls in just a few lines of code.
Update 08/18/2017: A Reddit user improved the code to use logging and to work with Python 3. You can check it out here, with the revision history of this post as a starting point here.