Hmmmm, how can I make this faster? I have an idea: I’ll just run it in parallel.
Luckily I am working with Python, and we have PEP 20:
There should be one— and preferably only one —obvious way to do it.
So what is the obvious way to do it?
There are several popular packages to do this: multiprocessing, subprocess, threading, gevent
:FacePalm:
This talk will cover the main concurrency paradigms, show you the pros and cons of each, and give you a framework for picking the right solution for your project.
Objectives
Attendees will learn the main concurrency options in both Python 2.7 and Python 3, and will leave with a framework for determining which approach is best for them.
Detailed Abstract
Concurrency is hard. As a lay developer, there is a lot of ramping up to figure out how to solve what seem like simple problems:
“How can I check the status of 1000 URLs?”
“How can I run my test suite in parallel?”
“I have millions of jobs on a queue — what is the best way to spawn workers to process them?”
With Python you have many options, and each one does a certain thing well. Here we will explain the tools in our toolbelt so you can pick the right tool for the problem you are trying to solve.
threading: an interface for threads, mutexes and queues
multiprocessing: similar to threading, but offers local and remote concurrency (with some gotchas)
subprocess: allows you to spawn new processes, with minimal memory-sharing support; still great for a lot of things
gevent: a coroutine-based Python networking library that uses greenlets
Outline
Background — This is hard
Threads
Processes
Pipes
GIL
Subprocesses
How to use
Joins
Pipes
Good use cases
Multiprocessing
How to use
Sharing Memory (SyncManager)
Handling Interrupts
Good use cases
Gevent
How to use
Monkey Patching
Good Use Cases
Threading
How to use
Locks, Conditions, Timers
Good Use Cases
Summary
“Do not cross the streams”
Decision Framework
What about Tulip?
Additional Notes:
My parallelized version of lettuce is open sourced here
I have other open-source libraries, can find them here
This is my first time speaking at PyCon. I have spoken at Boston Python. My slides for that talk are here
Pragmatic Behavior Driven Development
Description
Love your test suite again (or for the first time).
Have you ever met a developer who loves their test suite for their web app? There is a subtle air of confidence surrounding them. They stand a bit taller; walk with a bit of a swagger.
This talk will put you on that path by showing an approach of Behavior Driven Development using lettuce and selenium.
Detailed Abstract
Behavior Driven Development (BDD) is a development process based on Test-Driven Development, but it makes a significant modification. With TDD, the main goal is to achieve test coverage (what percentage of your code is covered by tests). With BDD, the driving question is “What percentage of my user stories are covered?” The main test unit is the user story.
Story: Returns go to stock
In order to keep track of stock
As a store owner
I want to add items back to stock when they're returned
Scenario 1: Refunded items should be returned to stock
Given a customer previously bought a black sweater from me
And I currently have three black sweaters left in stock
When he returns the sweater for a refund
Then I should have four black sweaters in stock
With an application that leverages BDD, we will have a set of feature files, and a feature file will have a collection of stories written in a specific format: Given, When, Then.
Python has some tooling that helps turn this format into a fully automated test framework. We use lettuce to process feature files and selenium to drive the web browser.
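To give a flavor of how the Given/When/Then lines map to code, here is a hedged sketch of step definitions for the story above. The lettuce wiring is shown in comments; StockRoom is a made-up stand-in for the application under test.

```python
# Hypothetical step definitions for the "Returns go to stock" story.
# With lettuce, each step would be bound to a Gherkin line roughly like:
#
#   from lettuce import step
#
#   @step(u'I currently have (\d+) black sweaters left in stock')
#   def have_sweaters(step, count):
#       ...
#
# StockRoom below is a made-up stand-in for your app.

class StockRoom(object):
    def __init__(self, sweaters):
        self.sweaters = sweaters

    def process_return(self):
        self.sweaters += 1

stock = StockRoom(sweaters=3)   # Given: three black sweaters in stock
stock.process_return()          # When: the customer returns one
print(stock.sweaters)           # Then: 4
```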
Outline:
Intro (5 mins)
Who am I?
20 years of development and testing, and what I have seen over time
Explain why BDD is an important evolution
Writing your first Test (5 mins)
Explain the Gherkin Format (Given, When, Then)
Write your first test – watch it fail
Given / When / Then (15 mins)
Given: Using Factories to set up your data
When: Trigger Events
Then: Writing “Deterministic” assertions
Setting up your testing environment (10 mins)
The terrain.py file
Work around javascript timing issues with a selenium adaptor
Other tools to flesh out your test suite (5 mins)
Coverage
Flake8
Travis (or CI)
Unit Test Frameworks (such as nose)
Additional Notes:
My parallelized version of lettuce is open sourced here
I have other open-source libraries, can find them here
This is my first time speaking at PyCon. I have spoken at Boston Python. My slides for that talk are here
Is it ironic that the documentation for descriptors is not very descriptive?
Descriptors are one of my favorite Python features — but it took me too long to discover them. The documentation and tutorials that I found were too complex for me. So I would like to offer a different approach: a code-first approach.
Agenda
Definition
A Problem that Descriptors Can Solve
CODE!! Solution to the Problem
Reflection on the Code, and an explanation of how we used Descriptors
Definition
In general, a descriptor is an object attribute with “binding behavior”, one whose attribute access has been overridden by methods in the descriptor protocol. Those methods are __get__(), __set__(), and __delete__(). If any of those methods are defined for an object, it is said to be a descriptor.
Read that and stash it in your brain for a few minutes. By the end of this article you’ll grok it.
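To make that definition concrete, here is a minimal, self-contained sketch (a hypothetical example, not from the API we discuss below): reading the attribute routes through the descriptor’s __get__.

```python
# A minimal descriptor sketch: Ten.__get__ runs whenever the
# attribute it is assigned to is read on the owner class.
class Ten(object):
    def __get__(self, instance, owner):
        return 10

class A(object):
    x = 5        # a regular class attribute
    y = Ten()    # a descriptor instance

a = A()
print(a.x)  # 5  -- normal attribute lookup
print(a.y)  # 10 -- lookup routed through Ten.__get__
```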
A Problem that Descriptors Can Solve
Imagine that you need to consume a 3rd party API that returns json documents. Often solutions to this problem look like this …
I dislike this solution. It’s concise, but it breaks separation of concerns: the code consuming the API should not be concerned about the exact path and location of the data element in the json document.
# Hey guys this is a descriptor -- Woot!!
class Extractor(object):
    def __init__(self, *path):
        self.path = path

    def __get__(self, instance, owner):
        return _extract(instance.json_blob, *self.path)

def _extract(doc, *keys):
    """digs into a dict or lists; if anything along the way is None,
    then simply return None"""
    end_of_chain = doc
    for key in keys:
        if isinstance(end_of_chain, dict) and key in end_of_chain:
            end_of_chain = end_of_chain[key]
        elif isinstance(end_of_chain, (list, tuple)) and isinstance(key, int):
            end_of_chain = end_of_chain[key]
        else:
            return None
    return end_of_chain
The real magic of descriptors happens with the signatures of __get__(), __set__(), and __delete__():
object.__get__(self, instance, owner)
object.__set__(self, instance, value)
object.__delete__(self, instance)
Each of these signatures contains a reference to instance, which is the instance of the owner’s class. So in our example:
instance will be an instance of the User class
owner will be the User class
self is the instance of the Descriptor, which in our case holds the path attribute.
Let’s take a look at our example where we made a descriptor Extractor.
user = UserAPI.get_by_id(111)
Here we get an instance of a User object, which has the json_blob stored on it from the GET request.
print(user.name)
Now we call name on that object, which we defined as name = Extractor('result', 'username'). At this point, when we access name it is going to use the Extractor descriptor to extract the value from the json_blob.
The concern of extracting data from a json blob is nicely contained in our Descriptor. I think this is one of many great ways to use descriptors to DRY up your code.
Someone asked: what are user and they? It was a really good question and deserved a better explanation.
The first thing to consider is @authorization_method. This is a method decorator, a really nice Python feature — particularly when you are writing framework code.
A method decorator is a method that takes in a method as an argument and returns a mutated method. (pause to re-read that)
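To illustrate the mechanics, here is a tiny made-up decorator (not from bouncer): it takes a function in and returns a new function that wraps it.

```python
# A hypothetical decorator for illustration: `shout` receives the
# decorated function and returns a wrapper that post-processes its result.
def shout(original_method):
    def wrapper(*args, **kwargs):
        result = original_method(*args, **kwargs)
        return result.upper()
    return wrapper

@shout
def greet(name):
    return "hello {}".format(name)

print(greet("jonathan"))  # HELLO JONATHAN
```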
Let’s take a look at this specific implementation:
# in bouncer/__init__.py

def get_authorization_method():
    return _authorization_method

_authorization_method = None

def authorization_method(original_method):
    """The method that will be injected into the authorization target
    to perform authorization"""
    global _authorization_method
    _authorization_method = original_method
    return original_method
So in the instance of our authorization_method, we receive a function and store it in the global variable _authorization_method. We can make use of this function later in the application’s execution.
For example, in my talk I showed the can method:
jonathan = User(name='jonathan', admin=False)
marc = User(name='marc', admin=False)

article = Article(author=jonathan)

print can(jonathan, EDIT, article)  # True
print can(marc, EDIT, article)      # False

# Can Marc view articles in general?
print can(marc, VIEW, Article)      # True
can is defined as follows:
def can(user, action, subject):
    """Checks if a given user has the ability to perform the action on a subject

    :param user: A user object
    :param action: an action string, typically 'read', 'edit', 'manage'.
        Use bouncer.constants for readability
    :param subject: the resource in question. Either a Class or an instance
        of a class. Pass the class if you want to know if the user has general
        access to perform the action on that type of object. Or pass a specific
        object, if you want to know if the user has the ability to that
        specific instance
    :returns: Boolean
    """
    ability = Ability(user, get_authorization_method())
    return ability.can(action, subject)
When can is called, it builds an Ability using the logic in the method we decorated (stored) with @authorization_method.
Having said that, let me explain what they and they.can are.
# in bouncer/models.py

class RuleList(list):
    def append(self, *item_description_or_rule, **kwargs):
        # Will check if it is a Rule or a description of a rule;
        # construct a rule if necessary, then append
        if len(item_description_or_rule) == 1 and isinstance(item_description_or_rule[0], Rule):
            item = item_description_or_rule[0]
            super(RuleList, self).append(item)
        else:
            # try to construct a rule
            item = Rule(True, *item_description_or_rule, **kwargs)
            super(RuleList, self).append(item)

    # alias append
    # so you can do things like this:
    #
    # @authorization_method
    # def authorize(user, they):
    #
    #     if user.is_admin:
    #         # self.can_manage(ALL)
    #         they.can(MANAGE, ALL)
    #     else:
    #         they.can(READ, ALL)
    #
    #         def if_author(article):
    #             return article.author == user
    #
    #         they.can(EDIT, Article, if_author)
    can = append
RuleList is a Python list with two tweaks:
override append to handle input of either Rules or descriptions that can be constructed into Rules
alias append (can = append), which allows us to have the desired syntax they.can(READ, ALL)
I am pretty pleased with this; I really like the they.can(READ, ALL) syntax. Some may argue that it is not pythonic since I could be more explicit — but in this case I think ease of readability trumps style.
But if you don’t agree, no worries: you can use the following equivalent syntax:
@authorization_method
def authorize(user, abilities):
    if user.is_admin:
        abilities.append(MANAGE, ALL)
    else:
        abilities.append(READ, ALL)
        # See I am using a string here
        abilities.append(EDIT, 'Article', author=user)
Both work!
Hopefully this clarifies things. Feel free to ping me with additional questions.
Addendum
There has been a fair bit of discussion in my office about the grammatical correctness of they. Uncannily, xkcd comes to the rescue once again:
But that makes me want to barf. Luckily python gives us the tools to clean this up. We are going to use the Proxy pattern to solve it. At its simplest we can do something like so:
class Proxy(object):
    def __init__(self, local):
        self.local = local

    def __getattr__(self, name):
        return getattr(self.local(), name)

# aliasing for better syntax
module_property = Proxy

class User(object):
    """Contrived User Object"""
    def __init__(self, **kwargs):
        self.name = kwargs.get('name', 'billy')

    def speak(self):
        print("Well hello there!")

    def say_hi(self, to_whom):
        print("Hi there {}".format(to_whom))

@module_property
def current_user():
    return User()
With this we have come close to achieving our goal:
This simple Proxy class that we defined takes a function and stores it in a local variable; then, when an attribute is accessed, it calls the function and passes the attribute lookup through to the result. __getattr__ is a pretty special feature of Python.
The big gotcha with this is that current_user does not return a User object (like the built-in @property will return), it is going to return the Proxy object. So without a little bit of additional care you might run into issues.
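A quick self-contained sketch of that gotcha (Proxy is redefined here, and the contrived User is trimmed down, so the snippet runs on its own):

```python
# The proxy forwards attribute access, but it is NOT an instance of
# the proxied class -- isinstance checks will surprise you.
class Proxy(object):
    def __init__(self, local):
        self.local = local

    def __getattr__(self, name):
        # only called for attributes not found on the Proxy itself
        return getattr(self.local(), name)

class User(object):
    name = 'billy'

current_user = Proxy(lambda: User())

print(current_user.name)               # 'billy' -- forwarded just fine
print(isinstance(current_user, User))  # False   -- it is still a Proxy
```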
The werkzeug team has developed a fully featured Proxy within the werkzeug project. If you are using werkzeug, you can find it at: from werkzeug.local import LocalProxy
It takes the proxy pattern further by overriding all of the Python object methods, such as __eq__, __le__, __str__ and so on, to use the proxied object as the underlying target.
If you are not using werkzeug I have created a mini library where you can get the extracted proxy code. You can find it here: (http://github.com/jtushman/proxy_tools)
Or install it like so:
pip install proxy_tools
And use it like so:
# your_module/__init__.py
from proxy_tools import module_property

@module_property
def current_user():
    return User.find_by_id(request['user_id'])

# Then elsewhere

from your_module import current_user
print(current_user.name)
Now — I am sure there was a very good reason why the python-powers-that-be chose not to add the @property syntax to modules. But for the time being I have found it useful and elegant.
tl;dr: I forked the lettuce package to use multiprocessing; tests run more than 4x faster on my MBP
I am a fan of Gabriel Falcão’s lettuce Behavior-Driven Development (BDD) tool. We have been using it on my team for 6+ months now. Recently our test suite completion time has crossed the 10 minute line, which had a bunch of negative effects, as you can imagine:
people writing fewer tests
people running the test suite less frequently
people spending more time watching a test suite run than coding, …
We are all using relatively modern MBPs with 4 cores, and we might as well make the most of them. Here is my fork of lettuce that allows you to take advantage of all of your cores:
I have made two main modifications (you will find the lion’s share of my modifications in this file):
I created a ParallelRunner (I have left the main runner alone), which kicks off processes to pull the scenarios off a queue
After each run I store the run times of each test in a .scenarios file, so in subsequent runs I can sort them longest to shortest
My test suite used to take 12 minutes; now it takes 2 minutes — REJOICE!
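The scheduling idea behind the .scenarios file can be sketched in a few lines (the function and variable names here are made up for illustration; the real bookkeeping lives in the fork):

```python
# Longest-first ordering: start the slowest scenarios first so no worker
# is left grinding through a long scenario while the others sit idle.
def order_for_workers(timings):
    """timings maps scenario name -> last recorded run time in seconds.
    Returns scenario names sorted longest-first."""
    return [name for name, seconds in
            sorted(timings.items(), key=lambda kv: kv[1], reverse=True)]

recorded = {'checkout': 95.0, 'signup': 12.5, 'search': 41.0}
print(order_for_workers(recorded))  # ['checkout', 'search', 'signup']
```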
Usage
lettuce tests -p 4 -v 2
-p: stands for parallel. You can set it to as many processes as you like; I find that the number of cores should be your default
-v: is the same verbosity parameter, but I recommend setting it to 2 when using parallelization, otherwise the steps will interlace and not make much sense
In your terrain.py file, there are two new callbacks:
@before.batch and @after.batch
which you should use to set up and tear down each process. I use mine to fire up flask, selenium and mongo. Also note that I set a port_number attribute on world, which you can use to set up process-specific servers. For example:
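A sketch of what that could look like (the lettuce wiring is in comments; BASE_PORT and start_flask are made-up names for illustration):

```python
# Hypothetical terrain.py sketch. world.port_number is set per worker
# process by the parallel runner; BASE_PORT and start_flask are
# illustrative names, not part of the fork's API.

BASE_PORT = 5000

def server_port(port_number):
    """Give each worker process its own server port."""
    return BASE_PORT + port_number

# In terrain.py the wiring would look roughly like:
#
# from lettuce import before, after, world
#
# @before.batch
# def start_server():
#     world.server = start_flask(port=server_port(world.port_number))
#
# @after.batch
# def stop_server():
#     world.server.stop()

print(server_port(2))  # 5002
```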
For this to work all of your tests need to be isolated; they cannot depend on each other (which I think is best practice anyway). This means that in your tests you should not use world at all. Use scenario instead:
To do this, in your terrain file add the following:
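Something along these lines should work: a shared scenario object that gets wiped before every scenario, so steps never leak state into each other. (The lettuce hook name in the comments is an assumption; ScenarioState is a made-up helper.)

```python
# Hypothetical terrain.py sketch: per-scenario state that is cleared
# before each scenario runs.

class ScenarioState(object):
    """Holds state for a single scenario run."""
    pass

scenario = ScenarioState()

def reset_scenario_state():
    # wipe everything steps have stashed on the shared object
    scenario.__dict__.clear()

# Assumed lettuce wiring:
#
# from lettuce import before
#
# @before.each_scenario
# def clear_state(current_scenario):
#     reset_scenario_state()

scenario.current_user = 'jonathan'
reset_scenario_state()
print(hasattr(scenario, 'current_user'))  # False
```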
And I use this all the time in my steps to refer to state from previous steps:
@step(u'Given a user exists with one account')
def given_a_user_exists(step):
    scenario.current_user = UserFactory.create()

@step(u'And the user has a dog')
def user_has_a_dog(step):
    scenario.current_user.dog = DogFactory.create()
There is more than one parallelization framework / paradigm out there for Python. Make sure you pick the right one for you before you dive in. To name a few:
from multiprocessing import Process, Manager
from time import sleep

def f(process_number):
    try:
        print "starting thread: ", process_number
        while True:
            print process_number
            sleep(3)
    except KeyboardInterrupt:
        print "Keyboard interrupt in process: ", process_number
    finally:
        print "cleaning up thread", process_number

if __name__ == '__main__':
    processes = []
    manager = Manager()
    for i in xrange(4):
        p = Process(target=f, args=(i,))
        p.start()
        processes.append(p)
    try:
        for process in processes:
            process.join()
    except KeyboardInterrupt:
        print "Keyboard interrupt in main"
    finally:
        print "Cleaning up Main"
The abbreviated output you get is as follows:
^C
Keyboard interrupt in process: 3
Keyboard interrupt in process: 0
Keyboard interrupt in process: 2
cleaning up thread 3
cleaning up thread 0
cleaning up thread 2
Keyboard interrupt in process: 1
cleaning up thread 1
Keyboard interrupt in main
Cleaning up Main
The main takeaways are:
the keyboard interrupt gets sent to each subprocess and to the main execution
the order in which they run is non-deterministic
Axiom Two: Beware multiprocessing.Manager (when it is time to share memory between processes)
If it is possible in your stack to rely on a database, such as redis, for keeping track of shared state — I recommend it. But if you need a pure-python solution, read on:
Managers provide a way to create data which can be shared between different processes. A manager object controls a server process which manages shared objects. Other processes can access the shared objects by using proxies.
The key takeaway there is that the Manager actually kicks off a server process to manage state. It is like firing up your own little (not battle-tested) private database. And if you Ctrl-C your python process, the manager will get the signal and shut itself down, causing all sorts of weirdness.
from multiprocessing import Process, Manager
from time import sleep

def f(process_number, shared_array):
    try:
        print "starting thread: ", process_number
        while True:
            shared_array.append(process_number)
            sleep(3)
    except KeyboardInterrupt:
        print "Keyboard interrupt in process: ", process_number
    finally:
        print "cleaning up thread", process_number

if __name__ == '__main__':
    processes = []
    manager = Manager()
    shared_array = manager.list()
    for i in xrange(4):
        p = Process(target=f, args=(i, shared_array))
        p.start()
        processes.append(p)
    try:
        for process in processes:
            process.join()
    except KeyboardInterrupt:
        print "Keyboard interrupt in main"
    for item in shared_array:
        # raises "socket.error: [Errno 2] No such file or directory"
        print item
Try running that and interrupting it with a Ctrl-C. You will get a weird socket.error: [Errno 2] No such file or directory when trying to access the shared_array. And that’s because the Manager process has been interrupted.
There is a solution!
Axiom Three: Explicitly use multiprocessing.managers.SyncManager to share state
and use the signal module to have the SyncManager ignore the interrupt signal (SIGINT)
from multiprocessing import Process
from multiprocessing.managers import SyncManager
import signal
from time import sleep

# initializer for SyncManager
def mgr_init():
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    print 'initialized manager'

def f(process_number, shared_array):
    try:
        print "starting thread: ", process_number
        while True:
            shared_array.append(process_number)
            sleep(3)
    except KeyboardInterrupt:
        print "Keyboard interrupt in process: ", process_number
    finally:
        print "cleaning up thread", process_number

if __name__ == '__main__':
    processes = []
    # now using SyncManager vs a Manager
    manager = SyncManager()
    # explicitly starting the manager, and telling it to ignore the interrupt signal
    manager.start(mgr_init)
    try:
        shared_array = manager.list()
        for i in xrange(4):
            p = Process(target=f, args=(i, shared_array))
            p.start()
            processes.append(p)
        try:
            for process in processes:
                process.join()
        except KeyboardInterrupt:
            print "Keyboard interrupt in main"
        for item in shared_array:
            # we still have access to it! Yay!
            print item
    finally:
        # to be safe -- explicitly shutting down the manager
        manager.shutdown()
The main takeaways here are:
explicitly use and start a SyncManager (instead of a Manager)
on its initialization, have it ignore the interrupt
I will do a future post on gracefully shutting down child processes (once I figure that out ;–)
Thanks to @armsteady, who showed me the light on StackOverflow (link)
In the age of SaaS and 3rd-party APIs, developers often have to navigate complex objects (arrays of hashes of arrays of hashes). (I am looking at you, AdWords API.)
I wanted a nice way to avoid doing None checks and “does this key exist” checks over and over again.
So I made a (very) simple utility to help with it: dict_digger
import dict_digger

h = {'a': {'b': 'tuna', 'c': 'fish'}, 'b': {}}

result = dict_digger.dig(h, 'a', 'b')
print result  # prints 'tuna'

result = dict_digger.dig(h, 'c', 'a')
print result  # prints None
# Important!! Does not throw an error, just returns None

# but if you like
result = dict_digger.dig(h, 'c', 'a', fail=True)  # raises a KeyError

# also supports complex objects so ...
complex = {'a': ['tuna', 'fish'], 'b': {}}
result = dict_digger.dig(complex, 'a', 0)
print result  # prints 'tuna'
I think it is good to shuffle the team around. It helps with cross-pollination, and keeps the team area neat. Here is the function that we use to randomize our team, making sure that you do not end up sitting next to someone you were already sitting next to.
Note: this only works with teams greater than four. Assign each space in your office a number, then run the following. The first person in the output array goes in space 1, and so on.
import random

def all_perms(elements):
    if len(elements) <= 1:
        yield elements
    else:
        for perm in all_perms(elements[1:]):
            for i in range(len(elements)):
                # nb elements[0:1] works in both string and list contexts
                yield perm[:i] + elements[0:1] + perm[i:]

def find_position(key, lizt):
    return [i for i, x in enumerate(lizt) if x == key][0]

def new_neighbors(some_list):
    list_size = len(some_list)
    for new_neighbor_list in all_perms(some_list):
        print new_neighbor_list
        too_many_neighbors = False
        for i, team_member in enumerate(new_neighbor_list):
            # find position in initial list
            position_in_original_list = find_position(team_member, some_list)
            original_neighbors = []
            original_neighbors.append(some_list[(position_in_original_list + 1) % list_size])
            original_neighbors.append(some_list[(position_in_original_list - 1) % list_size])
            new_neighbors = []
            new_neighbors.append(new_neighbor_list[(i + 1) % list_size])
            new_neighbors.append(new_neighbor_list[(i - 1) % list_size])
            delta = len(set(new_neighbors) - set(original_neighbors))
            # print "for {} comparing: {} with {} = {}".format(team_member, original_neighbors, new_neighbors, delta)
            if not delta == 2:
                too_many_neighbors = True
                break
        if too_many_neighbors == False:
            return new_neighbor_list
    print "No Matches"
    return []

# Usage
team = ['JT', 'FS', 'MC', 'MA', 'FD']
new_seating = new_neighbors(team)
print new_seating
# >> ['MC', 'JT', 'MA', 'FS', 'FD']
To end with a quote to motivate:
Everyday I’m shufflin’ — LMFAO
(you can play that music as you are shufflin’ seats)