WalkDir is my Python support library that aims to make it as easy to work with filtered directory listings as it is to walk over entire directory trees with os.walk().
The module's design tries to take full advantage of Python's iterator model - most of its functionality is provided by pipelined iterators that accept os.walk() style iterables and expose the same interface themselves.
The only major functional change in version 0.3 is that these pipelined iterators now make sure they pass along the objects produced by the underlying iterators, and only use indexing operations to access the individual fields. Previously they would use tuple unpacking to access the directory details, which restricted the supported types to those with exactly 3 fields and also had the side effect of replacing the underlying objects with ordinary 3-tuples.
I changed this mainly due to a new OS interface that is likely to be coming in Python 3.3: an os.walk() variant that produces a 4-tuple rather than a 3-tuple. The 4th value will be a file descriptor for the directory, making it easier (in conjunction with new file descriptor based APIs in the 3.3 os module) to write filesystem modification code that is robust against symlink attacks. By passing the underlying objects through unmodified, WalkDir is now compatible with this API - all the path based filtering will still work, but the file descriptor values will also be passed along correctly.
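The difference between the two access styles can be sketched roughly as follows. This is an illustration rather than WalkDir's actual code: include_files and FDEntry are hypothetical names, and the 4-field entry merely stands in for the proposed os.walk() variant.

```python
import fnmatch
from collections import namedtuple


def include_files(walk_iter, pattern):
    """Keep only matching file names in each os.walk() style entry.

    Only indexing is used, so entries with extra fields pass through intact.
    """
    for entry in walk_iter:
        files = entry[2]  # index access instead of `root, dirs, files = entry`
        files[:] = [name for name in files if fnmatch.fnmatch(name, pattern)]
        yield entry  # the very same object the underlying iterator produced


# Hypothetical 4-field entry standing in for the proposed os.walk() variant
FDEntry = namedtuple("FDEntry", "dirpath dirnames filenames dirfd")

entries = [FDEntry("/tmp/demo", ["sub"], ["a.py", "b.txt"], 3)]
filtered = list(include_files(entries, "*.py"))
```

With tuple unpacking, the loop header would fail on the 4-field entry; with indexing, the extra dirfd field simply rides along untouched.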
For those that haven't seen any of my previous comments on WalkDir, the other parts of the API are just there for convenience - one factory function that constructs pipelines for you, and 3 terminal iterators that flatten out the os.walk() style triples into a simple series of paths (all paths, just the visited directories or just the file paths).
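As a minimal sketch of what "flattening" means here, the generator below turns os.walk() style entries into a plain series of file paths. The name iter_file_paths is an assumption made for illustration, not necessarily what WalkDir calls its own iterator.

```python
import os.path


def iter_file_paths(walk_iter):
    """Flatten os.walk() style entries into a series of file paths."""
    for entry in walk_iter:
        dirpath, filenames = entry[0], entry[2]  # indexing, not unpacking
        for name in filenames:
            yield os.path.join(dirpath, name)


# Hand-written entries in place of a real os.walk() call
sample = [
    ("top", ["sub"], ["a.txt"]),
    (os.path.join("top", "sub"), [], ["b.txt", "c.txt"]),
]
paths = list(iter_file_paths(sample))
```

Passing os.walk(some_directory) instead of the hand-written sample should work the same way.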
The other notable change in 0.3 is the list of officially supported versions. Previously, the module was only known to work on 2.7 and 3.2+ (since they're the versions I have on my home development machine). However, thanks to a free open source account provided by the folks at Shining Panda, WalkDir 0.3 is known to work on Python 2.6, 2.7 and 3.1+ (I even test it on PyPy and Stackless, just because I can). After pushing a broken package to PyPI for 0.2, I even have a sanity check I can run that ensures the module can be downloaded with pip and then imported on all the supported versions.
Boredom & Laziness
There are a couple of very, very scary things in this world.
The first is a bored human. Bored humans have time to indulge their curiosity, with potentially amazing results.
The second is a lazy human. Lazy humans can be quite inventive when it comes to figuring out how to do less work.
So, here's to boredom & laziness - two of the prime movers in human progress!
Tuesday, January 31, 2012
Wednesday, January 18, 2012
Using the SOPA protests to highlight related problems in Australia
I figure this is the easiest place to publish the message I just sent to Larissa Waters, the Greens Senator who is one of Queensland's representatives in the Federal Senate. I also wrote to Yvette D'ath (our local MHR) a few days ago, but I didn't keep a copy of that one. Will this achieve anything? Probably not, but hey, at least I tried (and if none of their constituents ever write to them about it, our reps are quite entitled to assume we're all OK with them selling out the country to legacy US media interests):
Senator Waters,
With today being the day Wikipedia and a wide range of other sites have either gone dark or taken other action to protest draconian internet censorship legislation making its way through the US Congress, it seems an opportune time to highlight our own government's ongoing concerning behaviour on that front.
Of particular concern is their continuing refusal to release details of a secretive meeting between government representatives and representatives of the same organisations that are behind the draconian US bills currently being protested. The government even deliberately excluded representatives of a number of community interest organisations that sought to attend these discussions.
These legacy media companies (aka horse drawn carriage manufacturers) are flailing around wildly as the rise of free and open digital communications networks (aka automobiles) threatens the cherished gatekeeper role they have enjoyed for the past few decades as media distributors. They have failed to adapt, and are increasingly being bypassed as artists, writers, musicians, comedians and other media creators find ways to use the power of the internet to connect more directly with their fans. These direct connections are great for both artists and fans, but place the intermediaries like YouTube, Apple iTunes, Amazon, BandCamp, Flickr, etc, in the role of service providers to the artists and fans rather than gatekeepers to widespread distribution. Unfortunately, instead of going gracefully into that good night, these organisations are investing inordinate sums of money worldwide in lobbying for legislation that would make the permissive, open practices of most of these new service providers a recipe for prohibitively high legal liabilities, effectively making those practices unsustainable and thus breaking the internet as we know it today.
Australia already markedly shifted many intellectual monopoly policies to favour the interests of US copyright holders at the expense of Australian citizens when we signed the US-FTA some time ago. We have also participated in the secretive process of drafting the Anti-Counterfeiting Trade Agreement, which spends far more time considering digital copyright infringement than it does actual counterfeiting. The current negotiations over membership in the Trans-Pacific Partnership agreement raise legitimate fears that Australia's intellectual monopoly policy will be shifted even further towards the draconian position of the United States Trade Representative, even as those policies are being protested strongly within the US itself.
In line with your published policy on community participation in government, do the Greens plan to publicly question the government over their apparent willingness to place the interest of large US companies ahead of those of individual Australian citizens?
Regards,
Nick.
Thursday, December 22, 2011
New Year Python Meme - December 2011
I'm normally a curmudgeon about this kind of thing, but I enjoyed reading some of the other posts in this series Tarek kicked off, so I decided to make my own contribution.
1. What's the coolest Python application, framework or library you have discovered in 2011?
The move to Red Hat marked my entry into the world of web development (previously I'd merely been an interested observer of that world, rather than a participant). By far my favourite discovery since making that change is django-rest-framework - with that, I can use my web browser to browse early iterations of my server's REST API directly, without needing to write custom clients to process the JSON data from APIs that are still in a state of flux.
As a service, ReadTheDocs has been an absolute revelation - between that, code hosting & issue management sites like BitBucket and GitHub and of course PyPI itself, it's now possible for an open source project to have a quite respectable web presence without the developers needing to understand anything more than Sphinx, source control and the project they're working on.
2. What new programming technique did you learn in 2011?
REST would be the big one. I'd had some general exposure to the concept in the past, but there's no substitute for sitting down and building it into a product when it comes to understanding a programming or API design technique.
3. What's the name of the open source project you contributed the most in 2011? What did you do?
CPython, by far - kibitzing on python-dev and python-ideas (and import-sig too these days), writing and reviewing several different PEPs, documentation updates, code reviews and patch applications, as well as working on my own things (including the still-in-progress integration work for the 'yield from' expression that's coming in 3.3).
I also recently started up 4 separate open source projects - 3 PyPI modules to hopefully address deficiencies I see in the current standard library offerings, plus the upstream open source project for my current development efforts at Red Hat:
- contextlib2 (ContextStack has some potential as a new building block)
- WalkDir (the idea here is to be the "itertools for os.walk()")
- Shell Command (let Python handle control flow, the shell actual commands)
- PulpDist (Bringing a semblance of order to small-scale rsync mirror networks)
4. What was the Python blog or website you read the most in 2011?
Planet Python.
5. What are the three top things you want to learn in 2012?
From a work point of view, getting my RHCSA (Red Hat Certified System Administrator) is at the top of the list. Coming up to speed on AMQP (Advanced Message Queuing Protocol) is a close second. Finally, I want to fill in more of the gaps in my very sketchy knowledge of web UI development (i.e. HTML/CSS/Javascript).
6. What are the top software, app or lib you wish someone would write in 2012?
I want to see the __preview__ namespace (in particular, the regex module) make it into Python 3.3. But that requires a volunteer to step up and write the PEP, write the code and generally champion the idea (if we have to wait for me to do it, there's no way it will happen before 3.4).
Want to do your own list? Here's how:
- copy-paste the questions and answer them on your blog
- tweet it with the #2012pythonmeme hashtag
Friday, December 16, 2011
Help improve the Python 3.3 Standard Library...
... and hopefully help yourself with current programming projects, too.
Some recent programming activities left me underwhelmed by a few of the standard library's included batteries. This has already led to a significant revamp of the subprocess module documentation to steer new users away from the Popen Swiss army knife (unless they really need it) and to explain the commonly needed parameters more clearly. It still needs work (the notes and warnings are far too repetitive), but it at least introduces things in the right order now (high level convenience API that most people want first, lower level Popen API that some people need second).
However, for 3.3 I'd like to improve things even more in at least three areas: invocation of the system shell for administration tasks, better tools for traversing filesystem directories and programmatic management of deterministic resource cleanup (i.e. not relying on the garbage collector).
Accordingly, I have 3 projects up on PyPI (with docs on ReadTheDocs and source control and issue tracking on BitBucket):
- WalkDir: os.walk() style iterators with file and directory filtering (both inclusion and exclusion), depth limiting and symlink loop detection, as well as convenience iterators to flatten os.walk() style iterators into a series of paths (either all walked paths, just the directories or just the files). I currently plan to make (at least some of) these part of the shutil module, but exactly what gets added will be based on the feedback I receive on this module and its API design.
- Shell Command: Convenience APIs that combine subprocess invocation with string interpolation. Interpolated strings are escaped with shlex.quote() by default, with a custom conversion specifier ("!u", for unquoted) used to invoke the standard interpolation process. It also features an experimental API where I'm tinkering with the use of select.select() on subprocess pipes (I'm not sure it achieves a lot over simple blocking IO in its current form, though). The current plan for this API is that it will be added directly to the subprocess module (well, the stable and sensible parts will be, anyway - I still have my doubts about the select.select() experiment).
- contextlib2: This module basically exists to let me publish and gather feedback on ContextStack, a proposed addition to contextlib for 3.3 that should make it easier to manage deterministic resource cleanup programmatically (i.e. without coupling it as directly to code layout as simple with statements do).
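To make the quoting idea behind Shell Command a little more concrete, here is a rough sketch of interpolation that escapes values with shlex.quote() unless the "!u" conversion is given. The names QuotingFormatter and format_shell are assumptions made for this example and are not claimed to match the module's real API; the sketch only builds the command string and does not run anything.

```python
import shlex
import string


class QuotingFormatter(string.Formatter):
    """Quote interpolated values for the shell unless '!u' is requested."""

    def convert_field(self, value, conversion):
        if conversion == "u":
            return str(value)  # explicitly unquoted
        if conversion is None:
            return shlex.quote(str(value))  # default: shell-escaped
        # '!r', '!s' and '!a' keep their standard meaning
        return super().convert_field(value, conversion)


def format_shell(template, *args, **kwargs):
    """Hypothetical helper: build a shell command string from a template."""
    return QuotingFormatter().format(template, *args, **kwargs)


safe = format_shell("ls -l {0}", "my file; rm -rf ~")
raw = format_shell("echo {0!u}", "$HOME")
```

The resulting string could then be handed to something like subprocess.call(cmd, shell=True), with the quoting keeping the interpolated value from being read as extra shell syntax.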
Feedback on any and all of these is appreciated, either here or on the respective issue trackers. It isn't a foregone conclusion that any of these APIs will be added at all, so examples of real world use cases would definitely be helpful.
Saturday, October 08, 2011
Correcting ignorance: learning a bit about Ruby blocks
Gary Bernhardt pointed out at PyCodeConf that I didn't know Ruby even half as well as I should if I wanted to really understand why Ruby programmers rave about blocks so much (I started this before his talk, but it touches on his key point about the centrality of blocks to Ruby's design, and Python's lack of a similarly endemic model for code interleaving). So I set about trying to fix that (at least, to the extent I can in 24 hours or so). Unsurprisingly (since I'm not interested in becoming a Ruby programmer at this point in time), I approached this task more in terms of what it could teach me about Python (and its limitations) rather than in figuring out the full ins and outs of idiomatic Ruby. So feel free to bring it up in the comments if you think I've fundamentally mischaracterised some aspect of Ruby here.
The first distinction: two kinds of function
It turns out the first distinction shows up at quite a fundamental level. Ruby has two kinds of function: named methods and anonymous procedures. The semantics of these are quite different, most notably that named methods create their own local namespace, while anonymous procedures just use the namespace of the method that created them (so they're almost like ordinary local code).
Python also has two kinds of function: ordinary functions and generator functions. The name binding semantics are identical, but the invocation style and semantics are very different. Lambda expressions and generator expressions provide syntax for defining these inside an expression, but under the hood the semantics are still the same as those of the statement versions.
The closest you can get to a Ruby style anonymous procedure in Python is to create a named inner function and declare every otherwise local variable explicitly 'nonlocal' (in Python 3 - nonlocal declarations aren't available in Python 2). Then all name binding operations in the inner scope would also affect those names in the outer scope.
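A small sketch of that nonlocal approach (Python 3 only; count_matches and visit are made-up names for illustration):

```python
def count_matches(items, predicate):
    count = 0

    def visit(item):
        # Without this declaration, `count += 1` would try to bind a new
        # local name inside visit() and fail with UnboundLocalError.
        nonlocal count
        if predicate(item):
            count += 1

    for item in items:
        visit(item)
    return count


total = count_matches(range(10), lambda n: n % 3 == 0)
```

Every name the inner function rebinds has to be declared this way, which is a large part of why this only approximates a Ruby style anonymous procedure.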
Actually, make that three kinds of function
The named method vs anonymous procedure distinction actually doesn't fully capture Ruby's semantics. Blocks (which is what I was most interested in learning about) add a new set of semantics that don't apply to the full object versions: they not only use the namespace of the defining method for their local variables, but their parameters are pass-by-reference (so they can rebind names in the calling namespace) and their control flow can affect the calling method (i.e. a return from a block will cause the calling method to return, not just the block itself). While somewhat interesting, I don't think these are actually all that significant - the core semantic difference is the one between Ruby's anonymous closures and Python's generators, not the dynamic binding behaviour of blocks.
The implications: blocks versus coroutines
This initial difference in the object model for code execution has created a fundamental difference in the way the two languages approach the problem of interleaving distinct pieces of code. The Ruby way is to define a separate piece of the current function that can be passed to other code and invoked as if it was still inline in its original location, then resuming execution when the called operation is complete. The Python way is to suspend execution, hand control back to the invoking piece of code, and then resume execution of the current code block at a later time (as determined by the invoking code).
Hence, where Ruby has specific syntactic sugar for passing a block of code to another method (do-end), Python instead has syntactic sugar for various invocation styles for coroutines (iteration via for loops, transactional code via with statements).
It's also the case that coroutines are not (yet) as deeply bound into Python's semantics as blocks are into Ruby. Whereas Ruby had blocks from the beginning and defined key programming constructs in terms of them (such as iteration and transactional style code via blocks), Python instead is built around various task specific protocols that may *optionally* be implemented in terms of coroutines (e.g. for loops, the iterator protocol and generators, the with statement, the context manager protocol and the contextlib.contextmanager decorator applied to a generator).
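The two invocation styles mentioned above can be sketched with a pair of generators, one driven by a for loop and one wrapped by contextlib.contextmanager and driven by a with statement. The names countdown, logged and events are invented for the example.

```python
from contextlib import contextmanager

events = []


def countdown(n):
    """Generator driven by the iterator protocol (i.e. a for loop)."""
    while n > 0:
        yield n  # suspend here and hand a value back to the loop
        n -= 1


@contextmanager
def logged(name):
    """Generator driven by the context manager protocol (a with statement)."""
    events.append("enter " + name)
    try:
        yield name  # suspend here while the body of the with statement runs
    finally:
        events.append("exit " + name)


for value in countdown(3):
    events.append(value)

with logged("txn") as label:
    events.append("body " + label)
```

In both cases the generator is suspended at its yield while the calling code runs, then resumed afterwards, which is the interleaving that Ruby gets by passing a block instead.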
Callback programming and hidden control flow
One interesting outgrowth of the Ruby approach is that callback programming actually becomes a fairly natural extension of the way the language works - since programming with blocks is callback style programming, the invoking code doesn't really care if the called method runs the passed in block immediately or at some later time. Whether you consider this a good thing or a bad thing is going to depend on how you feel about the merits and dangers of hidden control flow.
During the discussions that led to the introduction of the with statement in Python 2.5, Guido made a clear, conscious design decision: he wanted the possible flows of control through the function body to be visible locally inside a function, without being dependent on the definitions of other methods (raising exceptions, of course, being an exception - catching them, though, largely obeys this guideline). Most code is run immediately, code in if statements and exception handlers is run zero or one times, code in loops is run zero or more times, code in nested function definitions is executed at some later time when the function is called. The Ruby blocks design is the antithesis of this: your control flow is entirely dependent on the methods you call. The downside of wanting visible control flow, of course, is that iteration, transactional code and callback programming all end up looking different at the point of invocation. (If you read PEP 340, Guido's original proposal for what eventually became the with statement, and contrast it with PEP 343, the version that we finally implemented, you'll see that his original idea was a fair bit closer to Ruby's blocks in power and scope).
So Ruby's flexibility comes at a price: when you pass a block to a method, you need to know what that method does in order to know how it affects your local control flow. Naming conventions can help reduce that complexity (such as the .each convention for iteration), but it does move control flow into the domain of programming conventions rather than the language definition.
On the other hand, Python's choice of explicit control flow comes at a price in flexibility: callback programming looks starkly different to ordinary programming as you have to construct explicit closures in order to pass chunks of code around.
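As a rough illustration of those explicit closures, the sketch below defers some code by wrapping it in a named inner function. call_later is a hypothetical stand-in for whatever scheduling function a real event loop would provide.

```python
pending = []  # callbacks waiting to run


def call_later(callback):
    """Hypothetical stand-in for an event loop's scheduling function."""
    pending.append(callback)


def greet_all(names, output):
    for name in names:
        # The code to run later has to be packaged as a named closure;
        # the default argument captures the current value of `name`.
        def greet(name=name):
            output.append("hello " + name)
        call_later(greet)


output = []
greet_all(["alice", "bob"], output)
ran_immediately = list(output)  # still empty: nothing has been called yet
for callback in pending:
    callback()
```

The def statement makes it visible at the call site that this code runs at some later time, at the cost of looking nothing like the equivalent inline loop body.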
Two way data flow
With their functional API, blocks natively supported two-way data flow from the beginning: data was passed in by calling them, and then either returned as the result of the block or by manipulating the passed in name bindings.
By contrast, Python's generators were originally output only, reflecting their target use case of iteration. You could input some initial data via parameters, but couldn't readily supply data to a running calculation. This has started to change in recent years, as generators now provide send() and throw() methods to pass data back in, and yield became an expression in order to provide access to the 'send()' argument. However, these features do not, at this stage, have deep syntactic support - there's a fairly obvious mapping from continue to send() and break to throw() that would tie them into the for loop syntax, but this capability has not garnered significant support when it has been brought up (I believe because it doesn't really help with the last major code execution model that Python doesn't provide nice native support for: callback programming).
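A short sketch of send() and throw() in action, using a running average as the "running calculation". The Reset exception and the running_average generator are made up for the example.

```python
class Reset(Exception):
    """Hypothetical signal asking the generator to start over."""


def running_average():
    total, count, average = 0.0, 0, None
    while True:
        try:
            value = yield average  # yield is an expression: send() supplies its result
        except Reset:  # raised at the yield point by throw()
            total, count, average = 0.0, 0, None
            continue
        total += value
        count += 1
        average = total / count


averager = running_average()
next(averager)  # advance to the first yield before sending data
first = averager.send(10)
second = averager.send(20)
after_reset = averager.throw(Reset())
third = averager.send(4)
```

Note that all of this is driven by explicit method calls; a for loop over the generator has no way to pass values back in.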
In Python 3.3, generators will gain the ability to return values, and better syntax for invoking them and getting that value, moving the language even further towards full coroutine support (see PEP 380 for details). However, that is merely the next step along the path rather than arrival at the destination.
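For the PEP 380 features, a minimal sketch (requiring Python 3.3 or later) looks like this: the inner generator returns a value, and 'yield from' both delegates send() calls to it and captures that return value. The generator names are invented for illustration.

```python
def collect_until_none():
    """Gather sent values until None arrives, then return them as a list."""
    items = []
    while True:
        item = yield
        if item is None:
            return items  # generators can return values as of PEP 380
        items.append(item)


def collect_batches(results):
    """Delegate to a fresh subgenerator for each batch of values."""
    while True:
        # send() calls pass straight through to the subgenerator, and its
        # return value becomes the value of the `yield from` expression
        batch = yield from collect_until_none()
        results.append(batch)


results = []
collector = collect_batches(results)
next(collector)  # advance to the first yield
for item in [1, 2, None, 3, None]:
    collector.send(item)
```

Without 'yield from', the outer generator would have to forward each send() by hand and dig the result out of the StopIteration exception itself.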
Reinventing blocks
I think the folks who accuse us of (slowly) reinventing blocks have a valid point - Python really is on the path of devising ways to handle tasks neatly with coroutines (i.e. functions that can be suspended and transparently resumed later without losing any internal state) that Ruby handles via blocks (i.e. extracting arbitrary fragments of a function body and passing them to other code). The fact that generators were not built into Python from the outset but instead have been added later to make certain kinds of code easier to write does show through in a variety of ways - coroutine based code often doesn't play nicely with ordinary imperative code and vice-versa.
Ruby's way has a definite elegance to it (despite the hidden control flow). I think aspiring to that kind of elegance for callback programming in Python would be a good thing, even if the semantic model is completely different (i.e. coroutine based rather than block based). The addition of actual block functionality remains unlikely, however - if they were as powerful as Ruby blocks, then it would create two ways to do too many things (with no obvious criteria to choose between the current technique and the block based technique), but if they were strictly less powerful, then reusing Ruby's block terminology would likely be confusing rather than enlightening. For better or for worse, Python is now well down the path of coroutine based programming and we likely need to see how far we can take that model rather than trying to shoehorn in yet another approach.
Wednesday, September 28, 2011
Spinning up the pulpdist project
One novel aspect of the pulpdist project is that it is starting with an almost completely blank slate from a technology point of view (aside from the decision to use Pulp as the main component of the mirroring network). Red Hat does have development standards for internal projects, of course (especially in the messaging space), but they're fairly flexible, leaving the individual tool development teams with a lot of options. If something ships with Fedora and/or RHEL, or is available under licensing terms that would be acceptable for inclusion in Fedora (and subsequently RHEL), then it's fair game.
This post focuses on the design of the management server. I'll write up a separate post looking at the currently planned design for the Pulp data transfer plugins.
Source Control
Unsurprisingly, Red Hat's internal processes are heavily influenced by Linux kernel processes. Accordingly, the source control tool of choice for new projects is Git. While I have a slight preference for Mercurial (due mainly to familiarity), I'm happy enough with any DVCS, so Git it is.
Primary Development Language
Python, of course. You don't hire a CPython core developer to get them to work on a Ruby or Perl project (although the current system I'm replacing was written in Perl). As a web application, there will naturally be some Javascript and CSS involved as well.
Web Framework
The main management application for pulpdist is going to be a full-scale web application. User profiles and authentication, database storage, communication with other web services, provision of a REST API, integration with the engineering tools messaging bus. Basically, micro-frameworks need not apply.
While I expect Pyramid/Pylons would also have been able to do the job, I decided to go with Django 1.3. This was heavily influenced by social factors: I know a lot of Django devs that I can bug for advice, but the same is not true for Pyramid. The complexity of the whole Pyramid/Pylons/TurboGears setup is also not appealing - while veteran web developers may find the "you decide" approach a selling point, Django's batteries included approach makes it far simpler to get started quickly, and decide as I go along which pieces I should keep, discard or replace.
I've heard some experienced Django developers muttering complaints about the class based views design in 1.3, but as someone coming in that is an experienced Python developer, but a relatively noobish web developer, the CBV approach seems eminently sensible, while the old function based approach looks repetitive and insane. Object oriented programming was invented for a reason!
I'll admit that my perception may be biased by knowing exactly how to make multiple inheritance work the way I want it to, though :)
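To give a rough idea of why the mixin style appeals to me, here is a plain Python sketch of cooperative multiple inheritance. It deliberately doesn't import Django: BaseView, TitleMixin, UserMixin and SiteListView are all hypothetical names made up for the illustration, not part of Django's class based view API.

```python
class BaseView(object):
    """Stand-in for a framework base view (hypothetical, not Django's)."""

    def get_context(self):
        return {}

    def render(self):
        context = self.get_context()
        return ", ".join("%s=%s" % item for item in sorted(context.items()))


class TitleMixin(object):
    title = "untitled"

    def get_context(self):
        # Cooperative call: defer to the next class in the MRO
        context = super(TitleMixin, self).get_context()
        context["title"] = self.title
        return context


class UserMixin(object):
    user = "anonymous"

    def get_context(self):
        context = super(UserMixin, self).get_context()
        context["user"] = self.user
        return context


class SiteListView(TitleMixin, UserMixin, BaseView):
    # Each mixin contributes its piece; the view only states what differs
    title = "Sites"


print(SiteListView().render())
```

Each mixin adds its own piece of behaviour via super(), so a concrete view only has to declare what is different about it rather than repeating the common steps in every function.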
Web Server
The management server doesn't actually have that much work to do, so the basic Apache+mod_wsgi configuration will serve as an adequate starting point (any heavy lifting will be done by the individual Pulp instances, and the main data traffic on those doesn't run through their web service). WSGI provides the flexibility to revisit this later if needed.
I've also punted on any web caching questions for now - the management server is low traffic and once the access to the Pulp sites is pushed out to a backend service, it should be fast enough at least for the early iterations.
Authentication & Authorisation
The actual user authentication task will be handed off to Apache and all management application access will be restricted to Kerberos authenticated users over SSL. Django's own permissions system will be used to handle authorisation restrictions. (The experimental prototype will use Basic Auth instead, since it is the Apache/Django integration the prototype needs to cover, not the Apache configuration for SSL and Kerberos authentication.)
Integration with Pulp's user access controls is via OAuth, but the design for configuration of user permissions in the Pulp servers is still TBD.
Database and ORM
Again, the management server isn't doing the heavy lifting in this application. The Pulp instances use MongoDB, but for the management server I currently plan to use the standard Django ORM backed by PostgreSQL. For the prototype instance, the database is actually just an SQLite3 file. I'm not quite sold on this one as yet - it's tempting to start playing with SQLAlchemy, since I've already had to hack around some of the limitations in the native ORM in order to store encrypted fields. OTOH, I already have a ton of things to do on this project, so messing with this is a long way down the priority list.
Schema and data maintenance is handled using South.
HTML Templating
The standard Django templating engine should be sufficient for my needs. As with the ORM, it's tempting to look into upgrading it to something like Jinja2, but once again 'good enough' is likely to be the deciding factor.
For data table display, I'm using Django Tables 2 and form display will use Django Uni-Form.
REST API
The REST API for the service is currently there primarily as a development aid - it lets me publish the full data model to the web as soon as it stabilises (and even while it's still in flux), even if the UI for end users hasn't been fully defined. This is particularly useful for the metadata coming back from the Pulp server, since it doesn't need much post-processing to be included as raw data in the management server's own REST API. The JSON interface will also allow much of the backend processing to be fully exercised by the test suite without worrying about web UI details.
The design of the REST API was heavily influenced by this Lessons Learned piece from the RHEV-M developers. The Django Rest Framework means I can just define the data I want to display as a list or dictionary and the framework takes care of formatting it nicely, including rendering URLs as hyperlinks.
AMQP Messaging
I haven't actually started on this aspect in any significant way, but the two main contenders I've identified are python-qpid (which is what Pulp uses) and django-celery (which would also give me an internal task queue engine, which the management server is going to need - the prototype just does everything in the Django process, which is OK for experimentation on the LAN, but clearly inadequate long term when talking to multiple sites distributed around the planet). At this early stage, I expect the internal task management aspect is going to tip the decision in favour of the latter.
Testing Regime
As the foundation for the automated testing, I'm going with Django Sane Testing (mainly based on the example of other internal Django projects). Michael Foord's mock module lets me run at least some of the tests without relying on an external Pulp instance (fortunately, the namespace conflict with Fedora's RPM building utility 'mock' was recently resolved with the latter's support library being renamed to 'mockbuild').
Continuous integration is an open question at this point. Pulp uses Jenkins for CI and I'm inclined to follow their lead. The other main possibility is to use Beaker, Red Hat's internal test system originally set up for kernel testing (one key attraction Beaker offers is the ability to set up multi-server multi-site testing in a test recipe so I can run tests over the internal WAN).
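As a rough sketch of what "testing without an external Pulp instance" can look like with mock: the count_repos helper and the list_repos method are hypothetical names invented for this example, not part of Pulp's actual client API. The import uses unittest.mock, the standard library descendant of Michael Foord's module; with the standalone package the equivalent would be a plain "import mock".

```python
from unittest import mock  # standard library descendant of the mock module


def count_repos(client):
    """Hypothetical helper: ask a Pulp-like client how many repos it knows about."""
    return len(client.list_repos())


# Stand in for the real server connection (list_repos is an assumed method name)
fake_client = mock.Mock()
fake_client.list_repos.return_value = [{"id": "fedora"}, {"id": "rhel"}]

assert count_repos(fake_client) == 2
# Check the code under test made exactly the call we expected
fake_client.list_repos.assert_called_once_with()
```

The code under test never knows it is talking to a fake, so the same helper can later be pointed at a real server in the slower integration tests.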
Packaging
Tito is a tool for generating SRPMs and RPMs directly from a Git repository. For my own packages, this is the approach I'm using (with handcrafted spec files). For some strange reason, the sysadmins around here like it when internal devs provide things as pre-packaged RPMs for deployment :)
Packaging of upstream PyPI dependencies that aren't available as Fedora or RHEL packages is still a work in progress. I experimented with Tito and git submodules (which doesn't work) and git subtrees (which does work, but is seriously ugly). My next attempt is likely to be based on py2pack, so we'll see how that goes (I actually discovered that project by searching for 'cpanspec pypi' after hearing some of the Perl folks here extolling the virtues of cpanspec for easily packaging CPAN modules as RPMs).
I also need to switch to using virtualenv to get a clearer distinction between Fedora packages I added via yum install and stuff I picked up directly from PyPI with pip.
Friday, September 09, 2011
Mirror All The Things!
After describing the project I'm working on to a few people at PyConAU and BrisPy, I decided it might be a good idea to blog about it here. I do have a bit of an ulterior motive in doing so, though - I hope people will point out when I've missed useful external resources or applications, or when something I'm planning to do doesn't make sense to the assorted Django developers I know. Yes, that's right - I'd like to make being wrong on the internet work in my favour :)
The project is purely internal at this stage, but I hope to be able to publish it as open source somewhere down the line. Even being able to post these design concepts is pretty huge for me personally, though - before starting with Red Hat a few months ago, I spent the previous 12 and a half years working in the defence industry, which is about as far from Red Hat's "Default to Open" philosophy as it's possible to get.
Mirror, Mirror, On The Wall
The project Red Hat hired me to implement is the next generation of their internal mirroring system, which is used for various tasks, such as getting built versions of RHEL out to the hardware compatibility testing labs (and, when they're large enough, returning the generated log files to the relevant development sites), or providing internal Fedora mirrors at the larger Red Hat offices (such as the one here in Brisbane).
There are various use cases and constraints that mean the mirroring system needs to operate at the filesystem level without making significant assumptions about the contents of the trees being mirrored (due to various details of the use cases involved, block level replication and approaches that rely on the transferred data being laid out in specific ways aren't viable alternatives for this project). The current incarnation of this system relies almost entirely on that venerable workhorse of the mirroring world, rsync.
However, the current system is also showing its age and has a few limitations that make it fairly awkward to work with. Notably, there's no one place to go to get an overview of the entire internal mirroring setup, and the direct use of rsync means it isn't particularly friendly with other applications when it comes to sharing WAN bandwidth and the servers involved are wasting quite a few cycles recalculating the same deltas for multiple clients. Hence, the project I am working on, which is intended to replace the existing system with something a bit more efficient and easier to manage, while also providing a better platform for adding new features.
Enter Pulp
Pulp is an open source (Python) project created by Red Hat to make it easier to manage private yum repositories. Via Katello, Pulp is one of the upstream components for Red Hat's CloudForms product.
The Pulp project is currently in the process of migrating from their original yum-specific architecture to a more general purpose Generic Content plugin architecture. It's that planned plugin architecture that makes Pulp a useful basis for the next generation internal mirroring system, which, at least for now, I am imaginatively calling pulpdist (referring to both "distribution with Pulp", since that's what the system does, and "distributed Pulp instances", since that's how the system will work).
The main components of the initial pulpdist architecture will be:
- a front-end (Django 1.3) web app providing centralised management of the entire distribution network
- custom importer and distributor plugins for Pulp to handle distribution of tree changes within the distribution network
- custom importer plugins to handle the import of trees from their original sources and generation of any additional metadata needed by the internal distribution plugins
- generic (and custom, if needed) plugins to make the trees available to the applications that need them
I'll be writing more on various details that I consider interesting as I go along. Initially, that will include my plan for the mirroring protocol to be used between the sites, as well as various decisions that need to be made when spinning up a Django project from scratch (while many of my specific answers are shaped by the target environment for internal deployment, the questions I needed to consider should be fairly widely applicable).