Wednesday, January 18, 2012

Using the SOPA protests to highlight related problems in Australia

I figure this is the easiest place to publish the message I just sent to Larissa Waters, the Greens Senator who is one of Queensland's representatives in the Federal Senate. I also wrote to Yvette D'ath (our local MHR) a few days ago, but I didn't keep a copy of that one. Will this achieve anything? Probably not, but hey, at least I tried (and if none of their constituents ever write to them about it, our reps are quite entitled to assume we're all OK with them selling out the country to legacy US media interests):
Senator Waters,

With today being the day Wikipedia and a wide range of other sites have either gone dark or taken other action to protest draconian internet censorship legislation making its way through the US Congress, it seems an opportune time to highlight our own government's ongoing concerning behaviour on that front.

Of particular concern is their continuing refusal to release details of a secretive meeting between government representatives and representatives of the same organisations that are behind the draconian US bills currently being protested. The government even deliberately excluded representatives of a number of community interest organisations that sought to attend these discussions.

These legacy media companies (aka horse drawn carriage manufacturers) are flailing around wildly as the rise of free and open digital communications networks (aka automobiles) threatens the cherished gatekeeper role they have enjoyed for the past few decades as media distributors. They have failed to adapt, and are increasingly being bypassed as artists, writers, musicians, comedians and other media creators find ways to use the power of the internet to connect more directly with their fans. These direct connections are great for both artists and fans, but place the intermediaries like YouTube, Apple iTunes, Amazon, BandCamp, Flickr, etc, in the role of service providers to the artists and fans rather than gatekeepers to widespread distribution. Unfortunately, instead of going gracefully into that good night, these organisations are investing inordinate sums of money worldwide in lobbying for legislation that would make the permissive, open practices of most of these new service providers a recipe for prohibitively high legal liabilities, effectively making those practices unsustainable and thus breaking the internet as we know it today.

Australia already markedly shifted many intellectual monopoly policies to favour the interests of US copyright holders at the expense of Australian citizens when we signed the US-FTA some time ago. We have also participated in the secretive process of drafting the Anti-Counterfeiting Trade Agreement, which spends far more time considering digital copyright infringement than it does actual counterfeiting. The current negotiations over membership in the Trans-Pacific Partnership agreement raise legitimate fears that Australia's intellectual monopoly policy will be shifted even further towards the draconian position of the United States Trade Representative, even as those policies are being protested strongly within the US itself.

In line with your published policy on community participation in government, do the Greens plan to publicly question the government over their apparent willingness to place the interest of large US companies ahead of those of individual Australian citizens?

Regards,
Nick.

Thursday, December 22, 2011

New Year Python Meme - December 2011

I'm normally a curmudgeon about this kind of thing, but I enjoyed reading some of the other posts in this series Tarek kicked off, so I decided to make my own contribution.

1. What's the coolest Python application, framework or library you have discovered in 2011?
The move to Red Hat marked my entry into the world of web development (previously I'd merely been an interested observer of that world, rather than a participant). By far my favourite discovery since making that change is django-rest-framework - with that, I can use my web browser to browse early iterations of my server's REST API directly, without needing to write custom clients to process the JSON data from APIs that are still in a state of flux.

As a service, ReadTheDocs has been an absolute revelation - between that, code hosting & issue management sites like BitBucket and GitHub and of course PyPI itself, it's now possible for an open source project to have a quite respectable web presence without the developers needing to understand anything more than Sphinx, source control and the project they're working on.

2. What new programming technique did you learn in 2011?
REST would be the big one. I'd had some general exposure to the concept in the past, but there's no substitute for sitting down and building it into a product when it comes to understanding a programming or API design technique.

3. What's the name of the open source project you contributed the most in 2011? What did you do?
CPython, by far - kibitzing on python-dev and python-ideas (and import-sig too these days), writing and reviewing several different PEPs, documentation updates, code reviews and patch applications, as well as working on my own things (including the still-in-progress integration work for the 'yield from' expression that's coming in 3.3).

I also recently started up 4 separate open source projects - 3 PyPI modules to hopefully address deficiencies I see in the current standard library offerings, plus the upstream open source project for my current development efforts at Red Hat:
  • contextlib2 (ContextStack has some potential as a new building block)
  • WalkDir (the idea here is to be the "itertools for os.walk()")
  • Shell Command (let Python handle control flow, the shell actual commands)
  • PulpDist (Bringing a semblance of order to small-scale rsync mirror networks)

4. What was the Python blog or website you read the most in 2011?
Planet Python.

5. What are the three top things you want to learn in 2012?
From a work point of view, getting my RHCSA (Red Hat Certified System Administrator) is at the top of the list. Coming up to speed on AMQP (Advanced Message Queuing Protocol) is a close second. Finally, I want to fill in more of the gaps in my very sketchy knowledge of web UI development (i.e. HTML/CSS/Javascript).

6. What are the top software, app or lib you wish someone would write in 2012?
I want to see the __preview__ namespace (in particular, the regex module) make it into Python 3.3. But that requires a volunteer to step up and write the PEP, write the code and generally champion the idea (if we have to wait for me to do it, there's no way it will happen before 3.4).

Want to do your own list? Here's how:
  1. copy-paste the questions and answer to them in your blog
  2. tweet it with the #2012pythonmeme hashtag

Friday, December 16, 2011

Help improve the Python 3.3 Standard Library...

... and hopefully help yourself with current programming projects, too.

Some recent programming activities left me underwhelmed by a few of the standard library's included batteries. This has already led to a significant revamp of the subprocess module documentation to steer new users away from the Popen Swiss army knife (unless they really need it) and to explain the commonly needed parameters more clearly. It still needs work (the notes and warnings are far too repetitive), but it at least introduces things in the right order now (high level convenience API that most people want first, lower level Popen API that some people need second).

However, for 3.3 I'd like to improve things even more in at least three areas: invocation of the system shell for administration tasks, better tools for traversing filesystem directories and programmatic management of deterministic resource cleanup (i.e. not relying on the garbage collector).

Accordingly, I have 3 projects up on PyPI (with docs on ReadTheDocs and source control and issue tracking on BitBucket):
  • WalkDir: os.walk() style iterators with file and directory filtering (both inclusion and exclusion), depth limiting and symlink loop detection, as well as convenience iterators to flatten os.walk() style iterators into a series of paths (either all walked paths, just the directories or just the files). I currently plan to make (at least some of) these part of the shutil module, but exactly what gets added will be based on the feedback I receive on this module and its API design.
  • Shell Command: Convenience APIs that combine subprocess invocation with string interpolation. Interpolated strings are escaped with shlex.quote() by default, with a custom conversion specifier ("!u", for unquoted) used to invoke the standard interpolation process. It also features an experimental API where I'm tinkering with the use of select.select() on subprocess pipes (I'm not sure it achieves a lot over simple blocking IO in its current form, though). The current plan for this API is that it will be added directly to the subprocess module (well, the stable and sensible parts will be, anyway - I still have my doubts about the select.select() experiment).
  • contextlib2: This module basically exists to let me publish and gather feedback on ContextStack, a proposed addition to contextlib for 3.3 that should make it easier to manage deterministic resource cleanup programmatically (i.e. without coupling it as directly to code layout as simple with statements do).
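To give a feel for the "quote by default, opt out with !u" idea behind Shell Command, here's a minimal sketch built on string.Formatter. This is not the module's actual API: the ShellFormatter and shell_format names are made up for illustration, and the real implementation may differ in its details.

```python
import shlex
import string


class ShellFormatter(string.Formatter):
    """Quote interpolated values for the shell unless marked with !u."""

    def convert_field(self, value, conversion):
        if conversion == "u":
            # "!u" opts out of quoting: the value is interpolated as-is
            return value
        # Apply any standard conversion (!r, !s, !a) first, then quote
        converted = super().convert_field(value, conversion)
        return shlex.quote(str(converted))


def shell_format(template, **fields):
    """Hypothetical helper: build a shell command from named fields."""
    return ShellFormatter().format(template, **fields)


# Untrusted text is quoted, so it can't smuggle in extra commands
print(shell_format("ls {path}", path="my file; rm -rf /"))
# Explicitly unquoted fields are left for the shell to interpret
print(shell_format("echo {var!u}", var="$HOME"))
```

The attraction of hooking into the conversion step is that the safe behaviour is the default, and the unsafe one has to be requested field by field.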

Feedback on any and all of these is appreciated, either here or on the respective issue trackers. It isn't a foregone conclusion that any of these APIs will be added at all, so examples of real world use cases would definitely be helpful.

Saturday, October 08, 2011

Correcting ignorance: learning a bit about Ruby blocks

Gary Bernhardt pointed out at PyCodeConf that I didn't know Ruby even half as well as I should if I wanted to really understand why Ruby programmers rave about blocks so much (I started this before his talk, but it touches on his key point about the centrality of blocks to Ruby's design, and Python's lack of a similarly endemic model for code interleaving). So I set about trying to fix that (at least, to the extent I can in 24 hours or so). Unsurprisingly (since I'm not interested in becoming a Ruby programmer at this point in time), I approached this task more in terms of what it could teach me about Python (and its limitations) rather than in figuring out the full ins and outs of idiomatic Ruby. So feel free to bring it up in the comments if you think I've fundamentally mischaracterised some aspect of Ruby here.

The first distinction: two kinds of function

It turns out the first distinction shows up at quite a fundamental level. Ruby has two kinds of function: named methods and anonymous procedures. The semantics of these are quite different, most notably that named methods create their own local namespace, while anonymous procedures just use the namespace of the method that created them (so they're almost like ordinary local code).

Python also has two kinds of function: ordinary functions and generator functions. The name binding semantics are identical, but the invocation style and semantics are very different. Lambda expressions and generator expressions provide syntax for defining these inside an expression, but under the hood the semantics are still the same as those of the statement versions.

The closest you can get to a Ruby style anonymous procedure in Python is to create a named inner function and declare every otherwise local variable explicitly 'nonlocal' (in Python 3 - nonlocal declarations aren't available in Python 2). Then all name binding operations in the inner scope would also affect those names in the outer scope.
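As a rough illustration of that workaround (Python 3 only, with function and variable names invented for this sketch), the inner function below rebinds names in the enclosing scope much as a Ruby procedure would:

```python
def summarise(values):
    total = 0
    count = 0

    def add(value):
        # Without this declaration, the augmented assignments below would
        # treat total and count as locals of add() and fail with
        # UnboundLocalError rather than updating the outer names
        nonlocal total, count
        total += value
        count += 1

    for value in values:
        add(value)
    return total, count


print(summarise([1, 2, 3]))
```

Note that the sharing has to be requested one name at a time, which is the opposite default to the one Ruby procedures use.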

Actually, make that three kinds of function

The named method vs anonymous procedure distinction actually doesn't fully capture Ruby's semantics. Blocks (which is what I was most interested in learning about) add a new set of semantics that don't apply to the full object versions: they not only use the namespace of the defining method for their local variables, but their parameters are pass-by-reference (so they can rebind names in the calling namespace) and their control flow can affect the calling method (i.e. a return from a block will cause the calling method to return, not just the block itself). While somewhat interesting, I don't think these are actually all that significant - the core semantic difference is the one between Ruby's anonymous closures and Python's generators, not the dynamic binding behaviour of blocks.

The implications: blocks versus coroutines

This initial difference in the object model for code execution has created a fundamental difference in the way the two languages approach the problem of interleaving distinct pieces of code. The Ruby way is to define a separate piece of the current function that can be passed to other code and invoked as if it was still inline in its original location, with execution of the original function resuming when the called operation is complete. The Python way is to suspend execution, hand control back to the invoking piece of code, and then resume execution of the current code block at a later time (as determined by the invoking code).
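The Python half of that comparison can be sketched with a plain generator. The worker and driver names are just for illustration; the point is that the invoking code decides when each suspended section resumes:

```python
def worker(log):
    log.append("worker: step 1")
    yield  # suspend and hand control back to the invoking code
    log.append("worker: step 2")
    yield
    log.append("worker: done")


def driver():
    log = []
    task = worker(log)
    next(task)  # run the worker up to its first yield
    log.append("driver: between steps")
    next(task)  # resume the worker until its second yield
    log.append("driver: finishing up")
    for _ in task:  # exhausting the generator runs its final section
        pass
    return log


for entry in driver():
    print(entry)
```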

Hence, where Ruby has specific syntactic sugar for passing a block of code to another method (do-end), Python instead has syntactic sugar for various invocation styles for coroutines (iteration via for loops, transactional code via with statements).

It's also the case that coroutines are not (yet) as deeply bound into Python's semantics as blocks are into Ruby. Whereas Ruby had blocks from the beginning and defined key programming constructs in terms of them (such as iteration and transactional style code via blocks), Python instead is built around various task specific protocols that may *optionally* be implemented in terms of coroutines (e.g. for loops, the iterator protocol and generators, the with statement, the context manager protocol and the contextlib.contextmanager decorator applied to a generator).
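The "optionally" is worth a quick sketch. Both versions below implement the context manager protocol for a toy transaction (all names invented for this example): one as a generator wrapped in contextlib.contextmanager, the other as an ordinary class with no coroutine in sight. For ordinary exceptions the with statement can't tell them apart.

```python
from contextlib import contextmanager


@contextmanager
def transaction(log):
    """Generator based version: the with block runs at the yield."""
    log.append("begin")
    try:
        yield log
    except Exception:
        log.append("rollback")
        raise
    else:
        log.append("commit")


class Transaction:
    """Class based version: implements the protocol methods directly."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("begin")
        return self.log

    def __exit__(self, exc_type, exc_value, traceback):
        self.log.append("commit" if exc_type is None else "rollback")
        return False  # never suppress the exception


def run(manager_factory, fail):
    log = []
    try:
        with manager_factory(log) as active_log:
            active_log.append("work")
            if fail:
                raise ValueError("something went wrong")
    except ValueError:
        pass
    return log


print(run(transaction, fail=False))
print(run(Transaction, fail=True))
```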

Callback programming and hidden control flow

One interesting outgrowth of the Ruby approach is that callback programming actually becomes a fairly natural extension of the way the language works - since programming with blocks is callback style programming, the invoking code doesn't really care if the called method runs the passed in block immediately or at some later time. Whether you consider this a good thing or a bad thing is going to depend on how you feel about the merits and dangers of hidden control flow.

During the discussions that led to the introduction of the with statement in Python 2.5, Guido made a clear, conscious design decision: he wanted the possible flows of control through the function body to be visible locally inside a function, without being dependent on the definitions of other methods (raising exceptions, of course, being an exception - catching them, though, largely obeys this guideline). Most code is run immediately, code in if statements and exception handlers is run zero or one times, code in loops is run zero or more times, code in nested function definitions is executed at some later time when the function is called. The Ruby blocks design is the antithesis of this: your control flow is entirely dependent on the methods you call. The downside of wanting visible control flow, of course, is that iteration, transactional code and callback programming all end up looking different at the point of invocation. (If you read PEP 340, Guido's original proposal for what eventually became the with statement, and contrast it with PEP 343, the version that we finally implemented, you'll see that his original idea was a fair bit closer to Ruby's blocks in power and scope).

So Ruby's flexibility comes at a price: when you pass a block to a method, you need to know what that method does in order to know how it affects your local control flow. Naming conventions can help reduce that complexity (such as the .each convention for iteration), but it does move control flow into the domain of programming conventions rather than the language definition.

On the other hand, Python's choice of explicit control flow comes at a price in flexibility: callback programming looks starkly different to ordinary programming as you have to construct explicit closures in order to pass chunks of code around.
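A small sketch of what that looks like in practice. The make_scheduler helper is a toy stand-in for a real event loop (it isn't from any library), but it shows how the code to be run later has to be wrapped up in an explicitly defined function before it can be handed over:

```python
def make_scheduler():
    """Toy stand-in for an event loop: queue callbacks, run them later."""
    pending = []

    def call_later(callback):
        pending.append(callback)

    def run_pending():
        while pending:
            pending.pop(0)()

    return call_later, run_pending


def process(items):
    results = []
    call_later, run_pending = make_scheduler()
    for item in items:
        # The chunk of code to run later must be spelled out as a closure.
        # The default argument captures the current value of item.
        def on_ready(item=item):
            results.append(item * 2)

        call_later(on_ready)
    run_pending()  # only now do the callbacks actually execute
    return results


print(process([1, 2, 3]))
```

The upside is that it's obvious from the function body alone that on_ready doesn't run inline. The downside is that it looks nothing like the equivalent for loop.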

Two way data flow

With their functional API, blocks natively supported two-way data flow from the beginning: data was passed in by calling them, and then either returned as the result of the block or by manipulating the passed in name bindings.

By contrast, Python's generators were originally output only, reflecting their target use case of iteration. You could input some initial data via parameters, but couldn't readily supply data to a running calculation. This has started to change in recent years, as generators now provide send() and throw() methods to pass data back in, and yield became an expression in order to provide access to the 'send()' argument. However, these features do not, at this stage, have deep syntactic support - there's a fairly obvious mapping from continue to send() and break to throw() that would tie them into the for loop syntax, but this capability has not garnered significant support when it has been brought up (I believe because it doesn't really help with the last major code execution model that Python doesn't provide nice native support for: callback programming).
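For anyone who hasn't played with these methods, here's a minimal sketch (the running_average example is my own, not from any PEP). send() supplies the value of the paused yield expression, while throw() raises an exception at that same point:

```python
def running_average():
    total = 0.0
    count = 0
    average = None
    while True:
        try:
            # yield is an expression: it evaluates to the send() argument
            value = yield average
        except ValueError:
            # throw() raises its exception at the paused yield;
            # here it is used as a "reset" signal
            total, count, average = 0.0, 0, None
            continue
        total += value
        count += 1
        average = total / count


averager = running_average()
next(averager)  # advance to the first yield before sending data
first = averager.send(10)
second = averager.send(20)
after_reset = averager.throw(ValueError("reset"))
third = averager.send(4)
print(first, second, after_reset, third)
```

Note the explicit next() call needed to get things started, which is one of the places where the lack of deeper syntactic support shows through.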

In Python 3.3, generators will gain the ability to return values, and better syntax for invoking them and getting that value, moving the language even further towards full coroutine support (see PEP 380 for details). However, that is merely the next step along the path rather than arrival at the destination.
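Here's a sketch of the style of code PEP 380 is intended to enable, assuming an interpreter that implements it (all of the names are invented for this example). The inner generator returns a value, and 'yield from' both delegates send() calls to it and collects that value:

```python
def collect_until_blank():
    lines = []
    while True:
        line = yield
        if not line:
            return lines  # becomes the value of the 'yield from' expression
        lines.append(line)


def read_two_sections(results):
    # Values sent to this generator are passed through to the subgenerator
    results.append((yield from collect_until_blank()))
    results.append((yield from collect_until_blank()))


def feed(lines):
    results = []
    reader = read_two_sections(results)
    next(reader)  # advance to the first yield in the subgenerator
    try:
        for line in lines:
            reader.send(line)
    except StopIteration:
        pass  # the reader finished after its second section
    return results


print(feed(["a", "b", "", "c", ""]))
```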

Reinventing blocks

I think the folks who accuse us of (slowly) reinventing blocks have a valid point - Python really is on the path of devising ways to handle tasks neatly with coroutines (i.e. functions that can be suspended and transparently resumed later without losing any internal state) that Ruby handles via blocks (i.e. extracting arbitrary fragments of a function body and passing them to other code). The fact that generators were not built into Python from the outset but instead have been added later to make certain kinds of code easier to write does show through in a variety of ways - coroutine based code often doesn't play nicely with ordinary imperative code and vice-versa.

Ruby's way has a definite elegance to it (despite the hidden control flow). I think aspiring to that kind of elegance for callback programming in Python would be a good thing, even if the semantic model is completely different (i.e. coroutine based rather than block based). The addition of actual block functionality remains unlikely, however - if they were as powerful as Ruby blocks, then it would create two ways to do too many things (with no obvious criteria to choose between the current technique and the block based technique), but if they were strictly less powerful, then reusing Ruby's block terminology would likely be confusing rather than enlightening. For better or for worse, Python is now well down the path of coroutine based programming and we likely need to see how far we can take that model rather than trying to shoehorn in yet another approach.

Wednesday, September 28, 2011

Spinning up the pulpdist project

One novel aspect of the pulpdist project is that it is starting with an almost completely blank slate from a technology point of view (aside from the decision to use Pulp as the main component of the mirroring network). Red Hat does have development standards for internal projects, of course (especially in the messaging space), but they're fairly flexible, leaving the individual tool development teams with a lot of options. If something ships with Fedora and/or RHEL, or is available under licensing terms that would be acceptable for inclusion in Fedora (and subsequently RHEL), then it's fair game.

This post focuses on the design of the management server. I'll write up a separate post looking at the currently planned design for the Pulp data transfer plugins.

Source Control

Unsurprisingly, Red Hat's internal processes are heavily influenced by Linux kernel processes. Accordingly, the source control tool of choice for new projects is Git. While I have a slight preference for Mercurial (due mainly to familiarity), I'm happy enough with any DVCS, so Git it is.

Primary Development Language

Python, of course. You don't hire a CPython core developer to get them to work on a Ruby or Perl project (although the current system I'm replacing was written in Perl). As a web application, there will naturally be some Javascript and CSS involved as well.

Web Framework

The main management application for pulpdist is going to be a full-scale web application. User profiles and authentication, database storage, communication with other web services, provision of a REST API, integration with the engineering tools messaging bus. Basically, micro-frameworks need not apply.

While I expect Pyramid/Pylons would also have been able to do the job, I decided to go with Django 1.3. This was heavily influenced by social factors: I know a lot of Django devs that I can bug for advice, but the same is not true for Pyramid. The complexity of the whole Pyramid/Pylons/TurboGears setup is also not appealing - while veteran web developers may find the "you decide" approach a selling point, Django's batteries included approach makes it far simpler to get started quickly, and decide as I go along which pieces I should keep, discard or replace.

I've heard some experienced Django developers muttering complaints about the class based views design in 1.3, but as someone coming in as an experienced Python developer and a relatively noobish web developer, the CBV approach seems eminently sensible, while the old function based approach looks repetitive and insane. Object oriented programming was invented for a reason!

I'll admit that my perception may be biased by knowing exactly how to make multiple inheritance work the way I want it to, though :)

Web Server

The management server doesn't actually have that much work to do, so the basic Apache+mod_wsgi configuration will serve as an adequate starting point (any heavy lifting will be done by the individual Pulp instances, and the main data traffic on those doesn't run through their web service). WSGI provides the flexibility to revisit this later if needed.

I've also punted on any web caching questions for now - the management server is low traffic and once the access to the Pulp sites is pushed out to a backend service, it should be fast enough at least for the early iterations.

Authentication & Authorisation

The actual user authentication task will be handed off to Apache and all management application access will be restricted to Kerberos authenticated users over SSL. Django's own permissions systems will be used to handle authorisation restrictions. (The experimental prototype will use Basic Auth instead, since it is the Apache/Django integration the prototype needs to cover, not the Apache configuration for SSL and Kerberos authentication)
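For the curious, one way to wire up the Django side of this is Django's built-in support for the REMOTE_USER variable set by the web server. The settings fragment below is a sketch of that approach using the Django 1.3 era setting names. It's an assumption about how the integration could look, not a copy of the project's real settings file.

```python
# settings.py fragment (assumed layout, not the project's actual settings)

MIDDLEWARE_CLASSES = (
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    # Must be listed after AuthenticationMiddleware: picks up the user
    # name that Apache has already authenticated (REMOTE_USER)
    "django.contrib.auth.middleware.RemoteUserMiddleware",
)

AUTHENTICATION_BACKENDS = (
    # Trusts the web server's authentication instead of checking a
    # password stored in Django's own database
    "django.contrib.auth.backends.RemoteUserBackend",
)
```

With that in place, Apache decides who the user is, while Django's permission system still decides what they're allowed to do.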

Integration with Pulp's user access controls is via OAuth, but the design for configuration of user permissions in the Pulp servers is still TBD.

Database and ORM

Again, the management server isn't doing the heavy lifting in this application. The Pulp instances use MongoDB, but for the management server I currently plan to use the standard Django ORM backed by PostgreSQL. For the prototype instance, the database is actually just an SQLite3 file. I'm not quite sold on this one as yet - it's tempting to start playing with SQLAlchemy, since I've already had to hack around some of the limitations in the native ORM in order to store encrypted fields. OTOH, I already have a ton of things to do on this project, so messing with this is a long way down the priority list.

Schema and data maintenance is handled using South.

HTML Templating

The standard Django templating engine should be sufficient for my needs. As with the ORM, it's tempting to look into upgrading it to something like Jinja2, but once again 'good enough' is likely to be the deciding factor.

For data table display, I'm using Django Tables 2 and form display will use Django Uni-Form.

REST API

The REST API for the service is currently there primarily as a development aid - it lets me publish the full data model to the web as soon as it stabilises (and even while it's still in flux), even if the UI for end users hasn't been fully defined. This is particularly useful for the metadata coming back from the Pulp server, since it doesn't need much post-processing to be included as raw data in the management server's own REST API. The JSON interface will also allow much of the backend processing to be fully exercised by the test suite without worrying about web UI details.

The design of the REST API was heavily influenced by this Lessons Learned piece from the RHEV-M developers. The Django Rest Framework means I can just define the data I want to display as a list or dictionary and the framework takes care of formatting it nicely, including rendering URLs as hyperlinks.

AMQP Messaging

I haven't actually started on this aspect in any significant way, but the two main contenders I've identified are python-qpid (which is what Pulp uses) and django-celery (which would also give me an internal task queue engine, which the management server is going to need - the prototype just does everything in the Django process, which is OK for experimentation on the LAN, but clearly inadequate long term when talking to multiple sites distributed around the planet). At this early stage, I expect the internal task management aspect is going to tip the decision in favour of the latter.

Testing Regime

As the foundation for the automated testing, I'm going with Django Sane Testing (mainly based on the example of other internal Django projects). Michael Foord's mock module lets me run at least some of the tests without relying on an external Pulp instance (fortunately, the namespace conflict with Fedora's RPM building utility 'mock' was recently resolved with the latter's support library being renamed to 'mockbuild').
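The general shape of such a test is sketched below. To keep the example self-contained it imports the copy of mock bundled with newer Python versions as unittest.mock rather than the standalone package, and both repo_names() and the client's get_repos() method are hypothetical names rather than anything from Pulp's real client API.

```python
from unittest import mock


def repo_names(client):
    """Hypothetical helper: sorted repo ids reported by a Pulp-like client."""
    return sorted(repo["id"] for repo in client.get_repos())


def test_repo_names():
    # The Mock stands in for a client that would normally talk to a
    # live Pulp server over the network
    fake_client = mock.Mock()
    fake_client.get_repos.return_value = [{"id": "rhel"}, {"id": "fedora"}]

    assert repo_names(fake_client) == ["fedora", "rhel"]
    # Also check that the code under test queried the server exactly once
    fake_client.get_repos.assert_called_once_with()


test_repo_names()
```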

Continuous integration is an open question at this point. Pulp uses Jenkins for CI and I'm inclined to follow their lead. The other main possibility is to use Beaker, Red Hat's internal test system originally set up for kernel testing (one key attraction Beaker offers is the ability to set up multi-server multi-site testing in a test recipe so I can run tests over the internal WAN).

Packaging

Tito is a tool for generating SRPMs and RPMs directly from a Git repository. For my own packages, this is the approach I'm using (with handcrafted spec files). For some strange reason, the sysadmins around here like it when internal devs provide things as pre-packaged RPMs for deployment :)

Packaging of upstream PyPI dependencies that aren't available as Fedora or RHEL packages is still a work in progress. I experimented with Tito and git submodules (which doesn't work) and git subtrees (which does work, but is seriously ugly). My next attempt is likely to be based on py2pack, so we'll see how that goes (I actually discovered that project by searching for 'cpanspec pypi' after hearing some of the Perl folks here extolling the virtues of cpanspec for easily packaging CPAN modules as RPMs).

I also need to switch to using virtualenv to get a clearer distinction between Fedora packages I added via yum install and stuff I picked up directly from PyPI with pip.


Friday, September 09, 2011

Mirror All The Things!

After describing the project I'm working on to a few people at PyConAU and BrisPy, I decided it might be a good idea to blog about it here. I do have a bit of an ulterior motive in doing so, though - I hope people will point out when I've missed useful external resources or applications, or when something I'm planning to do doesn't make sense to the assorted Django developers I know. Yes, that's right - I'd like to make being wrong on the internet work in my favour :)

The project is purely internal at this stage, but I hope to be able to publish it as open source somewhere down the line. Even being able to post these design concepts is pretty huge for me personally, though - before starting with Red Hat a few months ago, I spent the previous 12 and a half years working in the defence industry, which is about as far from Red Hat's "Default to Open" philosophy as it's possible to get.

Mirror, Mirror, On The Wall

The project Red Hat hired me to implement is the next generation of their internal mirroring system, which is used for various tasks, such as getting built versions of RHEL out to the hardware compatibility testing labs (and, when they're large enough, returning the generated log files to the relevant development sites), or providing internal Fedora mirrors at the larger Red Hat offices (such as the one here in Brisbane).

There are various use cases and constraints that mean the mirroring system needs to operate at the filesystem level without making significant assumptions about the contents of the trees being mirrored (due to various details of the use cases involved, block level replication and approaches that rely on the transferred data being laid out in specific ways aren't viable alternatives for this project). The current incarnation of this system relies almost entirely on that venerable workhorse of the mirroring world, rsync.

However, the current system is also showing its age and has a few limitations that make it fairly awkward to work with. Notably, there's no one place to go to get an overview of the entire internal mirroring setup, the direct use of rsync means it isn't particularly friendly with other applications when it comes to sharing WAN bandwidth, and the servers involved are wasting quite a few cycles recalculating the same deltas for multiple clients. Hence the project I am working on, which is intended to replace the existing system with something a bit more efficient and easier to manage, while also providing a better platform for adding new features.

Enter Pulp

Pulp is an open source (Python) project created by Red Hat to make it easier to manage private yum repositories. Via Katello, Pulp is one of the upstream components for Red Hat's CloudForms product.

The Pulp project is currently in the process of migrating from their original yum-specific architecture to a more general purpose Generic Content plugin architecture. It's that planned plugin architecture that makes Pulp a useful basis for the next generation internal mirroring system, which, at least for now, I am imaginatively calling pulpdist (referring to both "distribution with Pulp", since that's what the system does, and "distributed Pulp instances", since that's how the system will work).

The main components of the initial pulpdist architecture will be:
  • a front-end (Django 1.3) web app providing centralised management of the entire distribution network
  • custom importer and distributor plugins for Pulp to handle distribution of tree changes within the distribution network
  • custom importer plugins to handle the import of trees from their original sources and generation of any additional metadata needed by the internal distribution plugins
  • generic (and custom, if needed) plugins to make the trees available to the applications that need them

I'll be writing more on various details that I consider interesting as I go along. Initially, that will include my plan for the mirroring protocol to be used between the sites, as well as various decisions that need to be made when spinning up a Django project from scratch (while many of my specific answers are shaped by the target environment for internal deployment, the questions I needed to consider should be fairly widely applicable).

Saturday, August 27, 2011

Open Source, Windows and Teaching Python to New Developers

A few questions and incidents recently prompted me to reflect on why I don't help with CPython support on Windows, even though I use Windows happily enough on my gaming system. Since this ended up being a rather pro-Linux article and upfront disclosure is a good thing, I'll note that while I do work for Red Hat now, that's a very recent thing - my adoption of Linux as my preferred development platform dates back to 2004 or so. I work for Red Hat because I like Linux, not the other way around :)

The Availability of Professional Development Tools

I don't make a secret of my dislike of Windows as a hobbyist development platform. While Microsoft have improved things in recent years (primarily by releasing the Express editions of Visual Studio), there's still a huge difference between an operating system like GNU/Linux, which was built by developers for developers based on a foundation that was built by academics for academics, and Windows, which was built by a company that used deals with computer manufacturers to get it into end users' hands regardless of technical merit. Developers were forced to follow in order to reach that large installed user base. Those different histories are reflected in the different development cultures that surround the respective platforms.

To get the same tool chain that professional Linux companies use, you don't need to do anything special - Linux distributions include the tools used to create them. If you have a distribution, you have everything you need to build applications for that distribution, including documentation. With the open source nature of the platform and almost all of the software (the occasional binary driver notwithstanding), there's a vast range of tools out there to help you get things done (although sorting through the mass can be a little tricky sometimes, since it can be hard to tell the difference between stuff that doesn't exist and stuff that exists, but hasn't been uncovered by your research).

As far as I'm aware, Mac OS X isn't quite as generous with freely available development utilities, but isn't all that far off the Linux approach (I'm not a Mac user or developer though, so there may be more hurdles than I am aware of - I recall some muttering about Apple beginning to charge a small fee for Xcode. My opinion is based mostly on the fact that it seems pretty easy to find open source devs that use Macs). With the POSIX-ish underpinnings, many of the utilities from the *nix world also work in this environment.

The minimum realistic standard for professional Windows development, though, is an MSDN subscription (to get full access to the OS documentation and various utilities), along with a professional copy of Visual Studio. The tools available for free (including the Express editions of Visual Studio) are clearly second rate. Even when the tools themselves are OK, the licensing restrictions on the applications they create may make them practically useless (and MS have the gall to call the GPL viral - at least the gcc team don't restrict how you license and distribute the binaries it creates). So why should a hobbyist develop for a system that thinks they should pay substantial sums for the privilege of developing for it, instead of one that welcomes all contributors, providing not only the end product, but the ingredients and recipes all for free?

At the recent PyConAU sprints, one of the contributors (an existing Linux user who happened to have a Windows-only laptop with them) became frustrated with getting all the necessary tools set up to work properly on Windows (configuring git+ssh for read/write access to a GitHub repo was one key point of irritation), and decided to dual boot Ubuntu on the machine instead. Twenty minutes later, she was up and running and hacking on the project she originally wanted to hack on. Granted, she already knew how to use Linux, but seriously, there's something fundamentally wrong with a platform when installing and dual-booting to a different OS is the easiest way to get a decent development environment up and running.

All that ends up putting cross-platform languages like Python in an interesting position: when developing with Python, you can often get away with not understanding the underlying details of your operating system, because the language runtime tries to provide a largely standardised interface on all platforms. However, many open source developers either don't use Windows at all, or genuinely dislike programming for it, so the burden of making things work properly on Windows falls on the shoulders of a comparatively small number of people, either those who genuinely like programming for the platform (yes, such people exist, I'm just not one of them), or those that are looking for any niche where they can usefully contribute and are happy enough to take on the task of improving Windows compatibility and support.

I don't have particularly hard numbers to back this up (other than the skew in core developer numbers vs overall OS popularity), but my intuition is that, at least for CPython, the user:core developer ratio is orders of magnitude higher for Windows than it is for Linux or Mac OS X.

The Implications for Teaching Python on Windows

Something cool that is going on at the moment is that a lot of folks are interested in the idea of teaching more people how to program with Python as the language used. However, the potential students (young and old) that they are wanting to teach often don't have any development experience at all and are using the most common consumer operating system (i.e. Windows). So good Windows support and an easy installation experience are important considerations for these instructors. A request that is frequently made (with varying levels of respect and politeness) is that the official python.org Windows installer be updated to automatically adjust the PATH (or at least provide the option to do so), so that Python can be launched from the command line by typing "python" instead of something like "C:\Python27\python".
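
To make the PATH request a little more concrete, here is a minimal Python sketch of the kind of lookup the command prompt performs when you type a bare command name. It is a simplification rather than a description of what Windows actually does (the real lookup also checks the current directory and consults PATHEXT), and the find_on_path helper, the pretend set of installed files and the example PATH values are all made up for illustration.

```python
import ntpath  # Windows path rules, usable on any platform


def find_on_path(command, path_value, is_file, extensions=(".exe", ".bat", ".cmd")):
    """Return the first match for `command` in a Windows-style PATH string.

    Hypothetical helper: walks each PATH directory in order and tries each
    extension, returning None if nothing matches.
    """
    for directory in path_value.split(";"):
        if not directory:
            continue  # skip empty entries such as a trailing ";"
        for ext in extensions:
            candidate = ntpath.join(directory, command + ext)
            if is_file(candidate):
                return candidate
    return None


# Pretend file system (an assumption for the example, not a real machine)
installed = {r"C:\Python27\python.exe", r"C:\Windows\System32\cmd.exe"}


def is_file(path):
    return path in installed


# PATH as it might look before the installer touches it
default_path = r"C:\Windows\System32;C:\Windows"
# PATH after the Python install directory has been appended
updated_path = default_path + r";C:\Python27"

print(find_on_path("python", default_path, is_file))  # None
print(find_on_path("python", updated_path, is_file))  # C:\Python27\python.exe
```

With the default PATH the lookup comes back empty, which is why beginners end up typing the full "C:\Python27\python". Once the install directory is appended, the bare name resolves.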

If educators want that PATH handling right now, their best bet is actually to direct their students towards the Windows versions of ActiveState's ActivePython Community Edition. ActiveState add a few things to the standard installer, like PATH manipulation and additional packages (such as pywin32). They also bundle PyPM, which is a decent tool for getting PyPI packages on to Windows machines (at least, I've heard good things about it - I haven't actually used it myself). That said, I believe I may need to caveat that recommendation a bit: as near as I can tell from their website, PyPM has been deliberately disabled for their 64-bit Windows Community Edition installer. Still, even in that case, you can easily grab additional packages direct from PyPI via "pip install" on the command line.

Brian Curtin is working on adding optional PATH manipulation to the python.org installer for 3.3, and there's a chance such a change might be backported to the next maintenance releases for 3.2 and 2.7 (no promises, though). Even if it does make it in, it will still be a while before the change is part of a binary release (especially given that Brian has only just started tinkering with it).

This is clearly a nice thing for beginners, especially those that aren't in the habit of tinkering with their OS settings, but I do honestly wonder how much of a difference it will make in the long run. In many ways, software development is one long exercise in frustration. You decide you want to fix bug X. But it turns out bug X is really due to bug Y. You could work around Y just to fix X, but the bigger bug would still be there. But then you discover that fixing bug Y properly requires feature Z, which doesn't exist yet, so a workaround (even an ugly one) starts to sound pretty attractive. "Yak shaving" (the highly technical term for things that you're working on solely because they're a prerequisite for what you actually want to be working on) is so common it's almost the norm rather than the exception. The many and varied frustrations of trying to use Windows as a hobbyist open source developer also won't magically go away just because the python.org installer starts automating one environment variable update - as soon as people are introduced to sites like GitHub and BitBucket, they'll get to discover the joy that is SSH and source control on Windows. If they get past that hurdle, they'll likely start to encounter the multitude of open source projects that don't even offer Windows installers (if the project supports Windows at all), because their Windows developer count stands at a grand total of zero.

Final Thoughts

I hope the people teaching Python to beginners on Windows and the folks working on improving Windows support don't take this article as an attack on their efforts. I find both goals to be quite admirable, and wish those involved all the success they can find. But there are reasons I abandoned Windows as a personal development platform ~7 years ago and taught myself to use Linux instead. As far as I can tell, most of those reasons remain valid today, even after Microsoft started releasing the Express versions of Visual Studio in an attempt to stem the flood of hobbyist developers jumping ship.

The other day I called the relative lack of Windows developers in open source a vicious cycle and I stand by that. If someone can learn to program, mastering Linux is going to be comparatively easy. For anyone seriously interested in open source development, using Linux (even in a virtual machine, the way I do on my gaming laptop) is by far the path of least resistance. Getting more Windows developers in open source requires people who care enough about Windows as a platform that they don't just switch to Linux, but who also care enough about open source to start contributing at all, and that seems to be a genuinely rare combination.