04 September 2011

iPhoto and durable photo management

When managing your digital photos, there are three things you really should do:
  • Make backups of your backups of your backups.  These are your photos, don't mess about here - make backups regularly and store one of your backups someplace remote.
  • Use JPG for your file formats.  If you end up with a camera saving in some goofy format, either convert to JPG or get a new camera.  JPG is to photos what MP3 is to music - a durable format that will be around for a very long time and is supported by lots of tools.
  • Don't use software to organize your photos for you.  Use a simple directory format.  Software and their proprietary organization approaches will come and go but simple directories and folders will be around for a long time to come.  The following blog article is on this last point.
In 1999, when I bought my HP Photosmart C30, I started organizing my photos in a simple directory structure: Photos/YYYY/YYYYMMDD.  For example, photos taken on 4 September 2011 go in Photos/2011/20110904.

Over the years I would try whatever photo management software came with new cameras or emerging open source packages.  I always regretted it as the photos would end up being hidden inside of some database and difficult to extract.  I would end up maintaining photos in two locations - the photo software and my simple archive.  That is, until I would get rid of the software and proprietary database files.

All that changed when I drank the Apple, iPhone and iPhoto koolaid when the iPhone came out.  The iPhoto software hits the magic "good enough" point Apple does so well and the integration with the iPhone Just Worked.  For the first few months I maintained iPhone photos in two locations - iPhoto and my simple archive.  But as I started using iPhoto more and more, using it to maintain photo albums, I just got lazy and one day stopped copying files from iPhoto into the archive.

I did continue to use my simple archive for regular camera photos but of course over time I started using the phone more and more for photos and conventional cameras only for special occasions like holidays.  I would pull in other camera pictures from my simple archive into iPhoto to create albums.

I kept telling myself I could extract the photos any time - you can enter the iPhoto database structure as a filesystem (~/Pictures/iPhoto Library - "Show Package Contents"), similar to my simple archive approach.  iPhoto's File->Export function was there of course, but it would not properly set the file modified date, which made it harder to sort photos into the right folder without looking at header data in the JPG.  Meanwhile, time goes on, and nothing lasts forever, maybe not even iPhoto.

That brings us to today and here's my strategy:  I've accepted that iPhoto will hold the "master" version of all iPhone (and more recently Samsung Galaxy S2) photos, and I'll occasionally extract these photos from iPhoto and fold them into my simple archive.

Here is how I did it:

1. Within iPhoto, Command-F search all your Photos for "iPhone".  Assuming you haven't otherwise marked your photos (title, tags, event names, ...) with "iPhone", iPhoto gives you a subset of your Photos created with the iPhone.

2. Within iPhoto, use File-Export to save these photos to the filesystem.  They'll be saved as one big set in whatever directory you specify.  I named the files with prefix "iphone" and selected the option to sequentially number them.  That way I can always go searching in my simple archive for iPhone photos by filename if I need to.

3. Use the MacPorts "port" command to install jhead.  Run "jhead -ft *.JPG *.jpg" at the shell to correct all the modified dates so your files are date/time stamped with the date/time the picture was taken.  I also had a handful of .PNG (screen captures) and .MOV (movies) files and I just left these dates as they were.  Probably wrong, but I only had a few.

4.  I wrote a small shell script to organize the extracted pictures by date to match my simple archive format.  Here it is:

#!/bin/sh
# orgpix.sh - organize iPhoto exported photos by date
# Files will be moved into a directory structure like this:
#    ...
# ...

# Place this file in the directory containing all the files you want
# to organize and run it from there.

for fn in *.jpg *.JPG; do
   # Skip a pattern that matched no files
   [ -e "$fn" ] || continue
   # BSD/macOS stat: print the modification time (the one "jhead -ft"
   # corrected) as "YYYY MM DD" and split it into $1 $2 $3
   set -- $(stat -f "%Sm" -t "%Y %m %d" "$fn")
   year=$1
   month=$2
   day=$3
   echo "$fn : $year $month $day"
   # -p creates the year directory too and is silent if it already exists
   mkdir -p "${year}/${year}${month}${day}"
   mv "$fn" "${year}/${year}${month}${day}"
done

5. Used cp to merge the newly organized files and directories into my main photo archive:

cp -Rvpn 2008/ /Path/to/Photo/Archives/2008

(I was tentative.  I ran it for each year by hand as I wanted to make sure it was working properly by checking a few file and directory creations as I went.  This could be scripted as well.)
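
The year-by-year merge could be scripted along these lines.  This is only a sketch: the function name merge_years is hypothetical, and it assumes the organized year directories (2008, 2009 and so on) sit directly under the source directory.  It keeps the -n flag from the command above so files already in the archive are never overwritten, and drops -v to keep the output quiet.

```shell
#!/bin/sh
# merge_years SRC DEST  (hypothetical helper, not part of the original post)
# Copies every four-digit year directory under SRC into DEST/<year>,
# preserving timestamps (-p) and never overwriting existing files (-n).
merge_years() {
    src=$1
    dest=$2
    for dir in "$src"/[0-9][0-9][0-9][0-9]; do
        # Skip the literal pattern if no year directory matched
        [ -d "$dir" ] || continue
        year=$(basename "$dir")
        mkdir -p "$dest/$year"
        # "$dir/." copies the contents of the year directory,
        # not the directory itself
        cp -Rpn "$dir/." "$dest/$year/"
    done
}
```

Run from the directory holding the organized photos, a call would look like: merge_years . /Path/to/Photo/Archives.  Checking a few directories by hand afterwards is still a good idea.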

So there you go.  All photos taken by the iPhone extracted from iPhoto and merged into my simple but very durable photo archives structure.

I repeated the above to extract my Samsung Galaxy S2 (Command-F search for "GT-I9100") photos as well.

Postnote: If you can't subset out the photos taken by the iPhone within iPhoto, you could just extract everything and use jhead's EXIF header reading to determine what camera took a photo.
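
A rough sketch of that idea follows.  It assumes jhead prints its usual summary line of the form "Camera model : <name>" for each file; the helper names camera_model and list_by_camera are made up for illustration.

```shell
#!/bin/sh
# camera_model: read jhead's text output on stdin and print the camera model.
# Assumes a summary line of the form "Camera model : <name>".
camera_model() {
    sed -n 's/^Camera model *: *//p'
}

# list_by_camera PATTERN FILE...  (hypothetical helper)
# Prints each file whose EXIF camera model contains PATTERN.
list_by_camera() {
    pattern=$1
    shift
    for f in "$@"; do
        model=$(jhead "$f" 2>/dev/null | camera_model)
        case $model in
            *"$pattern"*) echo "$f" ;;
        esac
    done
}
```

For example, list_by_camera iPhone *.jpg *.JPG would list the candidate files, which could then be moved aside before running the organizing script.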

15 May 2011

What does it mean to be a technologist?

At some point about half way through my career, 10 or so years ago, I started referring to myself as a technologist.

Sure, I got the usual "what a geek" comments early on - I was the guy that fixed friends' computers and saved up for a horrifically expensive mobile phone when they first came out.  Yes, I found myself taking technology night classes at the local Uni (pre-WWW, give me a break) and hacking on this and that for fun.  Of course I wrote automation scripts at work to reduce the tedious parts of the job, even though building the tools wasn't actually the job.

Then I got into technology management and was told I'd have to give all that up because good managers knew how to delegate individual contributor work and should spend their time managing.  That never felt right to me so I rebelled and sometime later started calling myself a technologist who also happens to manage stuff.  Some years later I think my much more technically savvy friends will call me more of a "wannabe" technologist, but that's ok.

About 2 years ago I wrote about what I do as a technology leader.  That blog entry was done in the spirit of trying to help me understand my responsibilities at the time and whether they felt right.  Maybe it also helped others understand the kinds of responsibilities they might have in the future if they wanted to lead technology teams or tune their own technology management responsibilities.  In hindsight, I also discovered (for myself anyway) two important ideas.  First, I uncovered the notion of own and do, concluding that I valued maintaining a balance between the two.  Second, I separated practical day-to-day responsibilities from leadership concerns.

So that brings me to today.  Earlier I stumbled on an article about Skills Every IT Person Needs.  It got me thinking about my old technology leader blog entry and what skills I thought every technologist needs, at least in my world.

So what is my definition of an IT person, or in my book, technologist?

As it turns out, I still think there are two sides, just like my view 2 years ago on technology leadership: a pure technology side and a softer side.  What follows is my definition of an effective "technologist".

On the pure technology side, I think "technologist" implies curiosity, passion and some base of knowledge and skill in four areas:
  1. Foundation technology that underpins the technology you work with regularly. "Yep, I know the basic building blocks of a computer and can use them to talk theoretical performance trade-offs."
  2. Specific technology areas you directly work with regularly.  "Let's talk about how we can performance tune this Apache server."
  3. Mainstream pop technology.  This is what your non-tech friends and colleagues read in the mainstream press and want to ask you about for an "expert's" view:  "Hey, can that security problem at Sony happen to us?".  It can also be just simple help requests to sort a home network problem or give advice on which smartphone to buy.
  4. Geek pop technology.  A grab-bag of technologies you learn about in geek press and from your geek friends at a very superficial level:  "So tell me, what the heck does the Large Hadron Collider do exactly?" and "Yeah, I can't really explain why I installed Android on my old iPhone."
Non-technical people will challenge technologists on the value in some of these four areas.  When tactically focused, items 1 and 2 clearly create value, so spending time in these areas is easy to justify.  Item 3 creates value when we help educate our non-tech colleagues around how to judge risk, make better tech decisions, and sometimes just lend them a hand to fix something for them.  Item 4 can be justified with learning new ideas that you can apply to items 1-3, but more often you'll just have to accept the tech guy is going to be nerdy sometimes for item 4.

As a career moves on, technologies change and perhaps the span of responsibility widens.  As a result it becomes increasingly difficult to stay on top of 1-4 above.  Difficulty is compounded by living in an age of ever-expanding and changing technologies.  A technologist copes with this by becoming more and more selective about what they dive into and how deep, as driven by how to best get things done - delivering the next product or service.  A good technologist learns their weak spots as well, and knows the areas that they simply can't dive into and instead should create leverage or ask for help.

On the softer side, I believe technologists also have a desire to produce something of value; that is, technology with a purpose.  Technologists want to build products or services that get used and appreciated.  Like taking an art appreciation class, technologists understand what they've delivered in depth and "get more out of it" when their products are used.

I fully accept that there are very successful people in technology that are not technologists.  They tend to manage things and not understand much about what they're managing, instead focusing on breadth and excelling by managing many things at once.  While this can work, a technology department needs to be careful here.  Having people managing something they don't understand is how projects can fail and businesses get sold third party products and services they don't need.  However, so long as there is someone technical around to provide targeted advice, the non-techs can create value and thrive.

However, let's say you're curious about what makes up a technologist, at least my definition of one.  Maybe you want to set up an induction program for your company's technology department for new hires.  Maybe you want to create a framework to provide guidance to less experienced staff to progress their career as a technologist.  What might you cover or recommend to them?

As inspired by Skills Every IT Person Needs, and to make this more practical, I've assembled the following checklist of knowledge and skills to be considered a technologist, both the harder and softer sides.

But first a few caveats to explain some biases below:
- I work with Internet and retail gambling systems which colors my world
- I'm pretty far removed from most technical details but that doesn't stop me from having things like "program something in Scala" on my to-do list
- I've skipped some of the items in the Skills Every IT Person Needs list that I certainly agree with, but weren't top of mind when I put my list together
- I've thought in terms of a for-profit business, but at least some apply to not-for-profit and joy-of-creating endeavors as well

Now, onto the checklist, as organized by technology category, followed by the softer attributes.

Foundation technology

  • Understand at a basic level what the major parts of a computer are.  Be able to open up a PC case and point them out: CPU, memory, disk, I/O, bus, clock.  Understand storage layers and trade-offs (cache-memory-disk-tape).
  • Understand how a browser works: HTML/CSS markup, Javascript/AJAX, HTTP, DNS, TCP/IP.
  • Understand a basic web delivery stack:  LAMP, Java, Microsoft.  For example: client side programming and markup; server-side page construction; business logic; database.
  • Understand why MVC specifically and separation of concerns generally is useful.
  • Understand a few high-use design patterns.  Facades and factories are useful.
  • Understand the difference between asynchronous and synchronous design.
  • Understand object oriented concepts: Encapsulation primarily, but inheritance and polymorphism as well
  • Understand how to design for fault tolerance: clustering
  • Understand why backups are important and some ways to implement them.
  • Understand a software development tool chain.  Use an editor and compiler to write and run some code. At least write a few simple scripts to automate something.  Be able to read simple code and understand the gist of what is going on.
  • Understand what a transaction is.  Atomic, ACID, and locking should be familiar terms.  Bonus points if you understand transaction performance implications and double bonus points for why database synchronized architectures are easier in the beginning but don't scale well later on.
  • Understand basic problem solving techniques.  Divide and conquer, process of elimination, hypothesis testing, change conditions and observe, 5 whys - there are plenty more.  Participate in hard problem solving sessions.

Specific technology areas

  • Develop at least a high level understanding of your software, systems, and infrastructure architecture.  Jump on the opportunity to be a sounding board for a colleague's architectural frustrations or contribute to a brainstorming session on how to improve the architecture.
  • Identify, offer to own, and deliver a solution to a hard problem.  Particularly a between-the-cracks, no-one-seems-to-own problem.  If you see someone struggling with a hard problem, ask them if you can help.
  • Learn something new about a relevant technology on a regular basis.
  • Understand what key departments do and how they philosophically differ: software development, project management, QA/Test, change/release, technical operations.  Understand the different approaches between development and operations.  Why are the best in these two areas wired quite differently from each other?  Why are QA and change/release such hard jobs?
  • Manage or help manage a change into production like a production push of fresh code.  Run a release plan early one morning.
  • Manage or help manage a crisis situation, an unplanned downtime.  Lead a team to finding the solution.
  • Build something that hits production, goes live, people use, and earns money for the business.  Build something you find interesting.  Know how many people use it, how much they like it and how much money it earns.  It's your right, you built it (* confidentiality concerns and third party handoffs can inhibit so best effort!).
  • When you're more junior, find an area of technology and make it your own.  Become the guru, the expert, the goto-person for that area.  As you progress your career, master a couple of areas as a guru.  When you're more senior, occasionally sharpen and leverage your knowledge in these areas and make a real hands-on contribution to a project.  Moving into management doesn't mean you should give up your guru status in any area, it just means you're not quite as good at it as you used to be (but you're still competent because you were a guru at one time!).
  • Know when your ignorance or skill deficit is hurting the business.  Go seek help, ask questions and train if you can.  A little business hurt is sometimes the cost of you learning - that can be ok, but be transparent about it.  Renegotiate your responsibilities if you simply can't do what's being asked of you.

Mainstream and Geek Pop Technology

  • Use some flavor of Windows and Unix at home.  Use what mainstream users use, at least once-in-awhile.  Manage your home systems - installs, upgrades, trouble-shooting, repairs.  Make your own backups.  Recover from a failed disk.
  • Be willing and able to fix office issues (PC, printer, or basic network) when you're visiting someone's office or a friend's home.
  • Security.  Be able to clean a friend's system of viruses and malware.  Read the mainstream articles about security failures so you can proactively talk about them with concerned colleagues.  Understand the basics of how criminals break into systems and steal data.
  • As for Geek Pop - that's up to you.  Let your interests be your guide.

Soft skills - non technical attributes that will help you create value

  • Be able to communicate.  Document, email, blog on a topic.  Create sustainable business value through creating durable enterprise knowledge using tools like wikis.  Be able to speak and present effectively in various situations from one-on-one to large audiences.
  • Understand when it's time to fix things for the short term or the long term.  Both are often important but sometimes you have to pick and recommend because you're best placed to do so.
  • Learn how things get done.  Who can you ask to do things?  Who controls priorities, allocation of resources and budget?
  • Understand how the business makes money.  You should understand how your daily work enables the business to interest customers and earn money.  What do you need to know to make intelligent decisions and prioritize tasks on a daily basis to help increase revenue?
  • Understand a process you're involved in end-to-end.  Rebel against the process if it's broken.  Work from within it to improve it.  Work with the people in the process to really own the process, customizing and optimizing it to the team's needs.  Make it hum.
  • Pay attention to your balance between managing and doing.  Managing is overhead, sometimes but certainly not always necessary.  Doing is creating tangible value.  If you find yourself just forwarding email back and forth, have the confidence to remove yourself from that path.  You're not creating value.
  • If you want to change jobs, perhaps to one with more seniority, then just do the job.  Too often people get hung up on not doing what they're not being paid for even though they want the job.  Consider it an investment.  If you do the job well, recognition will follow.  It has to.
  • Don't ever apologize for being a technologist.  You should take great pride in knowing what you know.  However, try to understand that what excites you really doesn't do it for most non-technologists.  Find a few mainstream topics that interest you that you can talk about.  Home science projects don't count.

There are many views of what it takes to be a technologist.  For me it comes down to curiosity, selective understanding of the details, trying to make things better, and creating and delivering products and services that customers like.

03 April 2011

From laptop to iPad

A few weeks ago I decided to leave my Mac OS X laptop in the rucksack for a few days and just use an iPad for everyday use as a test of the iPad's viability for typical daily use.  The work included some airplane travel and a visit to a remote office.  I also had an external Bluetooth keyboard with me in case I had to do a lot of heads-down writing.

The following is an examination of why the iPad doesn't yet fill my laptop's boots.

The Basics

Browsing generally works fine, even a few fiddly banking sites.  I miss being able to save sites to a storage area for later reading.  While there are services like Instapaper, I want to save pages to a place where they can be search indexed (a filesystem!).

Email isn't a replacement for laptop email.  Most fundamentally, you can't create a new folder in your IMAP folder set.  That's a deal-breaker.  You also can't restore from trash or mark junk email.  And you can't meaningfully search email content from within the email app.  Lack of deep content search is another deal breaker.

Skype works fine, except no video (I'm on an original iPad).

Flipping between local country SIM cards for data services as I moved between countries worked fine.  New APN settings popped in without any manual intervention.  I sometimes have to reboot between SIM changes or at least flip the iPad in and out of flight mode for the new SIM to start working.  Annoying, but not earth-shattering.

Note Taking

I tried using mail messages, Plaintext, and Notes for taking notes.

I ended up primarily using "Notes".  Flipping between notes is useful.  The iPad is particularly good in meetings, especially one-on-ones as it doesn't get in the way between you and others like the display of a laptop does.

I tend to eventually convert Notes to emails for long-term archiving in IMAP, or text snips are extracted to put in docs.

However, Notes isn't perfect.  There are problems with how Notes synchronizes between devices and being on and off line.  I commonly end up with multiple copies of the same note even though I've only been using one device to edit the note with.

Plaintext's Dropbox integration is compelling.  If Plaintext supported rtf files (the default file type of OS X's TextEdit app) and some basic markup, I'd probably switch to it from Notes as my primary note taking tool.

Taking notes in the email app isn't viable because the message you're writing effectively locks down the email app interface.  I need to be able to quickly flip between taking notes and checking email without having to save and restore my notes from the draft folder.

It would be very useful to have some basic text markup capabilities in Notes, Mail and Plaintext such as lists, bold, and highlight.

Offline usage

A lack of access to files is annoying.  iOS forces a strong siloing of data by application with some sharing, but not uniformly enforced.  Flexible cut-and-paste and an underlying filesystem are very useful and general purpose ways to share information between apps and neither exists on iOS.

To an extent, dropbox and me.com can act as a filesystem.  However, without control over caching policies, being able to pre-load the cache, and lack of universal integration, they just don't cut it as a real filesystem substitute.

Ideal integration for me would be between apps like mail, Plaintext, and iWork with Dropbox and iDisk working properly under them.  In particular the iWork apps (equivalents to Pages, Numbers, and Keynote) work very poorly with iDisk, much less Dropbox.

For example, when I open an Excel doc via Numbers from an email attachment, it's a one-time shot.  The file is not stored in a central, synchronized spot if I make changes.  I have to remember the doc is being held in Numbers and extract it later.

Network/Cloud Storage - dropbox and iDisk

Dropbox continues to wow me and iDisk continues to disappoint.  I'm continuing to put more and more into Dropbox.  About the only items I don't have in Dropbox now are large storage requirements items such as my local storage mail archives, iPhoto database, software downloads, Music, Sites, and vmware+images.  Adding all of these would mean a significant annual cost for dropbox.

iDisk continues to be painfully slow and I'm not fully confident of its syncing (n.b., I just found another legacy folder structure artifact with no files under it that shouldn't have been there).  I continue to use iDisk for backups only as I still don't trust it or find it sufficiently usable for primary storage.

Even if iDisk synchronization and app integration is radically improved, it still suffers from poor adoption and multi-device availability as compared to Dropbox.


The iPad needs some way to dock to a keyboard, monitor and mouse in a generalized (not app by app specific) way.  Otherwise, creating and maintaining Numbers, Keynote and Pages documents on the iPad is pretty-much out of the question as the screen is too small.  At best the iWork applications can be used for quick last-minute edits.

The lack of meaningful search is a deal killer.  I really miss search as directly integrated into Mail and the Desktop.  Search is likely a resource hog to maintain - memory, disk, power.  The iOS desktop search is just too shallow, not readily available and focused within apps.

I'd really like cache management under explicit control.  Email, attachments, Notes, Dropbox, and iDisk.  I'd like to be able to "preload cache" when I have a fat wifi connection available.  I'd like to be able to specify the maximum size of each cache.  I'd like to enable "background" caching so the app isn't restricted to only caching when it's in the foreground.

There is no undo or ^Z in mail (and I assume most other apps) in case I accidentally delete something.

Pages was so slow as to be effectively unusable on a 20 page MS Word document.

I often encounter links to pages I want to save for later review.  They can be from email, a news site (e.g., BBC), general surfing (e.g., Safari), tweets (e.g., tweetdeck), or a feeds manager (e.g., flipboard).  I want to save these pages to a place that is search indexed (e.g., via spotlight).  I end up having to mail the link to myself and browse/save the page later from my laptop.  I don't consider instapaper or Evernote to be adequate.

On the positive side, the convenience due to the iPad's size is tough to beat.  It strikes a good balance between portability and usability.  You can unobtrusively carry and use it almost everywhere.  And the long battery life means a full day of use between charges.


Switching to full-time use of an iPad would be possible if:
  • Apple significantly improves their synchronization capability, both with IMAP (Notes) and iDisk.
  • Apple provides a better note-taking tool with some basic formatting, ability to flip quickly between notes, and integration to Dropbox (or an iDisk that works as well as Dropbox).
  • Add Undo in a standard way to most apps that edit, move or delete objects.
  • Add a pervasive search function that is available within applications and implicitly scopes its search to that application data.
  • Be able to create IMAP folders in mail
  • Better, explicit cache management for cloud storage
  • Better way to save web pages to a centralized search-indexed location
The new iPads have been out a few weeks as I write this.  They should help with a few of the issues I mentioned above (Skype video, speed), but most issues are software and iOS design related.

Dropbox continues to shine as the best network, synchronized, shared, multi-device/OS filesystem on the market.  Adding improved cache controls by device would be a great feature addition.  iOS (or any) app makers would be wise to provide seamless integration to Dropbox for cloud file management.

Given all these iPad/iOS issues, the iPad remains appropriate only for more casual and mobile meetings.  It is still only a supplement for my laptop, not a replacement.

(For completeness, here are other apps I'm using regularly during a typical week for work.  They all function adequately:  Lonely Planet City Guides, FlightTrack, Flipboard, Kindle, PlainText, Tweetdeck (add LinkedIn integration!), Skype, and GoodReader.)

19 March 2011

Tightening the Definition of SaaS and Cloud

I've recently been exposed to two vendors offering "cloud" and "SaaS" options to replace two in-house legacy enterprise/corporate (not customer facing production) systems.

In this process, I connected some mental dots that there are really a few flavors of SaaS, and the distinction is quite important with respect to enterprise architecture.

The two service offerings can be roughly thought of in this way:
  • The offerings were touted as SaaS and cloud
  • New software that is better than our current in-house legacy systems (regardless of whether we host or they are "in the cloud")
  • The software is hosted by the software provider, unknown what type of "cloud" IaaS is under that provider, if any (perhaps just virtualization in their own DC).
  • The software instance is spun up by the provider specifically for us.  It is a copy of the software, dedicated to us.
  • The software can be extended a lot - add-on modules can be activated through configuration changes, bespoke modules/code can be added.  Kinda-sorta like a pick-and-mix or evolving PaaS model
  • Software upgrades must be rolled out with associated consideration of any bespoke changes that have been made.
  • Security restricted to only be available within your corporate intranet
  • Flat monthly rate per user charging model with volume (# of users) price breaks
As the two service reviews went on, the dots finally connected, and I realized I had been *marketed* to more effectively than I'd like to admit.

The above isn't "cloud" or SaaS, at least not with the definition I'm going to take here.  It is actually a hosted managed service offering (MSP or ASP).  At best it's a halfway-house to cloud and SaaS.  All you've really done with this approach is shift some techops and infrastructure responsibilities from in-house to the service provider and reduced your in-house economies of scale (assuming you have to maintain those skills).

For something to be a cloud/SaaS offering in my terms, here is what it needs to be:
  • Public Internet facing
  • One centralized installation shared by many customers
    • Powering the service is an IaaS
    • Can quickly scale up/down with virtually no cost to make the change (costs changing proportional to increased/decreased usage)
    • Horizontal fault tolerance design (HW redundancy becomes irrelevant)
  • Focused offering
    • Service addresses a specific functional requirement, it isn't an omnibus offering
    • Vibrant user community making suggestions of how to improve the product
    • Quick time to market for new features
    • Strong product management and vision
  • Product improvements put live appear immediately for all customers
    • One exception: a "beta" version may be opted into by the customer, but strictly under the customer's, not the vendor's, control
    • No rolling upgrades for each customer once a new release is ready
  • A complete set of APIs ("API as a storefront")
    • Almost all functionality available via the application is available via API
    • Well documented
    • Hardened (API security, rate limits, et al)
    • Ready for mash-up integration with other focused offerings
  • Usage based billing
    • Proportional to amount of computation, storage, and connectivity you use (IaaS transparency)
    • Additionally factoring in the value of the SaaS itself
    • No billing related to seats, users, or CPU cores
In noting the difference between the two, I'm not advocating one or the other.  The choice of course depends on circumstances and strategy.  I'm also making no effort to address the common enterprise concerns of cloud such as security, data ownership, and business continuity.  However, I do have a very strong view which way the IT world is going and given the choice, I know which I'd select.

18 March 2011

Conclusions from Betfair's Outage

Niall Wass and Tony McAlister of betfair recently published a summary of betfair's 6 hour outage on 12 March 2011.  What follows is a review of their analysis.

Most of betfair's customers will have no idea what Niall and Tony are talking about.  "This [policy] should give maximum stability throughout a busy week that includes the Cheltenham Festival, cricket World Cup and Champions League football" is about the only non-tech part of the article that their customers can relate to.  However, for us technologists, the post provides some tasty detail so we can learn from others' mistakes.

The post is consistent with a growing and positive trend of tech oriented companies disclosing at least some technical detail of what happens to cause failures and what is to be done about it in the future.  Some benefits from this approach:
1. Apologize to your customers if you mess them about - always a good thing to do (easyJet or Ryanair - I hope you're reading this).  Even better is to offer your customers a treat - unfortunately betfair only alluded to one and didn't provide a concrete commitment.
2. Give public sector analysts some confidence that this publicly traded company isn't about to capsize with technical problems
3. Receive broad review and possibly feedback about the failure.  Give specialist suppliers a chance to pitch to help out in potentially new and creative ways.
4. Drive internal strategy and funding processes in a direction they otherwise might not be moving.

Level of change tends to be inversely proportional to stability.  "In a normal week we make at least 15 changes to the Betfair website…".   This is a powerful lesson that many non-tech people do not understand - the more you shove change into a system, the more you tend to decrease its stability.  This statement also tips us off that betfair has not adopted the more progressive devops and continuous delivery trends for pushing change into production more safely.

The change control thinking continues with "… but we have resolved not to release any new products or features for the next seven days".  This is absolutely the right thing to do when you're having stability issues.  Shut down the change pipeline immediately to anything other than highly targeted stability improvements.  Make no delivery of new features a "benefit" to the customer (improved stability) and send a hard statement to noisy internal product managers to take a deep breath and come back next week to push their agenda.

Although betfair might not be up on their devops and continuous delivery, they have followed the recent Internet services trend of being able to selectively shut down aspects of their service to preserve other aspects:
- "we determined that we needed our website 'available' but with betting disallowed"
- "in an attempt to quickly shed load, we triggered a process to disable some of the computationally intensive features on the site"
- "several operational protections in place to limit these types of changes during peak load"

Selective service shutdown is positive; it hints that:
1. The architecture is at least somewhat component based and loosely coupled.
2. There is a strategy to prioritize and switch off services under system duress.
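As an illustration only, here is one way a prioritized "switch things off under duress" strategy could look in Python.  This is not betfair's implementation; the `Feature` and `FeatureRegistry` names, the feature names, and the priority numbers are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    priority: int        # lower number = more important to keep alive
    enabled: bool = True


class FeatureRegistry:
    """Tracks which features are on and sheds the least important ones first."""

    def __init__(self, features):
        self._features = {f.name: f for f in features}

    def is_enabled(self, name: str) -> bool:
        return self._features[name].enabled

    def shed_load(self, keep_priority: int) -> list:
        """Disable every enabled feature less important than keep_priority.

        Returns the names of the features that were switched off by this call.
        """
        shed = []
        for feature in self._features.values():
            if feature.enabled and feature.priority > keep_priority:
                feature.enabled = False
                shed.append(feature.name)
        return sorted(shed)


# Hypothetical features and priorities, loosely modelled on the quotes above.
registry = FeatureRegistry([
    Feature("site_browsing", priority=0),
    Feature("betting", priority=1),
    Feature("live_graphs", priority=2),
])

print(registry.shed_load(keep_priority=1))   # drop the computationally intensive extras
print(registry.shed_load(keep_priority=0))   # site stays 'available' with betting disallowed
```

The design choice worth noting is that the priorities are decided calmly in advance, so nobody has to argue about what to turn off in the middle of an incident.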

The assertion that betfair spent several hours verifying stability before opening the site to the public suggests bravery under fire.  "We recovered the site internally around 18:00 and re-enabled betting as of 20:00 once we were certain it was stable".  There must have been intense business pressure to resume earning money once it appeared the problem was solved.  However, during a major event, you want to make sure you're back to a stable state before you reopen your services.  A system can be in a delicate state when it is first opened back up to public load levels (e.g., page, code and data reload burden), which is one reason why we still like to perform system maintenance during low-use hours: the slam of customers through the opening doors when the website/service reopens is at least minimized.

The crux of the issue appears to be around content management, particularly web page publication.  Publishing content is tricky as there are two conditions that should be thoughtfully considered:
- Content being served while it is changing, which results in "occasional broken pages caused by serving content" and here-and-gone content, where content has been pushed to one server but not another
- Inconsistency between related pieces of content (e.g., a promotional link on one page pointing to a new promotion page that hasn't been published yet)

It appears that betfair's content management system (CMS) is neither async nor real time: "Every 15 minutes, an automated process was publishing…".  Any system designed with hard time dependencies is a time bomb waiting to go off, with the trigger being increasing load: "Yesterday we hit a tipping point as the web servers reached a point where it was taking longer than 15 minutes to complete their update".  A lack of thread-safe design is another indicator of a lack of async design, which tends to enforce thread safety: "servers weren't thread-safe on certain types of content changes".  A batch, rather than on-demand, approach is also symptomatic of the same design problem: "Unfortunately, the way this was done triggered a complete recompile of every page on our site, for every user, in every locale".  The design is therefore likely not an async on-demand pull model but rather a batch publish model.
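The tipping point can be shown with a little arithmetic.  The Python sketch below assumes a timer that blindly starts a new publish run every period whether or not the previous one has finished; the 15 minute period comes from the post, while the run durations and the function name are made up for illustration.

```python
import math


def peak_concurrent_runs(period_min: float, run_min: float) -> int:
    """Peak number of publish runs in flight at once.

    Assumes a new run starts every period_min minutes regardless of whether
    earlier runs have finished, and that every run takes run_min minutes.
    """
    if period_min <= 0 or run_min <= 0:
        raise ValueError("period and run duration must be positive")
    return math.ceil(run_min / period_min)


PERIOD_MIN = 15  # publish interval quoted in the post

for run_min in (10, 15, 16, 31):  # hypothetical run durations as load grows
    runs = peak_concurrent_runs(PERIOD_MIN, run_min)
    print(f"run takes {run_min:>2} min -> peak concurrent runs: {runs}")
```

Anything above 1 means runs are stacking on top of each other, each one adding load that makes the next one slower still.  Skipping a run while the previous one is in flight, or publishing on demand rather than on a timer, would avoid the pile-up.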

The post concludes with a statement of what has been done to make sure the problem doesn't happen again:
1. "We've disabled the original automated job and rebuilt it to update content safely" - given the above design issues, while thread safety may have been addressed, until they address the fundamental synchronous design, I'd guess there will likely be other issues with it in the future.
2. "We've tripled the capacity of our web server farm to spread our load even more thinly" - hey, if you've got the money in the bank to do this, excellent.  However, it probably points to an underlying lack of capacity planning capability.  And of course, every one of those web servers depends on other services (app server, caches, databases, network, storage, …) - what have you done to those services by tripling demand on them?  Lots of spare capacity is great to have, but can potentially hide engineering problems.
3. "We've fixed our process for disabling features so that we won't make things worse."
4. "We've updated our operational processes and introduced a whole new raft of monitoring to spot this type of issue." - tuning monitoring, alerting, and trending system(s) after an event like this is crucial
5. "We've also isolated the underlying web server issue so that we can change our content at will without triggering the switch to single-threading"

And here are my lessons reminded and learned from the post:
- If you're having a serious problem, stop all changes that don't have to do with fixing the problem
- Selective (de)activation of loosely coupled and component services is a vital feature and design approach
- Make sure the systems are stable and strong after an event before you open the public floodgates
- Synchronous and timer based design approaches are intrinsically dangerous, especially if you're growing quickly
- Capacity planning is important, best done regularly, incrementally and organically (like most things), not in huge bangs.  One huge bang now can cause others in the future.
- Having lots of spare capacity allows you to avoid problems… for a while.  Spare capacity doesn't fix architectural issues, just delays their appearance.
- Technology is hard and technology at scale is really hard!

Niall and Tony, thanks for giving us the opportunity to learn from what happened at betfair.

05 March 2011

The Trial Environment - Innovation Infrastructure with an Enterprise wrapper


A "trial" environment is a high risk production environment that sits within a low risk Enterprise environment.

Circumstances that might drive you to set up a trial environment:
  • Business has revenues derived from enterprise production systems it wants to protect through risk management, but...
  • Business wants to move fast and be innovative, and...
  • Business wants to work with third parties, some of which are "two-guys-and-a-dog" start-ups who can't afford to focus on making their systems enterprise friendly.
The trial environment helps a technical team to balance these potentially conflicting requirements and deliver both risk-managed and risk-embracing services into the business.

(NB: Like most articles on this blog, the trial environment was conceived within a context of internet delivery systems and online gambling.  Please keep that context in mind as you read.)


But why would a production IT service want to let lunatic startup companies and me-me-me product managers into a carefully risk managed production environment?

One reason is to foster innovation.  New innovative products tend to focus on the core features and not the "-ilities" such as scalability, stability, and security.  The new product team shouldn't be spending time on having to specify and justify a large enterprise environment when there are crucial features to be coded.  If we can provide infrastructure and costs that play by the same rules as cheeky little start-ups, we can limit their ability to end-run us.

Another reason is that from a business case perspective there is no reason to spend on big expensive kit to meet fanciful revenue forecasts.  It's much better to trend off of real data and provide an environment that can scale to a medium level quickly.

But mostly it's just nice to be able to say "yes" when the panicked bizdev guy comes over to you in desperation so he can close a deal tomorrow, as opposed to "that will take 6 sign-offs, 3 months to order the kit, and will cost £400,000."  Operational process and cost intensity should match as closely as possible to revenue upsides and product complexity.

The Business Owner's Point of View

How the trial environment is presented to the business, perhaps product managers, that have to pay for it:
  • Suitable for working with small, entrepreneurial, and/or external companies/teams
  • You can move quickly with it
  • Fewer sign-offs, less paperwork
  • Cheap (after a base environment is set up)
  • Enables you to focus on initial product bring-up and delivery and not overspend on an unproven product.
  • It's billed back to your project as you use it so no big up-front costs; if your project stops, we don't have leftover dead kit
  • Suitable for lower concurrent users and transaction volumes
  • Good for proof of concept projects - if project not signed off, no big capital investment
  • More risky (less stable, scalable, secure) than our enterprise environment
  • Your first point of contact if there are technical problems is the small entrepreneurial company you're working with and not IT support
  • Not particularly secure
  • Not PCI/DSS friendly (so don't store related data or encode related processes in trial)
  • Only small to medium sized products can use trial - we only have so much capacity standing by
  • If there is a failure in the trial environment, it will generally be the responsibility of the third party to fix it.  We won't know much about it.  We'll only take care of power, connectivity and hardware.
  • At a practical level, a failure in the trial environment might mean several days of downtime
  • If your revenue goes up for a product running from trial, we recommend it's moved from trial to enterprise.  That will be your call for you to manage your revenue risks.
  • A new product that is failing will still accrue operational costs.  Pull the plug if you need to; with trial, shutting down a member environment is trivial.
What happens if things take off for a product in the trial environment?  It's up to the small team or company the product manager is working with to identify this and initiate a project to "enterprise" their product.

From the Entrepreneurial Point of View

How the trial environment is offered as a production option to small, entrepreneurial, and/or external companies or teams:
  • We'll give you an infrastructure that you're comfortable working with that doesn't have the usual enterprise computing overheads
  • We'll take responsibility for deploying and fixing the hardware, power, and connectivity - everything else is yours.
  • Quickly receive the server or servers you need to get your product going - no paperwork and waiting around for kit to show up
  • We only have a few types of servers on offer - likely a "small" one for web/app servers and a "big" one for a database server.  We'll recommend some options if you're not sure what you need.  The servers are not redundant, fault-tolerant kit.  If you want that in trial, you'll need to build it into your application.
  • Tell us what OS you want. We have 3 standard OSs (Linux, Solaris, and maybe Windows) and if you want something else it's going to be more difficult for everyone.
  • Tell us how much storage you need.  You'll get a little bit local on the server and a flexible capacity will be mounted on your server.  The flexible capacity can grow over time without any retooling or paperwork.
  • Tell us how much network capacity you'll need.  We'll QoS at that level.  Maybe no bursting allowed.
  • Your servers will be on their own subnet, just one flat LAN for everything.  No DMZ or multi-layer firewalls.
  • You get a firewall in front of you; tell us what inbound and outbound ports you want open for each server you request.  80, 443, and 22 are easy for us, everything else will make us raise an eyebrow.
  • Beyond simple firewalling, you manage your own security, e.g.,  locking down ports/services and OS patching
  • No content switch.  They're expensive and you're clever enough to use Apache to figure that out I'm sure.
  • Put your own monitoring in place, we're not going to watch it for you.  If you need to go from a "small" to "large" server or need more servers, you'll need to let us know.
  • Put your own backups in place.  Specify some flexible storage for them on one of the servers.  We won't be backing up anything.
  • All change control sits with you.  We have no oversight.
  • No remote hands provision is expected to be required.
  • If you're doing anything that affects production or other members of trial, your servers will be powered down immediately.
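To make the offer above concrete, here is a rough Python sketch of how a server request from a trial member might be checked against it.  The standard OS list, the two server sizes, and the "easy" ports come from the bullets above; the `ServerRequest` and `review` names and the wording of the concerns are assumptions for the sketch, not a real tool.

```python
from dataclasses import dataclass, field

STANDARD_OS = {"linux", "solaris", "windows"}   # the 3 standard OSs on offer
SERVER_SIZES = {"small", "large"}               # the only server types on offer
EASY_PORTS = {22, 80, 443}                      # anything else raises an eyebrow


@dataclass
class ServerRequest:
    size: str
    os: str
    inbound_ports: set = field(default_factory=set)
    outbound_ports: set = field(default_factory=set)


def review(request: ServerRequest) -> list:
    """Return the concerns the operations team would raise; an empty list means easy."""
    concerns = []
    if request.size not in SERVER_SIZES:
        concerns.append(f"unknown server size: {request.size}")
    if request.os.lower() not in STANDARD_OS:
        concerns.append(f"non-standard OS: {request.os}")
    unusual = sorted((request.inbound_ports | request.outbound_ports) - EASY_PORTS)
    if unusual:
        concerns.append(f"unusual ports: {unusual}")
    return concerns


print(review(ServerRequest("small", "Linux", inbound_ports={80, 443})))
print(review(ServerRequest("small", "FreeBSD", inbound_ports={80, 5432})))
```

A request that comes back with no concerns is one that can be fulfilled without a conversation, which is the whole point of keeping the menu small.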
The Production Operations Team Point of View

How the trial environment is managed by the production operations team:
  • Beneath the edge network, the trial environment is on hardware fully separate and distinct from production.
  • Production operations owns and is responsible for the hardware, network, and power - both initial and on-going.  We provision a base OS and hand over the keys to the product team.  That's it.
  • Fairly generous SLA on responding to HW, network, power failures reported to production support.
  • The trial environment is ideally implemented with some type of in-house cloud service and/or VMware.  If that's not possible, you'll have to manage by-box inventory so that you always have a few unused boxes of each type ready to commission.  Effective maintenance of slack and procurement to backfill is essential.
  • Create two server types, small and large.  Decide on cores, memory, disk space for each.  You will need to change this view over time, so re-evaluate it every 6-12 months.
  • Establish maybe 3 standard OS installs.  We don't own patching or securing the OS.
  • Use a SAN to enable flexible filesystem provisioning
  • Fixed maximum allocation of internet bandwidth for all members of trial, then fixed allocation to each member.  No trial member should be able to stomp on other trial members or anything in production.  QoS implemented.  Bursting is debatable.
  • Dedicated edge firewall.
  • Network to enable multiple subnets, one for each different user of trial.  Each user of trial can generally see only their own network and servers.  Holes/routing between subnets and between enterprise and trial subnets may be conditionally opened for API (and only API; no e.g. DB) access.
  • No content switch, load balancer
  • No backups
  • No hardware RNG
  • Some Single Points of Failure ok
  • We own firmware updates for hardware
  • We don't monitor or alert on any virtual servers.  We do monitor and alert on underlying hardware, including the network kit and SAN.
  • May use a second tier hosting location for trial kit
  • It might be possible to use older kit being decommissioned from production for the trial environment.  While this would likely increase day-to-day operational costs (heterogeneous and older kit), it would bring down initial capital investment in trial.  Also consider used/refurb kit.  Think cheap.
  • Keep a basic overview of trial and its services updated on the intranet.  Make sure all product managers and bizdev types are educated about it.
  • Periodically review trial usage with each business owner.
As the production team evolves the trial environment offer, it's likely that some of the "we don't do this in trial" items above will change as cost effective and lightweight ways are found to deliver them into trial.  Possible examples are backups, more sophisticated networking (load balancer), a provision of fault tolerant disk on server instances, or a shared (between trial members) database instance.

Other Considerations

Some other things to keep in mind to make the trial environment successful:
  • Aspects of production may be accessed via e.g. an API.  This introduces a point of vulnerability to production.  There are good design practices for hardening APIs that are exposed to trial such as logging, monitoring, authentication, rate limiting, and kill switches to protect what is on the production side of the API.
  • It may be cost efficient to spin up a single instance of an expensive service (e.g., Oracle) that can be shared between multiple trial members.  This introduces a fair amount of complexity to manage the DB itself including QoS, security, and change control.
  • The trial environment won't build and run itself for free.  Technical operation staff are required.  The number of staff should be proportional to level of change and size of the environment.
  • If a third party is involved, they must have an internal business representative who understands their product or service and champions it regularly.  A bizdevy, bring-the-external-party-in, hurl-over-the-wall-to-IT-ops approach doesn't work.
  • A product or service typically requires other functional contributions as well:  game platform operations, marketing, account management and sales, handling of amended contracts for the new product, on-going product management to improve the product, and website integration and updates.
  • Trial could also be used to spin up staging or pre-production test environments.
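Two of the API hardening practices mentioned in the list above, rate limiting and a kill switch, can be sketched together in a few lines of Python.  This is a token bucket under assumed numbers; the `GuardedApi` name, the rate, and the burst size are hypothetical, and a real deployment would add the logging, monitoring, and authentication also listed above.

```python
import time


class GuardedApi:
    """Token bucket rate limiter with a kill switch, guarding a production API."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec        # tokens added back per second
        self.capacity = burst           # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.clock = clock              # injectable so the limiter can be tested
        self.last = clock()
        self.killed = False

    def kill(self) -> None:
        """Cut the trial member off from production entirely."""
        self.killed = True

    def allow(self) -> bool:
        """Return True if this call may proceed to the production side."""
        if self.killed:
            return False
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical limits for one trial member: 5 calls per second, bursts of 10.
guard = GuardedApi(rate_per_sec=5, burst=10)
print(guard.allow())
guard.kill()
print(guard.allow())
```

One guard per trial member keeps a misbehaving product from stomping on production or on anyone else in trial.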

The trial environment can be used to provide a low cost alternative for startup, experimental, speculative, and just plain insane product ideas.  It's a hosting option that edgy product managers and bizdevs will like because of the lightweight commitment and speed of delivery.  Entrepreneurial teams and startups will like it because it'll feel like something they're used to and won't slow them down.  The production support team may feel uncomfortable with it initially because trial violates a lot of "best practices" in production.  But in the long run they'll see how it becomes a business enabler that fosters innovation in a cost effective way.

Good luck and let me know if you manage to establish a trial environment in your shop!

Pesky winmail.dat with Outlook, Apple Mac OS X Mail and me.com IMAP folders

I use me.com's IMAP folders to file email as part of an inbox zero policy I inflict on myself.  I recently had to resume using Exchange 2010 based email, but unfortunately not (yet!) directly connected to Apple Mac OS X Mail.  To access the Exchange mail, I've elected to use Outlook inside a Windows 7 VM.  I then hooked my me.com IMAP folders into Windows Outlook so the Exchange and me.com folders are side by side in Outlook.  I then added an Outlook rule to copy new email that hits my Exchange Inbox to my me.com Inbox.

This worked ok until I opened my first message in OS X Mail that had an attachment.  Instead of a normal attachment, I instead found a "winmail.dat" file.  Funny, I'd not seen one of those in a long time.

After some digging, I came across two ways to deal with the winmail.dat attachment problem:

1. The first and not recommended choice is TNEF's Enough.  The price is right at free/donations.  However, its usability is awkward as you are required to open the *.dat file in a separate application that unpacks the TNEF format file from Exchange/Outlook and gives you the option to save the file.  I tried it, too painful.

2. The second and recommended choice is Letter Opener.  It costs between USD 30 and 50.  It is seamlessly integrated with OS X Mail so you don't ever see a *.dat file; you see the attachments as you would in Outlook.

So if you stumble on this problem and need to consider TNEF's Enough versus Letter Opener, I'd recommend Letter Opener.