transmissions from a free roaming agent of kaos: 2010

07 June 2010

Enabling GNUPG (PGP) with Apple OS X mail.app

(Postnote 2011-03-05: Don't waste your time on the below. Just go directly to gpgtools mail, read the instructions, and get on with it. It's been updated to work with OS X 10.6 and Mail 4.4. Just tested it, works great.)

I am so not an expert on PGP, GNUPG (GNU Privacy Guard) or OS X's mail.app. But what I can do is explain how I got the basics of PGP working with Mac mail and some resources that helped.

If you don't know anything about PGP or want more detail, see the "Learn more" section at the end of this post.

The following worked for Mac OS X 10.6.3 and mail.app 4.2.

1. Install GNU's Privacy Guard (gnupg).

You need to have MacPorts installed. Install it if you don't have it.

sudo port install gnupg

2. Generate your encryption key.

gpg --gen-key

Here are the options I used:



1. Option 2: DSA and Elgamal

2. Keysize: 3072 (that was the biggest keyvalue offered)

3. 0, key does not expire

4. Key identification

 Real name: Jeff Blogs

 email address: jeffblogs@dodgymail.com

 No comment

5. Passphrase "something memorable yet complicated and long, don't share it with anyone, and don't forget it"

Your ~/.gnupg directory of configuration and databases gets set up.

3. Install the magic mail.app bundle

The bundle contains a version of GPGMail that works with OS X 10.6.3.

Exit mail.app.

mkdir ~/Library/Mail/Bundles # if it doesn't exist already - mine didn't

Be thankful for clever, helpful and giving people and download the bundle.

Extract from zip download and deposit GPGMail.mailbundle into ~/Library/Mail/Bundles

From the command line as the user you run mail with (not root!):

defaults write com.apple.mail EnableBundles -bool true

defaults write com.apple.mail BundleCompatibilityVersion 3

Start mail.app.

You should now have a PGP option in your mail menu (Message->PGP).

Mail.app menu with new PGP option

You should also see a PGP toolbar when you create a new email:

New PGP toolbar appears when composing a new email

(This step was the silver bullet from macrumors.com forum with an updated GPGMail from Lukas Pitschl - thank you!)

4. Create your public key.

From command line:

gpg --armor --output "Jeff Blogs.asc" --export jeffblogs@dodgymail.com

You'll need to send people your public key if you want them to send encrypted email back to you.

5. Add other people's public keys

gpg --import "Ronald McDonald.asc"

At this point you should now be able to send and receive PGP encrypted emails and mail.app will be reasonably supportive of you.

I found regularly restarting mail.app is useful when fiddling with gpg at the command line.

6. Set yourself up with a verified key service. This will decrease warnings from mail and GNUPG.

Set yourself up with pgp.com.

Use the name and email address you used to generate your key in step 2 above.

Add the verified key service key:
gpg --import keyserver1.pgp.comGlobalDirectoryKey.asc

Let GNUPG know about the pgp.com key server. Edit ~/.gnupg/gpg.conf and uncomment "keyserver ldap://keyserver.pgp.com" line.
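
If you'd rather not hand-edit the file, here is a minimal Python sketch of the same edit. It assumes your gpg.conf contains the line exactly as quoted above with a leading "#" (yours may differ, so check first). The function names are mine, not part of GnuPG.

```python
from pathlib import Path

# The line quoted in the post; assumed to exist, commented out, in gpg.conf.
KEYSERVER_LINE = "keyserver ldap://keyserver.pgp.com"


def uncomment_keyserver(conf_text: str, target: str = KEYSERVER_LINE) -> str:
    """Return conf_text with a commented-out copy of `target` uncommented."""
    out = []
    for line in conf_text.splitlines():
        is_comment = line.lstrip().startswith("#")
        # Drop leading '#', spaces and tabs to see what the line would say.
        if is_comment and line.lstrip("# \t").rstrip() == target:
            out.append(target)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def enable_keyserver(conf_path: Path) -> None:
    """Save a .bak copy next to gpg.conf, then rewrite it with the line enabled."""
    original = conf_path.read_text()
    conf_path.with_name(conf_path.name + ".bak").write_text(original)
    conf_path.write_text(uncomment_keyserver(original))


# Example (not run automatically):
# enable_keyserver(Path.home() / ".gnupg" / "gpg.conf")
```

Other commented-out keyserver lines are left alone, and running it twice changes nothing the second time.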

(You're restarting mail.app between these steps right?)

7. Learn more!

These were helpful to the above:

knuthbert.com - How to use GPGMail with Mac OS X 10.6 (Snow Leopard)
sente.ch - PGP for Apple's Mail

These might have been helpful if they weren't really long, complicated, out of date, didn't work and I didn't already have the basic idea of how PGP was supposed to work:

linuxmafia.com - An overview of GNUPG and PGP brief
gnupg.org - The information mothership, good luck

And of course GPGMail itself, which doesn't work with current versions of Snow Leopard and mail.app.

-----

2010-06-19 Postnote: The latest OS X upgrade to Mail 4.3 disabled gpgmail. Two things to fix this:

1. Copy GPGMail.mailbundle from "~/Library/Mail/Bundles (Disabled)" to ~/Library/Mail/Bundles

2. Enter the GPGMail.mailbundle directory and add two new UUIDs to Info.plist in the "SupportedPluginCompatibilityUUIDs" section:

E71BD599-351A-42C5-9B63-EA5C47F7CE8E

B842F7D0-4D81-4DDF-A672-129CA5B32D57

And gpgmail is working again.

(As outlined by user Bytes_U on the Apple support forums.)
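
For what it's worth, step 2 can also be scripted. This is a sketch using Python's standard plistlib, not something from the forum post. The function names are mine, and the location of Info.plist inside the bundle (usually under Contents/) is an assumption you should check on your own machine. Exit mail.app and keep a copy of the file before trying it.

```python
import plistlib
from pathlib import Path

# The two UUIDs listed above.
NEW_UUIDS = [
    "E71BD599-351A-42C5-9B63-EA5C47F7CE8E",
    "B842F7D0-4D81-4DDF-A672-129CA5B32D57",
]


def add_compat_uuids(info: dict, uuids=NEW_UUIDS) -> dict:
    """Append any missing UUIDs to SupportedPluginCompatibilityUUIDs."""
    supported = info.setdefault("SupportedPluginCompatibilityUUIDs", [])
    for uuid in uuids:
        if uuid not in supported:
            supported.append(uuid)
    return info


def patch_info_plist(plist_path: Path) -> None:
    """Load Info.plist (XML or binary), add the UUIDs, write it back as XML."""
    with plist_path.open("rb") as fh:
        info = plistlib.load(fh)
    add_compat_uuids(info)
    with plist_path.open("wb") as fh:
        plistlib.dump(info, fh)


# Example (path is an assumption, not run automatically):
# patch_info_plist(Path.home() / "Library/Mail/Bundles/GPGMail.mailbundle/Contents/Info.plist")
```

Existing UUIDs are kept, and running it a second time adds nothing.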

06 June 2010

Flapping Tell-tales: Over-Management of Products and Priorities

If you've ever sailed with a bit of curiosity, you've learned about tell-tales. Essentially they're the little flapping ribbons on a sail that help you know whether your sail is working efficiently or not. If they're flapping about all different directions, your sail isn't doing much for you.

Now imagine one sailing boat with one steering wheel, and 15 people tugging the wheel different directions trying to get the tell-tales to settle down and make efficient use of sail and wind to move you toward your destination. Some of these people jostling about the helm are bigger than others, can shove the others aside and can really yank the wheel one direction. Some just stand there in the way thinking about other wheels they need to stand around later in the day. Others ask nicely for a go at the wheel and nudge it a bit one way or the other (n.b.: not to worry, per Darwin we'll evolve out the namby-pamby collaborative team players soon enough).

The same thing can happen in an internet gambling business with lots of opinionated non-technical product and other manager types trying to control the priorities of a small team of technology developers. There is a lot of "flapping" and the sailing, if you're moving at all, is not in the right direction.

The Tell-tales

Here are common tell-tales of organizational brokenness around product and prioritization:

Product managers are complaining that their "must have" "make or break" feature won't be delivered for months. Sales execs are blaming missing targets on not having a good product to sell and instead are selling products that don't exist. Instead of looking for work someplace where they can make sales and create new products, they plod along selling vapor and complaining about slow IT delivery.

It's also taking a lot of IT management to keep track of and prioritize a rapidly increasing backlog. There are 30 people across the business spinning out one liner requirements into a backlog maintained by 4 managers being worked on by 3 developers.
What's considered urgent one month in the backlog drops to no interest the next month. Features, once delivered, hit the market "too late". Product managers and sales execs are viewed by the technologists as being fickle, lacking follow-through and constantly changing their minds on what they want. IT is blamed for not keeping up with changing market requirements.
You have no bench capacity to chase emergent opportunities. These opportunities pass you by.

People comment that the "politics" in the company is increasing. What uncomfortable behaviors are labeled as "political"?

Conflicting priorities: there are more people to "drive" the organization than there are people to do the work. The people defining the priorities have an inability to agree a common set of priorities and stick to those priorities once they walk out of the "alignment" meeting. The software development manager receives multiple conflicting requests and priorities from two or more people and is then in the position of deciding between them.
Not sure who to follow: Lack of clear who-owns-what decision making structure. Seemingly arbitrary and often passive-aggressive punishment for following the "wrong" person. The "right" person is flavor of the month, then out the next month.
If people complain about a lack of strategy and clear priorities, they're criticized for not having the "mental agility" to work in "ambiguous situations" and requiring everything to be "black and white" and spelled out for them.
Managers are using persuasion, asking for "favors" and a "bit of extra work on the side". What's worse is they're then enabled and applauded for working this way.
People are careful to maintain plausible deniability and other CYA behaviors reducing overall business efficiency.
No one is actively nurturing an environment of trust and indeed such activities are viewed as a waste of time.

You're on a conference call with 15 people and realize:

Of the 15 people, one is there to ask questions, listen carefully, and assess what is going on in order to make high quality decisions. This is the big boss in the meeting. Unfortunately, this one person tends to do most of the talking and when they're not talking, they're "listening" while reading and answering their boss' emails on their Blackberry. This one person has the special ability to pay only marginal attention but then to hear certain keywords and then talk a lot about them.
One of the 15 people actually does the work being discussed.
One manager owns and directly manages the one worker.
Five people are trying to influence the one person who owns resources to do work for them, each one at the loss of the other four. These are product, project, and programme managers. One is the flavor-of-the-month alpha dog while four are rummaging for scraps and taking notes for ammunition against the alpha dog in the future. They alternate between threatening the one worker's manager and sucking up to the worker.
Two of the people distrust someone working for them and want to attend the meeting as their employee doesn't communicate a useful meeting summary with them or doesn't appear to pay much attention or take notes. The two people don't want the person working for them to embarrass the department or create more work with screw-ups. They don't contribute much unless they think their employee is about to screw up or conducting damage control afterwards.
Two people whose motivational imperatives haven't been crushed out of existence yet occasionally pipe in with genuinely helpful "cross-pollination", "how to do that" advice.
Five people on the call think the worker should be grateful to have been brought to a "high level" meeting to receive so much wisdom. The worker isn't one of the five.
There are at least four people who are in other groups, only vaguely associated with this project, have been invited to make sure they're "aligned", and really don't care much about what is being said. They're probably paying no attention and answering email, but no one will include them in the conversation anyway because no one is particularly sure why they're there. These people will often write "lack of communications" as a systemic corporate problem during annual HR surveys.
There are at least five "could you repeat that" requests during the call as people are caught out not paying attention.

What went wrong?

Fundamentally, the company has over-staffed Captains and purchased too few boats and crews. Communications are inefficient and durable alignment is non-existent. It's typically not the fault of the one worker involved (unless they aren't effective at the job they signed up for) - it's a leadership, organizational, and strategy failure.

Ok, so what went wrong? These are the reasons I've come across:

If you're a senior person in the organization, you understand senior people. You have less of an understanding of more junior staff proportional to how far you're removed from them. You hire a more senior employee because you relate to them better. This is ok (in fact, preferred) if the person you hire is going to build a team under them or at least have control over enabling budget/resources. If you find yourself stepping across and obviating an employee, you've created a problem.

Don't hire people you aren't planning to empower
(Humanely!) remove people that are no longer empowered

General wisdom is that a senior person is more skilled, knows more, can get more done. They can cover the more junior responsibilities if they need to.

This should apply to senior employees, but often doesn't. They either can't or don't want to operate at a junior level and are more interested in bizdev, strategy and commercials. They didn't enter the business with strong domain knowledge nor seem interested in developing it (why were they hired again?).
Although the "hire senior employee" wisdom may apply in some business functions, it's often wildly wrong in technology. The scope of knowledge in technology is enormous, changes rapidly, and can take a while to pick up.
Regardless, as a result the senior employee owner can't work at depth, so more junior product owners and/or other managers are hired or step in to fill the gap.

The product partitioning and ownership in the organizational strategy was wrong and the original strategy's advocates resist changing the strategy as they don't want to be seen having made a bad decision.
Not many execs with a non-tech upbringing understand IT. They don't understand how products are created and how long it takes to build IT and product organizations. They think increasing sales, project, product, and programme managers are needed to "keep the IT team focused" when deliveries are late and of low quality. They stay with what they know rather than drilling into detail they can't or don't want to understand.
The product(s) involved are sold in different "flavors" to different categories of customers. Each category will have its own requirements and priorities. There is a lack of overall priority and alignment between them. The flavoring creates new late stage technical requirements.

Considerations

Some assumptions about the right way to get your ship in order:

Don't have the software development manager making strategic business decisions by arbitrating priorities. Great that the manager is a little bit commercially aware and is sensitive to business and customer needs. However, the more time they're building this knowledge, they're not building software. Enable them to focus on software construction.
Don't expect the product managers, often peers, to effectively communicate with each other and prioritize between each other. They're typically measured on delivery of their product line, not playing nice with other product owners. A consistent and simple approach to business cases and approval process will help.
Don't have product managers that have to beg, borrow, steal resources. They shouldn't have to rely on persuasion to acquire core enabling resources as it leads to politics, distrust, confusion and all the related inefficiencies highlighted above.
Do assign one product manager to be the final decision maker accountable for overall product priorities.
Do have the software development team align their backlog priorities to the senior product manager's decisions.
Do hire product managers that understand at least at a high level the underlying technology and can sensibly evaluate technical debt tradeoffs. Hire product managers that are interested in how to most effectively use the technology and can appropriately prioritize non-functional technical requirements.

The Solution

Scale the number of people who want to own and drive product to people who can build those products. This balance is essential. You don't want 15 people behind one helm on one tiny boat.

If you have a big software development team with several potentially unaligned product managers creating work and issuing priorities, use a resource allocation model so that each product manager has to justify and fund a set of resources for their product responsibilities within the software development team. Preferably these are intact teams, like a scrum team. Resource allocations should be based on business value and/or risk management.

Make sure the software development team and manager have their own internal set of resources that they can internally prioritize to take on overall architecture, integration, code reviews, refactoring, mentoring, tools, how to scale the code base and personnel, and other (mostly) non functional requirements.

What about maintaining bench capacity? I have yet to work someplace where there is any bench capacity for very long - everyone is busy. The danger with bench is the organization may degenerate into a beg-borrow-steal model to grab these resources. Ideally try to keep some bench in the internal set of software development resources and balance technical debt versus emergent benefit. It really is powerful in an organization to have a few highly competent innovation-minded resources available to jump on the new market opportunity that just popped up.

Periodically review the resource allocations. Rebalance them between project deliveries to match changing business requirements. Also, commercial realities can change suddenly so you may need to re-allocate resources at short notice. So long as quick changes happen as an exception and not a rule and the reasons for it are clear and consistent, no one is going to object. Indeed it can be exciting and foster improved teamwork.

Identify and eliminate any point in the organization where product and team priorities, ownership and decision making are ambiguous or shared.

Consider carefully whether a project or programme manager should get between a product owner and the lead/manager of the software development team delivering to that product owner. Why does this extra layer exist? Who really understands what is in the backlog? Sometimes this additional management layer is justified if the project/product is big, complicated and/or there are heavy customer-driven process/audit/document requirements.

Here are two basic rules of thumb on whether someone is needed in the management layer:

If a team member forwards email to a peer more frequently than answering it or dealing with it, remove them from the team. If they forward it to someone working for them without adding value, are they delegating appropriately?
If a recurring meeting member has contributed nothing of value for more than a few meetings in a row, remove them from the meeting. They can read the meeting minutes or have a hallway conversation to catch up.

Conclusion

Fundamentally, an organization will be more efficient if there is a good match between Captains, boats and crews. It's not easy to get the balance right, and the need for all three will change frequently.

If you do have to err on one side or the other, err on the side of having more boats and crews. Better to have a little bit of spare execution capacity that can be thrown at an emergent opportunity than to degenerate into politically charged, rudderless anarchy.

03 May 2010

IT Hotsite Best Practices

Introduction

A "hotsite" is a general term for unplanned downtime - a failing site, product, or feature that is having significant impact on revenue generation. A problem is escalated to hotsite level when significant numbers of (potential) customers are affected and a business's ability to earn money is significantly affected. Hotsite handling may or may not be used if the problem is not under direct control of the team controlling a set of systems (e.g., a critical feature the systems depend on is provided by a remote supplier, such as a web service being used by a mashup).

Hotsites happen. Costs increase infinitely as you push your system design and management to 100% uptime. You can aspire for 100% uptime, but it's foolish to guarantee it (e.g., in an SLA). Change can also cause service disruptions. In general, the less change, the less downtime. However, it's rarely commercially viable to strongly limit change.

This article isn't about reducing planned or unplanned downtime. It's a collection of tips, tricks, and best practices for managing an unplanned downtime after it has been discovered by someone who can do (or start to do) something about it. I'll also focus in on a new type of downtime, one that the people involved haven't seen before.

General strategy - the management envelope

Early on in a major problem, it's important to separate technically solving the problem from managing the problem into the wider business. Because an unplanned downtime can be extremely disruptive to a business, it's often almost as important to keep people informed about the event as it is to solve the event itself.

Although that may feel like an odd statement, as a business grows there are people throughout the business that are trying to manage risk and mitigate damage caused by the downtime. Damage control must be managed in parallel with damage elimination.

You want to shelter those that are able to technically solve the problem from those that are hungry for status and are slowing down the problem solving process by "bugging" critical staff for information. Technical problem solving tends to require deep concentration that is slowed by interruptions.

It is the management envelope's responsibility to:

Agree periods of "no interruption" time with the technical staff to work on the problem
Shelter the team from people who are asking for updates but are not helping to solve the problem
Keep the rest of the business updated on a regular basis
Set and manage expectations of concerned parties
Recognize if no progress is being made and escalate
Make sure the escalation procedure (particularly to senior management) is being followed
Make sure that problems (not necessarily root cause related) discovered along the way make it into appropriate backlogs and "to-do" lists

General strategy - the shotgun or pass-the-baton

Throughout the event, you have to strike a balance between consuming every possible resource that *might* have a chance to contribute (the "shotgun") versus completely serializing the problem solving to maximize resource efficiency ("pass-the-baton").

Some technologists, particularly suppliers who might have many customers like yourself, may not consider your downtime as critical as you do. They will only want to be brought in when the problem has been narrowed down to their area and not "waste" their time on helping to collaboratively solve a problem that isn't "their problem".

There is a valid argument here. It is ultimately better to engage only the "right" staff to solve a problem so that you minimize impact on other deliverables. Your judgment about who to engage will improve over time as you learn the capabilities of the people you can call on and the nature of your problems.

However, my general belief is that for a 24x7 service like an Internet gambling site that is losing money every second it is down, calling whoever you think you might need to solve the problem is fully justified. And if you're not sure, err on the shotgun side rather than passing the baton from one person to the next.

General strategy - the information flows and formats

Chat. We use Skype chat with everyone pulled into a single chat. Skype's chat is time stamped and allows some large number of participants (25+) in a single chat group. We spin out side chats and small groups to focus on specific areas as the big group chat can become too "noisy", although it's still useful to log information. It gives us a version history to help make sure change management doesn't spin out of control. We paste in output from commands and note events and discoveries. Everything is time threaded together.

The management envelope or technical lead should maintain a separate summary of the problem (e.g., in a text editor) that evolves as understanding of the problem/solution evolves. This summary can be easily copy/pasted into chat to bring new chat joiners up to speed, keep the wider problem solving team synchronized, and be used as source material for periodic business communications.

Extract event highlights as you go. It's a lot easier to extract key points as you go than going through hours of chat dialogues afterwards.

Make sure to copy/paste all chat dialogues into an archive.

Email. Email is used to keep a wider audience updated about the event so they can better manage into partners and (potential) customers. Send out an email to an internal email distribution list at least every hour or when a breakthrough is made. Manage email recipients' expectations - note if there will be further emails on the event or note if this is the last email of the event.

The emails should always lead off with a non-technical summary/update. Technical details are fine, but put them at the end of the message.

At a minimum, send out a broad distribution email when:

The problem is first identified as a likely systemic and real problem (not just a one off for a specific customer or fluke event). Send out whatever you know about the problem at that time to give the business as much notice as possible of the problem. Don't delay sending this message while research is conducted or a solution is created.
Significant information is discovered or fixes created over the course of the event
Any changes are made in production to address the problem that may affect users or customers
More than an hour goes by since the last update and nothing has otherwise progressed (anxiety control)
At the end of a hotsite event, covering the non-tech details on root cause, solution, and impact (downtime duration, affected systems, customer-facing effects)

Chain related emails together over time. Each time you send out a broad email update, send it out as a Reply-All to your previous email on the event. This gives new-comers a connected high-level view of what has happened without having to wade through a number of separate emails.
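
The email rules above (non-technical summary first, at least hourly, say whether more emails are coming, chain each update onto the last) can be sketched with Python's standard email library. This is an illustration under my own assumptions, not a tool we actually used: the function names, the section labels, and the "hotsite.example" domain are all made up.

```python
from datetime import datetime, timedelta
from email.message import EmailMessage
from email.utils import make_msgid

MAX_SILENCE = timedelta(hours=1)  # "at least every hour"


def update_due(last_sent: datetime, now: datetime, breakthrough: bool = False) -> bool:
    """True when a broad update should go out: a breakthrough, or an hour of silence."""
    return breakthrough or (now - last_sent) >= MAX_SILENCE


def build_update(summary, technical, sender, recipients,
                 subject="Hotsite update", previous=None, final=False):
    """Build one update email; pass the previous update to chain onto it."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # Placeholder domain for the message id, not a real host.
    msg["Message-ID"] = make_msgid(domain="hotsite.example")
    if previous is not None:
        # Reply to the previous update so mail clients thread the event.
        prev_subject = str(previous["Subject"])
        is_reply = prev_subject.lower().startswith("re:")
        subject = prev_subject if is_reply else "Re: " + prev_subject
        prev_id = str(previous["Message-ID"])
        prev_refs = str(previous.get("References", ""))
        msg["In-Reply-To"] = prev_id
        msg["References"] = (prev_refs + " " + prev_id).strip()
    msg["Subject"] = subject
    closing = ("This is the last update for this event." if final
               else "Further updates will follow.")
    # Non-technical summary leads; technical details go at the end.
    body = (f"SUMMARY\n{summary}\n\n{closing}\n\n"
            f"TECHNICAL DETAILS\n{technical}\n")
    if previous is not None:
        body += "\n--- Earlier updates ---\n" + previous.get_content()
    msg.set_content(body)
    return msg
```

Each new message carries the earlier updates underneath it, so a newcomer reading only the latest email still gets the whole story.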

Phone. Agree a management escalation process. Key stakeholders ("The Boss") may warrant a phone call to update them. If anyone can't be reached quickly by email and help is needed, they get called. Keep key phone numbers with you in a format that doesn't require a network/internet connection. A runbook with supplier support numbers on the share drive with a down network or power failure isn't very useful.

The early stage

Potential hotsite problems typically come from a monitor/alert system or customer services reporting customer problems. Product owners/operators or members of a QA team (those with deep user-level systems knowledge) may be brought in to make a further assessment on the scope and magnitude of the problem to see if hotsite escalation is warranted.

Regardless, at some point the first line of IT support is contacted. These people tend to be more junior and make the best call they can on whether the problem is a Big Deal or not. This is a triage process, and is critical in how much impact the problem is going to make on a group of people. Sometimes, a manager is engaged to make a call of whether to escalate an issue to hotsite status. Escalating a problem to this level is expensive as it engages a lot of resources around the business and takes away from on-going work. Therefore, a fair amount of certainty that an issue is critical should be reached before the problem is escalated to a hotsite level. The first line gets better at escalation with practice and retrospective consideration of how the event was handled.

Once the event is determined to be a hotsite, a hotsite "management envelope" is identified. The first line IT support may very well hand off all problem management and communications to the management envelope while the support person joins the technology team trying to solve the problem.

All relevant communications now shift to the management envelope. The envelope is responsible for all non-technical decisions that are made. Depending on their skills, they may also pick up responsibility for making technical decisions as well (e.g., approving a change proposal that will/should fix the problem). The envelope may change over time, and who the current owner and decision maker is should be kept clear with all parties involved.

The technical leader working to solve the problem may shift over time as possible technical causes and proposed solutions are investigated. Depending on the size and complexity of the problem, the technical leader and management envelope will likely be two different people.

Holding pages. Most companies have a way to at least put up "maintenance" pages ("sorry server") to hide failing services/pages/sites. Sometimes these blanket holding pages can be activated by your upstream ISP - ideal if the edge of your network or web server layer is down. Even better is being able to "turn off" functional areas of your site/service (e.g., specific games, specific payment gateways) in a graceful way such that the overall system can be kept available to customers while only the affected parts of the site/service are hidden behind the holding pages.

Holding pages are a good way to give yourself "breathing room" to work on a problem without exposing the customer to HTTP 404 errors or (intermittently) failing pages/services.
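
To make the "turn off functional areas" idea concrete, here is a toy Python sketch. It assumes a simple in-memory set of disabled areas. The area names, the handler table and the message text are hypothetical, and a real site would do this in its web or load-balancer layer.

```python
HOLDING_PAGE = "Sorry, this part of the site is down for maintenance."


def handle(area: str, disabled: set, handlers: dict) -> tuple:
    """Serve the holding page for disabled areas, the real page otherwise."""
    if area in disabled:
        # 503 Service Unavailable rather than a 404 or a half-working page.
        return 503, HOLDING_PAGE
    return 200, handlers[area]()


# Hypothetical functional areas of a gaming site.
handlers = {
    "poker": lambda: "poker lobby",
    "deposits": lambda: "deposit form",
}
disabled = {"deposits"}  # the failing payment area is hidden, the rest stays up

print(handle("poker", disabled, handlers))
print(handle("deposits", disabled, handlers))
```

Only the affected area gets the holding page. Re-enabling it is a matter of removing it from the disabled set.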

Towards a solution

Don't get caught up in what systemic improvements you need to do in the future. When the hotsite is happening, focus on bringing production back online and just note/table the "what we need to do in the future" on the side. Do not dwell on these underlying issues and definitely no recriminations. Focus on solving the problem.

Be very careful of losing version/configuration control. Any in-flight changes to stop/start services or anything created at a filesystem level (e.g., log extract) should be captured in the chat. Changes of state and configuration should be approved in the chat by the hotsite owner (either the hotsite tech lead or the management envelope). Generally agree within the team where in-flight artifacts can be created (e.g., /tmp) and naming conventions (e.g., name-date directory under /tmp as a scratchpad for an engineer).
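
The scratchpad convention is easy to pin down in a few lines. A minimal Python sketch, assuming the "name-date directory under /tmp" pattern mentioned above; the exact date format and the function names are my own choices.

```python
from datetime import date
from pathlib import Path


def scratch_dir(engineer: str, day: date, root: Path = Path("/tmp")) -> Path:
    """Return the agreed scratchpad path, e.g. /tmp/jeff-20100503."""
    return root / f"{engineer}-{day:%Y%m%d}"


def make_scratch_dir(engineer: str, day: date = None, root: Path = Path("/tmp")) -> Path:
    """Create the scratchpad directory if needed and return its path."""
    path = scratch_dir(engineer, day or date.today(), root)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Paste the resulting path into the chat when you create it, so anyone tidying up afterwards knows where the in-flight artifacts live.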

All service changes up/down and all config file changes or deployment of new files/codes should be debated, then documented, communicated, reviewed, tested and/or agreed before execution.

Solving the problem

At some point there will be an "ah-ha" moment where a problem is found or a "things are looking good now" observation - you've got a workable solution and there is light at the end of the tunnel.

Maintaining production systems configuration control is critical during a hotsite. It can be tempting to whack changes into production to "quickly" solve a problem without fully understanding the impact of the change or testing it in staging. Don't do it. Losing control of configuration in a complex 24x7 environment is the surest way to lead to full and potentially unrecoverable system failure.

While it may seem painful at the time, quickly document the change and communicate it in the chat or email to the parties that can intelligently contribute to it or at least review it. This peer review is critical in helping to prevent making a problem worse, especially if it's late at night trying to problem solve on little or no sleep.

Ideally you'll be able to test the change out in a staging environment prior to live application. You may want to invoke your QA team to health check around the change area on staging prior to live application.

Regardless, you're then ready to apply the change to production. It's appropriate to have the management envelope sign off on the fix - certainly someone other than the person who discovered and/or created the fix must consider overall risk management.
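
The gate described in this section (document the change, get a peer to review it, have someone other than the author sign off) can be written down as a simple check. This is a Python sketch of the rule only. The class and field names are mine, and staging tests are left out because the text treats them as ideal rather than mandatory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeProposal:
    description: str
    author: str
    documented_in_chat: bool = False
    reviewed_by: Optional[str] = None   # peer who reviewed the change
    approved_by: Optional[str] = None   # e.g. the management envelope


def ready_for_production(change: ChangeProposal) -> bool:
    """True only if the change is documented, peer reviewed, and
    signed off by someone other than its author."""
    return (
        change.documented_in_chat
        and change.reviewed_by is not None
        and change.reviewed_by != change.author
        and change.approved_by is not None
        and change.approved_by != change.author
    )
```

A change that its author both wrote and approved fails the check, which is the point.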

You might decide to briefly hold off on the fix in order to gather more information to help really find a root cause. It is sometimes the case that a restart will likely "solve" the problem in the immediate term, even though the server may fail again in a few days. For recurring problems the time you spend working behind the scenes to identify a more systemic long term fix should increase with each failure.

In some circumstances (tired team, over a weekend) it might be better to shut down aspects of the system rather than fix it (apply changes) to avoid the risk of increasing systems problems.

Regardless, the step taken to "solve" the problem and when to apply it should be a management decision, taking revenue, risk, and short/long term thinking into account.

Tidying up the hotsite event

The change documentation should be wrapped up inside your normal change process and put in your common change documentation archive. It's important you do this before you end the hotsite event in case there are knock on problems a few hours later. A potentially new group of people may get involved, and they need to know what you've done and where they can find the changes made.

Some time later

While it may be a day or two later, any time you have an unplanned event, as IT you owe the business a follow-up summary of the problem, effects and solution.

When putting together the root cause analysis, keep asking "Why?" until you bottom out. The answers may become non-technical in nature and become commercial, and that's ok. Regardless, don't be like the airlines - "This flight was late departing because the aircraft arrived late." That's a pretty weak excuse for why the flight is running late.

Sometimes a root cause is never found. Maybe during the event you eventually just restarted services or systems and everything came back up normally. You can't find any smoking gun in any of the logs. You have to make a judgment call on how much you invest in root cause analysis before you let go and close the event.

Other times the solution simply isn't commercially viable. Your revenues may not warrant a super-resilient architecture or highly expensive consultants to significantly improve your products and services. Such a cost-benefit review should be in your final summary as well.

At minimum, if you've not solved the problem hopefully you've found a new condition or KPI to monitor/alert on, you've started graphing it, and you're in a better position to react next time it triggers.

A few more tips

Often a problem is found that is the direct responsibility of one of your staff. They messed up. Under no circumstances should criticism be delivered during the hotsite event. You have to create an environment where people are freely talking about their mistakes in order to effectively get the problem solved. Tackle sustained performance problems at a different time.

As more and more systems and owners/suppliers are interconnected, the shotgun approach struggles to scale as the "noise" in the common chat increases in proportion to the number of people involved. Although it creates more coordination work, side chats are useful to limit the noise, bringing in just those you need to work on a sub-problem.

Google Wave looks like a promising way to partition discussions while still maintaining an overall problem collaboration document. Unfortunately, while it's easy to insist all participants use Skype (many do anyway), it's harder with Wave, which not many have used and for which many don't even have an account or invite.

Senior leadership should reinforce that anyone (Anyone! Not just Tech) in the business may be called in to help out with a hotsite event. This makes the intact team working on the hotsite fearless about who they're willing to call for help at 3am.

Depending on the nature of your problem, don't hesitate to call your ISP. This is especially true if you have a product that is sensitive to transient upstream interruptions or changes in the network. A wave of TCP resets may cause all kinds of seemingly unrelated problems with your application.

Conclusion

Sooner or later your technical operation is going to deal with unplanned downtime. Data centres aren't immune to natural disasters and regardless, their fault tolerance and verification may be no more regular than yours.

When a hotsite event does happen, chances are you're not prepared to deal with it. By definition, a hotsite is not "business as usual" so you're not very "practiced" in dealing with them. Although planning and regular failover and backup verification are a very good idea, no amount of planning and dry runs will enable you to deal with all possible events.

When a hotsite kicks off, pull in whoever you might need to solve the problem. While you may be putting a spanner into tomorrow's delivery plans, it's better to err on the shotgun (versus pass-the-baton) side of resource allocation to reduce downtime and really solve the underlying problems.

And throughout the whole event, remember that talking about the event is almost as important as solving the event, especially for bigger businesses. The wider team wants to know what's going on and how they can help - make sure they're enabled to do so.

Using MobileMe's iDisk as an interim backup while traveling

Introduction

I use an Apple laptop hard disk as my primary (master) data storage device. To provide interim backups while traveling, I use Apple's MobileMe iDisk for network backups to supplement primary backups only available to me when I'm at home.

Having dabbled with iDisk for a few years, I have two key constraints for using iDisk:

I don't always have a lot of bandwidth available (e.g., a mobile phone GPRS connection) and I don't want a frequent automatic sync to hog a limited connection.
I don't trust MobileMe with primary ownership of data or files. Several years ago I switched to using the iDisk Documents folder (with local cache) for primary storage but then had several files magically disappear.

I've now evolved to using iDisk as a secondary backup medium. I manually run these steps when I have plenty of bandwidth available. There are two steps to this:

rsync files/folders from specific primary locations to a named directory under iDisk
Sync the iDisk

How to do it

The rsync command I use looks like this:

for fn in Desktop dev Documents Sites; do
   du -sk "/Users/my_username/$fn" | tee -a ~/logs/laptop_name-idisk.rsync.log
   rsync -avE --stats --delete "/Users/my_username/$fn" "/Volumes/my_mobileme_name/laptop_name/Users/my_username" | tee -a ~/logs/laptop_name-idisk.rsync.log
done


The rsync flags in use:


-a         archive (-rlptgoD no -H)
           -r    recursive
           -l    copy symlinks as symlinks
           -p    preserve permissions
           -t    preserve times
           -g    preserve group
           -o    preserve owner
           -D    same as "--devices --specials" (preserve device and special files)
-v         verbose
-E         preserve extended attributes (Apple's bundled rsync; upstream rsync uses -E for executability)
--stats    detailed info on sync
--delete   remove destination files not in source

Explanation:

I'm targeting specific locations that I want to back up that aren't overly big but tend to change frequently (in this case several folders from my home directory: Desktop, dev, Documents, Sites)
A basic log is maintained, including the size of what is being backed up (the "du" command)
I use rsync rather than copy because rsync is quite efficient - it generally only copies the differences, not the whole fileset.
The naming approach on the iDisk allows me to keep a backup by laptop name allowing me to keep discrete backup collections over time. My old laptop and backups sit beside my current laptop backups.
The naming approach also means I don't use any of the default directories supplied by iDisk as I'm not confident that Apple won't monkey with them.
~/Library/Mail is a high change area but not backed up here (see below for why)

The rsync updates the local iDisk cache. Once the rsync is complete (after the first rsync I find it takes less than 10 seconds for subsequent rsyncs), manually kick off an iDisk network sync (e.g., via a Finder window, clicking on the icon next to iDisk).
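As a minimal sketch, the rsync step could be wrapped in a function that refuses to run when the iDisk volume isn't mounted. The function name backup_to_idisk and its arguments are hypothetical, it uses $HOME in place of /Users/my_username, and it leaves out the -v, -E, --stats and logging parts of my command to keep the example short:

```shell
#!/bin/sh
# Sketch only: the mount point, laptop name and folder list are placeholders.
backup_to_idisk() {
    idisk="$1"     # e.g. /Volumes/my_mobileme_name
    laptop="$2"    # e.g. laptop_name
    shift 2        # remaining arguments are folder names under $HOME

    # Bail out early if the iDisk volume isn't mounted.
    if [ ! -d "$idisk" ]; then
        echo "iDisk not mounted at $idisk; skipping backup" >&2
        return 1
    fi

    # Mirror the home directory path under a per-laptop directory.
    dest="$idisk/$laptop$HOME"
    mkdir -p "$dest" || return 1

    for fn in "$@"; do
        # No trailing slash on the source, so the folder itself lands under $dest.
        rsync -a --delete "$HOME/$fn" "$dest" || return 1
    done
}

# Example (commented out so that sourcing this file has no side effects):
# backup_to_idisk /Volumes/my_mobileme_name laptop_name Desktop dev Documents Sites
```

The mount check matters because --delete is in play: it's better to stop with a clear message than to let rsync run against a path that isn't the iDisk.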

An additional benefit to having a network backup of my important files and folders is that I can view and/or edit these files from the web, iPhone, or PC. I find that being able to access email/IMAP from alternative locations is the most useful feature, but I have had minor benefit from accessing files as well when my laptop was unavailable or inconvenient to access (e.g., quick check of a contract term in the back of a taxi on an iPhone).

Other Backups

I have two other forms of backups:

Irregular use of Time Machine to a Time Capsule, typically once a week if my travel schedule permits.
MobileMe's IMAP for all email filing (and IMAP generally for all email).

Basically, if I'm traveling, I rely on rsync/iDisk and IMAP for backups. I also have the ability to recover a whole machine from a fairly recent Time Machine backup.

Success Story

In June 2009 I lost my laptop HDD on a return flight home after 2 weeks of travel. I had a Time Machine backup from right before I'd left on travel, and occasional iDisk rsyncs while traveling.

Once I got home I found an older HDD of sufficient size and restored from the Time Machine image from the Time Capsule. This gave me a system that was just over 2 weeks "behind". Once IMAP synchronized my mailboxes, that only left a few documents missing that I'd created while traveling. Luckily I'd run an rsync and iDisk sync right before my return flight, so once I'd restored those, I'd recovered everything I'd worked on over the two weeks of travel, missing only some IMAP filing I'd done on the plane.

Weakness

The primary flaw in my approach is that you have to have the discipline to remember to manually kick off the rsync and iDisk sync after you've made changes you don't want to lose. I certainly don't always remember to run it, nor do I always have a good Internet connection available to enable it. However, I find that remembering sometimes is always better than not having any recent backup at all.
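One way to soften this, sketched here as an assumption rather than something I actually run, is a small check that nags when the rsync log hasn't been touched for a while. The function name backup_is_stale and the 3 day threshold are hypothetical; it relies on the log being appended to on every backup run:

```shell
#!/bin/sh
# Sketch only: succeeds (returns 0) when the backup log looks stale.
backup_is_stale() {
    log="$1"     # e.g. ~/logs/laptop_name-idisk.rsync.log
    days="$2"    # age threshold in days

    # No log at all counts as stale.
    if [ ! -f "$log" ]; then
        return 0
    fi

    # find prints the path only when the file was last modified
    # more than $days days ago (find rounds ages to whole days).
    [ -n "$(find "$log" -mtime +"$days" -print)" ]
}

# Example (commented out so that sourcing this file has no side effects):
# backup_is_stale ~/logs/laptop_name-idisk.rsync.log 3 && echo "Time to run the iDisk backup"
```

Called from a shell profile or a daily cron job, this only reminds; it still leaves the decision of when to use the bandwidth with you.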

Alternative Approaches

An obvious alternative is to use the MobileMeBackup program that is preloaded onto your iDisk under the Software/Backup directory. Using this tool, you should be able to perform a similar type of backup to what I've done here. I've not tried it as it was considered buggy back when I first started using iDisk for network backups. I'll likely eventually try this and may shift to it if it works.

A viable alternative approach is to carry around a portable external hard drive, and make Time Machine backups to it more frequently than you would otherwise do over the network via iDisk. You could basically keep a complete system image relatively up-to-date if you do this. More hassle, but lower risk and easier recovery if your primary HDD fails. However, if you get your laptop bag and external HDD stolen, you'll be worse off.

While on holiday recently, I was clearing images off of camera SD card memory as it filled up. I put these images both on the laptop HDD and an external HDD. This protects me from laptop HDD failure, but wouldn't help if both the laptop and external HDD were stolen.

iDisk Comparison to DropBox

DropBox is a popular alternative to iDisk. I find DropBox to be better at quickly and selectively sharing files, it has better cross-platform support (particularly with a basic Android client), and its sync algorithm seems to work better than the iDisk equivalent. You could certainly do everything described here with DropBox.

The downside with DropBox is having to pay $120 per year for 50GB of storage versus $60-100 per year ($60 on promotion, e.g., with a new Apple laptop; otherwise $100) for 20GB of storage with MobileMe. I find 20GB to be plenty for IMAP, iDisk and photos providing I filter out big auto-generated emailed business reports (store on laptop disk not in IMAP), and only upload small edited sets of photos. I'll probably exhaust the 20GB in 2-3 more years at my current pace, but I'd expect Apple to increase the minimum by the time I would otherwise be running out of space.

MobileMe is of course more than just iDisk, so if you use more of its features, it increases in value relative to DropBox.

Both iDisk and DropBox are usable choices; the differences are not sufficiently material to strongly argue for one or the other. I have seen iDisk improve over the last few years and I'd expect Apple to eventually catch up with DropBox.

Conclusion

While I'm not confident in using MobileMe's iDisk as a primary storage location, I have found it useful as a network backup. Combined with normal backups using Time Machine and Time Capsule, it provides a high-confidence recovery from damaged or lost primary use laptops.

21 March 2010

Using wget to ask jspwiki to re-index its search DB

For whatever reason, our installation of jspwiki (v2.8.2) decides to ignore or lose pages out of its index (hey, what do you want for free?!). With our jspwiki hitting 2000 pages, search is the main tool to find pages. Unfortunately, I've taken to keeping my own links page to important pages just so I don't lose them as the search indexing seems to break regularly. While a re-index solves the problem, it requires going into the site, authenticating, and clicking a button - way too much work.

Here is a quicky to use wget to log in to jspwiki and force a re-indexing of pages:



# POST to log in and get login and session cookies

wget --verbose --save-cookies=cookie --keep-session-cookies --post-data="j_username=myuid&j_password=mypw&redirect=Main&submitlogin=Login" "http://wiki.mydomain.com/JSPWiki/Login.jsp" --output-document "MainPostLogin.html"



# POST to kick off reindexing using cookies

wget --verbose --load-cookies=cookie --post-data="tab-admin=core&tab-core=Search+manager&bean=com.ecyrd.jspwiki.ui.admin.beans.SearchManagerBean&searchmanagerbean-reload=Force+index+reload" --output-document "PostFromForceIndexReload.html"  "http://wiki.mydomain.com/JSPWiki/admin/Admin.jsp"

Tweak myuid, mypw, and wiki.mydomain.com in the above to have them be what you need. Drop the output once you're comfortable it's working (I was saving it in the above to make sure I could see artifacts of being authenticated in the output).

Put the above into a cron'ed script and run it hourly.
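As a rough sketch of what that script might look like, the two wget calls can be wrapped in a function that keeps the cookie in a temporary file and cleans it up afterwards. The function name reindex_wiki, the script path in the crontab comment, and the choice to discard output are my assumptions; the form fields are the ones from the commands above:

```shell
#!/bin/sh
# Sketch only. Hypothetical crontab entry to run it at the top of every hour:
#   0 * * * * /usr/local/bin/jspwiki-reindex.sh

reindex_wiki() {
    base="$1"    # e.g. http://wiki.mydomain.com/JSPWiki
    user="$2"    # e.g. myuid
    pw="$3"      # e.g. mypw
    cookies=$(mktemp) || return 1

    # Log in first (saving the session cookie), then POST the re-index
    # request with that cookie. The second call only runs if the first succeeds.
    if wget --quiet --save-cookies="$cookies" --keep-session-cookies \
            --post-data="j_username=$user&j_password=$pw&redirect=Main&submitlogin=Login" \
            --output-document=/dev/null "$base/Login.jsp" &&
       wget --quiet --load-cookies="$cookies" \
            --post-data="tab-admin=core&tab-core=Search+manager&bean=com.ecyrd.jspwiki.ui.admin.beans.SearchManagerBean&searchmanagerbean-reload=Force+index+reload" \
            --output-document=/dev/null "$base/admin/Admin.jsp"
    then
        status=0
    else
        status=1
    fi

    # The cookie file holds a live admin session, so don't leave it lying around.
    rm -f "$cookies"
    return "$status"
}

# Example (commented out so that sourcing this file has no side effects):
# reindex_wiki "http://wiki.mydomain.com/JSPWiki" myuid mypw
```

Two caveats on the sketch: wget only reports HTTP-level failures, so a rejected login that still returns a normal page would go unnoticed, and a password containing characters like & or = would need URL-encoding before going into --post-data.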

Note that not all versions of wget are created equal: 1.10 didn't seem to work, but 1.10.2 and 1.12 worked fine for the above.

QCon London 2010 - Miscellaneous Topics

There was no shortage of interesting topics at QCon London 2010. Although I'm writing in some depth about a few of them due to personal interest and/or applicability to Internet gambling, there are many others I'll highlight here briefly.

Shared nothing architecture
- Each node of a system is stand-alone and shares nothing with other nodes
- Great horizontal scalability
- Shared databases, data stores, caches are constraining
- Great for stateless, single-shot request-response, and content oriented services; less so for multi-state transactional systems

Industry Consolidation driving big boys architectures
- Internet traits such as the network effect and rapid feedback loops accelerate consolidation on a single market-dominant (de facto monopoly) service (e.g., ebay, betfair)
- Big consolidated services require a big compute capacity. Market convergence on a single supplier isn't possible if that supplier can't scale to meet demand.
- Big compute capacity requires a lot more thought on the "-ilities" (non-functional attributes) of service delivery. Functionality becomes commodity.
- Web technologies are embracing "traditional" approaches to increase compute capacity: asynchronous message oriented design, greater attention to maximizing hardware
- Consolidation also means longer life of legacy software
- The CAP theorem (see below) is coming into play for big systems that want to be highly available and need to massively scale

Programmer Quality of Life wins over Abstraction and Separation
- XML is painful
- Co-locate configuration with code (annotations)
- Convention over configuration (even Java coming on board with apps like Roo on Spring)
- Repetitive coding requirements should be built in (no boilerplate or scaffolding) - aspects relatively for free
- (Where does that leave dependency specification? Hello maven pom.xml my nemesis!)

HTML 5, CSS 3, and Javascript versus rich client interface technologies
- Native executables (Wintel binaries) used to be way ahead of the browser on usability and richness but HTML/CSS/JS continues to move the browser experience closer to native executable experience
- Major new browser advances are right around the corner
- Flash, Air, Silverlight - great interfaces but browser continues to advance
- Mobile causing a renaissance of RIA and native executables - but browser continues to advance
- Innovation areas will tend to use an RIA and then the browser will catch up
- High touch experience (e.g., game graphics with high performance requirements) will require native executable performance for some time to come
- For most enterprise and business requirements, the browser experience is already sufficient today

Power efficiency, carbon credits and trading
- Assuming carbon trading advances, we might see a day where well written (more efficient, less energy consumptive) applications are important again
- Energy efficient HW (e.g., Sparc v Intel) may be more valued
- Some odd things may happen such as shifting compute capacity (carbon emission) to third world "carbon dumping grounds" due to economic incentives

Right tool for the right job versus efficiency from limited technology choices
- Although Java is dominant in the enterprise, Ruby is making inroads. Recognition of productivity boost of a pleasant coding environment that encourages DRY and good programming techniques.
- Functional languages that facilitate multi-core (parallel) computing are increasing in popularity as currently popular languages in the enterprise do not (Java!)
- Advent of language neutral information passing protocols to better enable innovation within components (but not forcing between components)
- As of today, homogeneous technology choices for the enterprise are still winning

Software Developer can "do it all"
- Moving test into development through TDD (and from unit to functional and some end-to-end)
- Cloud services abstracting operational systems (the specific HW and OS don't matter)
- Moving live deployment into development (Continuous Integration leading to Continuous Deployment)
- Better to use a shared nothing architecture under developer's control than reliance on specialty approaches like a cache in BigIP F5s or a shared in-memory cache

And a grab bag of others:

OSGi and Java. JARs lack versioning and dependency declarations and therefore lack safe coupling. OSGi defines bundles to make integration/upgrade safer. Feels complicated versus using a Convention over Configuration approach. Could we use co-located annotations in the code instead to describe dependencies? What about dependencies outside a specific application/JVM?
SOA (Service Oriented Architecture) is dead, long live SOA! (No one seems to like SOA but a lot of practices from SOA are in prevalent and growing use)
TDD (Test Driven Development) is pretty much assumed now even for the smallest teams and projects. CI in varying states but clearly the next development practice that will be an assumption shortly
Log everything (Google, Facebook) - both customer actions and internal systems and be able to compare anything to anything
CPU clocks hitting speed limits. Until some new as yet unidentified technology breakthrough, CPU clock speeds have hit about as fast as they're going to be. From now forward it will be about parallel processing on a growing number of cores.
DDD (Domain Driven Design) - Design software with a specific stakeholder's interests at heart, using the stakeholder's terms ("Ubiquitous Language"). Let the stakeholder interest area ("Bounded Context") warp a "perfect" implementation to one that is tailored to the stakeholder's needs. In a complex system, identify the Domains of interest, and design around each of them in parallel with figuring out how to glue together these Domains.
CAP theorem - pick 2: Consistency, Availability, and Partition tolerance (CAP). Business will generally pick Availability and Partition tolerance, so that leaves Consistency as the odd man out and implies that more attention is then needed on identifying and recovering from inconsistent states. Eventual consistency for some functions is sufficient.
New persistence models - Social networks with their many-to-many relationships in the data are driving the use of new persistence models to supplement their relational databases
Dreyfus model of skill acquisition - a good way to take a view on how people pick up skills and as a way to assess how skilled/mature your staff actually is

17 March 2010

QCon London 2010 - Cloud Computing

Cloud computing and virtualization were popular topics at QCon London 2010.

Background/primer/proposition:

Cloud marketing suggests that hardware and/or systems administration is now a commodity that you shouldn't have to think about too much and can safely outsource.
Just like TDD (Test Driven Development) decreases the need for QA, CI (Continuous Integration) with direct deployments into an operational environment will decrease the need for systems administration.
Outsourced pay-as-you-use cloud propositions will likely shift the budget for computing capacity from capex to opex (HW and SW were traditionally capex)
Grossly simplifying, there are four interesting cloud propositions available:

In-house hardware virtualization - cloud under your control, in your data centre (e.g., VMware, Xen, Solaris Zones)
Outsourced hardware virtualization (IaaS - Infrastructure as a Service) - cloud as an "infinite capacity" of generic computing and you define the systems from the OS up (e.g., Amazon's AWS EC2)
Outsource compute capacity (PaaS - Platform as a Service) - cloud as a place to deploy software components into a fairly tightly defined (constrained) operating environment (e.g., Google's App Engine)
Pure services (SaaS - Software as a Service) - cloud as a source of "commoditized" services to be used when you construct an application (e.g., Google's web analytics, Facebook OAuth API for user credential management, AWS's S3 for storage)

Cloud means that you can cost effectively create and delete computing resources as needed for parts of your IT environment that don't require regular use. For example testing and in particular load testing.
Non-tech business types get excited by cloud because:

If you're an entrepreneur type, you get bonus points for running your infrastructure from the cloud when looking for funding (more so in the last two years, this is declining some now)
Finance and P&L owners get excited any time they can commoditize something to drive down costs. Tech has mixed feelings about this as "drive down costs" tends to imply redundancies.
Easier to justify upfront costs for a new business case if you only pay for what you use (a failure is easy to delete, no sunk capex expenditures)

Both tech and non-tech types get excited about not having to generate a lot of paperwork then wait for authorizations and shipping times to get new kit. Assuming company bureaucracy doesn't shackle down cloud controls too vigorously, a new virtual platform can be made available very quickly and at low cost.
If you can maximize utilization of HW you buy, then it's no different than buying cloud resources (likely cheaper)

General Observations on Cloud and Virtualization

Virtualization enables us to achieve that solutions architecture ideal of "one box one purpose", it's just that it's become "one virtual box one purpose".

Virtualization lets applications that don't have a good threading model take advantage of boxes with many cores and use up all the cores (application per VM; VMs added until all cores are utilized)

Cloud does imply a lack of control over your core infrastructure. Do you need this control?

The cloud is still just a bunch of hardware systems in a data centre. There is no magic. Their DC and systems admins will have their share of problems as well. If the cloud sysadmins can provide more uptime than your own techops can provide at a similar cost point, the argument for cloud increases.

Similarly, there is debate over how good the SLAs are for cloud. But really, how enforceable are the SLAs you have anyway?

Your choice of virtualization or cloud will enforce a way of creating applications and handling services. You may not like it. Conversely, it may force you to be disciplined in a new way otherwise missing when you create applications.

You will make an investment to learn the systems and make your applications work in the cloud environment. This will cost and create some lock-in. This is more true for PaaS than IaaS.

The cloud is being used to "long tail" a number of services. Service "particles" are appearing you can use to provide an aspect of functionality in your overall solution. The more of these partners you use that are in the same cloud with you, the greater the efficiencies and hence lower costs. Combined with first mover advantage and vendor lock-ins, this is a network effect that should drive toward having just a few cloud suppliers in a few years.

Relating Cloud to Internet Gambling Business

The use of an in-house cloud like VMware makes good sense. We're regularly adding in new products that need to undergo development and test yet we don't need permanent capacity to service these requirements. While a VMware setup can't fully proxy a production environment (unless you use VMware in production as well), it is very suitable for most types of functional verification other than load and low level device compatibility.

Being able to hand the keys over to a set of virtualized servers enables more entrepreneurial behavior. For example, if you have a larger business that has a heavy layer of process, you can still work effectively with start-up partners. Give them the keys to their own set of systems and they can do whatever they want with them without impacting your core systems. Once they've proven successful, their revenue stream can justify improved risk management.

Handling flash crowds with cloud probably isn't possible for our industry today. In-house clouds don't really handle flash crowds (Why not just have the capacity there anyway? What do you want to cripple to support that big marketing campaign?). Outsourced cloud generally isn't possible as the bigger cloud providers may not allow internet gambling to be run within their clouds (AWS restriction anyway; and yes, this will likely ease up at some point, just look at Akamai's behavior on Internet Gambling). Also a CDN (Content Distribution Network; a SaaS of sorts) will take care of a lot of the flash crowd load we experience.

Using an outsourced cloud PaaS for data analytics doesn't seem likely. Data analytics crunching benefits from close proximity to the data set being crunched. Uploading big data sets into the cloud from locations with higher connectivity costs (lots of internet gambling is in offshore locations with expensive ISP costs) doesn't make sense.

SaaS however is quite interesting. Services like Google Analytics that enable almost real-time data analysis are clearly the way to go for an Internet gambling site. Highly bespoke business analytics will likely stay inside the business or use a SaaS for commodity analytics.

Depending on who you ask, the following may be real risks or just FUD:

Taxation - as services are sourced from someplace other than the tax advantaged place you have your business in, you are at risk of emerging taxation implications
Centralized point for governments to enforce legal compliance. By hosting in the cloud (which is actually going to be one or more physical data centres), you've given the governments that have oversight of those data centres a good choke point to use against you. They could use taxation, inappropriate content, or services not in compliance with regulation as grounds to act against you.

Conclusion

Virtualization makes complete sense for Internet gambling companies, all the way from development through to production. That's not news; most in our sector have been using virtualization for a few years now.

On Cloud/IaaS provisions, AWS (a clear IaaS market leader) have flatly disallowed any internet gambling related operations inside their service. While it is likely you could get away with internal use (dev, test) of cloud in these services, do you want to create a dependency and then have it suddenly shut off on you? AWS of course isn't the only show in town for IaaS. There are other providers - you would have to evaluate them versus related risk factors and re-development costs to integrate their use into your environment.

There is no clear use yet of Cloud/PaaS for standard Internet gambling products.

There are plenty of emergent opportunities to use Cloud/SaaS for Internet gambling.

(Index of emergent technologies applied to Internet Gambling)

QCon London 2010 - Themes and Trends

Last week I had the good fortune to attend QCon London which bills itself as an "enterprise software development conference".

I thought the conference struck a good balance between maybe 40% academic/futures/ideas from the ivory towers versus 60% practical, grunty software development from the trenches. That of course varied by what sessions you attended as there were various tracks and tutorials available.

QCon was fairly software development centric. Although there were tracks on technical operations and QA, both felt more like "how software development thinks techops and QA should work" versus hardcore QA and techops experts running the tracks and presenting.

Although billed as "enterprise" software development, QCon was new media centric. Less about enterprise and more about entrepreneurship using (recently) new tools and techniques to deliver and manage software. I found this quite suitable for igaming, which is still more entrepreneur land than it is enterprise.

The following are themes and trends that were in the air at QCon that captured my interest. Some will be old and familiar (yet receiving continued attention), others are relatively emergent in the last year or so. Each item may eventually lead to a blog entry with detailed commentary on the subject and, as relevant, a view on how they apply to internet gambling systems.

Cloud Computing
NoSQL versus Relational Databases
RESTful architecture
Functional programming languages
Post Scrum
Mobile Computing
Event based architectures, asynchronous messaging
DevOps, particularly Continuous Deployment
Miscellaneous topics

21 February 2010

Internet Gambling Jobs in Gibraltar

Gibraltar has plenty of on-line gambling companies and there is almost always some form of related recruitment going on.

Whether you want to make a fresh start in Southern Spain or just got made redundant, the following are some good starting points if you're interested in working in Gib. While I've got an IT bias, none of the companies below specialize only in IT.

(Please note I'm not affiliated with any of the companies listed below although I've talked with all of them over the years.)

Gibraltar Local Recruiters

Quad has been in Gibraltar for a long time now (at least in dog years), lists Gibraltar online gambling jobs, and has plenty of information about Gib and Spain on their site.

Ambient has been around for a fair amount of time as well. They are not Gibraltar based, but close enough (up the Costa).

SRG has just opened up their office next door in Europort in Gib. Their website covers some basics about Gib like living in Gib/Spain and local Income Tax.

Other Recruiters Operating into Gibraltar

There are plenty of other recruiting companies outside of Gib (typically UK) that operate into Gib. They come and go, and the recruiters themselves change over time.

There are also a variety of headhunters that typically work other sectors that come and go.

There are two companies that have been around for a long time that have done plenty of work for Gib based companies: BettingJobs.com and Pentasia. Both of these companies place world-wide but you can find Gibraltar jobs on their sites as well.

Other Job Sources for Gibraltar

It's traditional (but not cheap!) to post jobs in The Gibraltar Chronicle newspaper on Fridays. Yes, this is an actual paper newspaper, just like the ones Grandpa used to read. They don't cross-post jobs to their website.

I keep an eye on jobserve using category IT and keyword "Gibraltar" to create an RSS feed to see what my IT colleagues around Gib are up to.

The GRA (Gibraltar Regulatory Authority) thoughtfully provides a summary page of all "remote gambling" operators with Gib licenses.

Wildcards

I've not personally worked with the following, but they at least had a few listings for Gib or along the Costa. YMMV.

gibraltarportal.com lists a few local jobs.

surinenglish.com delegates its recruitment to myservicesdirectory.com. It's very Spain oriented, with not much of interest on the Gib and IT side.

I'm not familiar with Andalucia Technology Recruitment. I've not seen anything for Gib on their site, but they do have a few IT roles along the Costa.

Bits and Pieces

There is an Excel sheet you can download from the Gib government site to calculate your potential income tax. With the same salary, you'll typically be better off in Gib than most other European countries.

Other starting points

EGR has published a short list of nominees for their 2010 igaming awards. It's also a good source of companies to look at, although certainly not limited to just Gibraltar.

20 February 2010

Why does new product/feature development take so long?

This is a post for non-technical readers (particularly non-technical product and high level feature owners) to explain why technology is "so slow" to deliver your new products and features.

Comparative baseline. You need to ask yourself, "Why do I think something is taking too long?" What's my baseline, what am I comparing to when I think "slow"? More often than not, I find that people are just displacing their general frustration in not having something they want (much like a child does), and take it out on the deliverer (the parent).

That, or you're just the type of person that moans about most things generally, so please stop.

I find that some stakeholders either delight in or don't realize they're only selectively comparing one organization to another. Why can Company X deliver a new feature in a week when it takes our tech organization 3 months? They rarely seek to understand why, they just want to selectively pick comparison points to criticize the delivery team.

Of course what they may not seek to understand is that Company X:

May have more and better technologists (perhaps at a higher cost base)

Invested in a technology delivery system that is much more efficient to extend, scale, and maintain (i.e., relatively less technical debt accumulated)

Enables a significantly different approach to IT due to a different revenue and cost structure (e.g., can afford better kit, replaced more often, serviced by better more expensive support channels)

True switching costs. Perhaps you're the person that was frustrated with the rate of in-house delivery, and sought out a big supplier to deliver what you're looking for. Hey, cheaper right? And you got to teach those in-house slackers a lesson. You get bigger economies of scale, and a lot more people to deploy to build your solutions. But then it turns out your knowledge domain is new to the supplier. Congratulations! You just paid for the privilege of bringing a bunch of new and external people up to speed that didn't know your domain from a hole in the ground. And then the supplier takes that knowledge out to other customers, finally creating that economy of scale you were sold on in the first place (Supplier: "Hey, thanks very much for giving me a new area to expand my business and dilute your priorities and control!").

A really good technologist or really thorough domain knowledge, especially both together, is rarely a commodity. A "Java programmer" is a commodity. A "really good Java programmer who understands our customers' requirements and has several years' experience developing our products within our business culture" is not.

You're a "get what you want", "bulldog", "win-lose" negotiator. You might be the type of stakeholder that always demands things be done more quickly than what's been presented to you. You might think that technical trade-off discussions are just technobabble to justify a "heavily padded", "low risk", "plenty of surfing time" schedule. Perhaps you think you're a tough negotiator and you're stripping out the "fluff" the technologists inserted into the plan to save the shareholders' money. Either way, that means you think you understand better how long things should take as compared to one or more people that just spent days or weeks thinking about the problem.

Unfortunately, you really need to understand and accept four laws of software development physics:

Good technologists rarely over-estimate. Most either want to please you by giving you an aggressive schedule, think their team is better than it actually is (they take their own personal estimates and extrapolate them to their entire team, one aspect of the Dunning-Kruger Effect), and/or don't think about whole solution delivery timings.

A qualified and professional team of people who spend time thinking about a solution will most likely know more about how long it will take to construct the solution than you do

Technical debt can be ratcheted up and quality down to deliver speed

Assuming the technology team is reasonably competent, dedicated, and professional, the only way to reduce the schedule is to alter some other project dimension

Here's the thing, technology teams can sometimes seemingly magically strip time out of a schedule. However, if you take the time to understand the trade-offs, it's not magic. Technologists can generally remove quality, stability, best practices and "this is how it should be done" dimensions from a project and deliver more quickly. However, by doing so they're increasing technical debt and/or business risk.

As the stakeholder, you'd be wise to track technical debt just like you track project budgets. Unless of course you plan to hand the technical debt over to the next management team and move on to another project. Say, aren't you clever getting that big bonus for an on-time delivery. Too bad the new administration didn't know about all that debt they're acquiring...

Servicing the debt. If you don't manage the product and technical debt, if you don't seek to understand the tradeoffs that are being made when you pressurize delivery schedules, things will start to move even slower than they did before. It will appear to you that your team is getting "worse" over time, their deliveries slower.

This won't initially make sense to you if you don't understand the trade-offs that have been made. The team's technical, domain, and cultural knowledge should be getting better and better so why is everything taking longer now to deliver? When the technologists try to explain why, it's very complicated. In fact, it sounds like the same old technobabble they were trying to use to con you into a heavily padded delivery schedule. Probably best to continue to ignore it like you've always done. Your approach has served you and your delivery schedule well so far. The team is probably just getting burned out and it might be time to switch suppliers! And a new job has opened up elsewhere you'd be perfect for, let someone else struggle with this declining, unmotivated delivery team...

Excuses, Excuses

Let's face it, some individuals and teams are really really bad at what they're supposed to be doing. Maybe, just maybe, you have been an attentive, detail oriented, engaged stakeholder, you do understand the tradeoffs that have been made, and the team you work with simply isn't delivering. One or several of the following may have happened.

First, maybe the team really is bad (as compared to other similar teams). How can that be?

Unaligned expectations - you really can't teach a pig to sing; you've hired someone that while good in their own right, can't possibly be good at what you've hired them to do
Burnout - people really have burned out (guess what, managing burnout is just another technical debt to manage)
Bad hiring - some people really are just incompetent and/or lazy; these types also tend to be good at lying as well
People change - they've just tuned out, perhaps due to personal issues; maybe they worked well in your business' previous culture and context a few years ago but don't in the current one
Bad alignment - maybe someone else in the business is poaching their time
Different priorities - they're delivering just enough not to get fired while they work on their own business or day trade
Poor management and leadership - people don't know what to do and/or aren't enabled to do it; their manager simply isn't managing, the organization isn't enabling them
Sowing seeds of discontent - a few people are really just negative, nasty, and unpleasant; they spread FUD to create discontent in the team and then take pleasure in the results

Second, I have to admit it, a lot of good technologists put architecture, future-proofing, tools, process efficiency, frameworks, scalability, extensibility, and maintainability ahead of actually delivering a single feature. They want to build a double-super-awesome application to dominate all other applications, and it takes a lot of architectural work to do that. The key here is a technology leader that pushes for incremental improvement and delivery along all dimensions (product and feature delivery first and foremost) and makes technical debt levels transparent to stakeholders.

Third, you really have been screwed by a delivery team or supplier. They're farming their alleged 100% allocated team out to three different customers like you. Your SLAs have no teeth. You've been sold using Ruby on Rails for development and even now a team of 30 people in several other countries are reading Ruby for Dummies and wondering why they've been hired to evaluate precious gems and lay railroad tracks but not write software. 50% of the time and budget are gone and you're too heavily invested to change.

Perhaps you really are in one or more of these situations. If you are, you're probably justified in making drastic changes. However, you have to ask yourself how things got that way, and what role you played in it. Because if you're the one that created the bad situation in the first place, are you really qualified to fix or replace it? Do yourself and your shareholders a favor and get some help.

What's the right way?

So what is the formula to speed up delivery? Assuming that you prioritize speed over cost (see The Trinity Extended for more on this):

Acquire a good technology leader. Find someone with credentials and references you trust, and then let them get on with it. If you made a bad choice, then really, that's your or maybe your boss's fault, isn't it?

Acquire good technologists. Unfortunately, they're rarely cheap, because they know what they're worth. Also unfortunately there are a bunch out there that will take advantage of you so you can also easily end up not getting what you paid for.

Domain knowledge, and even better, interest. Acquire technologists that understand what you want built. They should "get it" when you give a high level overview on what you want. Ideally, you'll find people that have an actual personal interest in what you want done. They want to build software that they would use themselves.

Delivered something similar. Acquire technologists that have a proven track record delivering something similar to what you want to do. It will certainly help with scheduling and anticipating risks.

Cultural alignment. Acquire people that live and breathe within your target market. It's much more likely they'll have an implicit understanding of your customers and what's required right at the start. Also hire people that have worked in companies like yours:

Start-ups vs big companies
New development versus support and extend existing, inherited, purchased products
New development versus integration and middleware
Internal versus external customer facing
b2b versus b2c
Dedicated resourcing (generally apolitical) versus programme "beg-borrow-steal" persuasion/shared model of resourcing (generally political)
Static (not much change, driving down costs) versus dynamic (fast moving market, environment; lots of change)
Reactive (operationally oriented; opportunity led) versus Proactive (project oriented; strategy led)

Knowingly manage technical debt. Sure, you can accelerate now and pay later. Commercial realities may dictate this behavior. Just make sure you knowingly stay aware of how much technical debt you're creating as you go. Set budget, strategy, constraints, horizon expectations clearly and have your tech team explain where debt is being accumulated and likely impacts that will result.

There are many other and similar views on how to deliver fast and well. The whole family of Agile, XP, Scrum, DSDM, FDD, Kanban and other similar methodologies all take views on this. They are all interesting and worthwhile to understand and use as appropriate. But to me most of them don't really emphasize enough the human component, the true difference a really excellent technology individual and/or team can make. Instead, if closely followed, they tend to treat people equally, reward mediocrity, and put process ahead of people. All technologists are definitely not created equal, and yes, they're not machines, they're people.

If you are unable to effectively make a judgment about the situation (good for you, at least you recognize this in yourself), bring in an external IT consultant you trust or with a good reputation to perform an IT audit. Have the consultant take a view on what is and isn't working. And whatever you do, don't bring in a consultant who is a stealth sales implant for a large professional services arm at the same company. Make it very clear that you would use someone else to do any follow-up remediation.

Conclusion

It is possible that you have a poor, slow speed delivery team. Perhaps you do need to significantly alter how you deliver with new people, new suppliers.

However, before you do anything radical, do your business and shareholders a favor. Sit back and think for a minute where the "slow" designation comes from and how objective it really is. If the only common denominator between the in-house and out-sourced speed assessment is you, perhaps your judgment of "slow" is flawed and/or you really aren't managing your technical debt. And if you're really not sure, get a project/IT audit done by a trusted resource to give you an outside view.

03 January 2010

Using wget and google to download and translate websites

There is a website for the neighborhood I live in that is all in Spanish (cuartonparque.com). So that's useful if your Spanish is good, which mine isn't. Google's translate function is great, but I wanted an archive of the site both in Spanish and English in case the site disappeared or was substantially altered.

wget is a great command line *nix utility to recursively download a website providing the links are statically constructed. I use wget on OS X (install xcode and macports to enable installation of wget if you don't have it).

For cuartonparque.com, the wget command is straightforward and well documented. The site uses simple static links and only has a few levels of linking. To download the site, I used:

wget -rpkv -e "robots=off" 'http://cuartonparque.com' 2>&1 | tee cuartonparque.com.wget.log

This command creates a cuartonparque.com directory with a browsable website.
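For anyone who finds the bundled short flags hard to read, here is a minimal sketch of the same download spelled out with wget's long options. The function name mirror_site and its two arguments are my own illustration and not part of the original command. It should behave the same as the one-liner above.

```shell
# mirror_site is a hypothetical wrapper name used only for this sketch.
# $1 = URL to mirror, $2 = log file to write.
mirror_site() {
    # --recursive        (-r) follow links within the site
    # --page-requisites  (-p) also fetch images, CSS and similar
    # --convert-links    (-k) rewrite links so the copy browses locally
    # --verbose          (-v) print progress details
    # --execute robots=off (-e) ignore the site's robots.txt
    wget --recursive --page-requisites --convert-links --verbose \
         --execute robots=off "$1" 2>&1 | tee "$2"
}

# Usage (this hits the network):
# mirror_site 'http://cuartonparque.com' cuartonparque.com.wget.log
```

The translated download further down adds -U (--user-agent) and --html-extension. As far as I know, wget 1.12 and later also accept the latter under the name --adjust-extension.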

Downloading a translate.google.com version of the site was trickier. Although various googled pages helped a bit, I couldn't find an example that actually worked. After some hacking about, I uncovered the required tricks to make this work:

Google appears to only process requests from browsers it's familiar with (use -U Mozilla)
Google uses frames and changes its domain name a bit as it translates (find out the final URL of interest by digging around in the page source)
Safari really likes a .html extension on files it opens (use --html-extension)
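The "digging around in the page source" step can be roughly sketched as a small filter that lists the frame URLs in a page. This is a sketch only: find_frame_urls is a made-up helper name, and it assumes the translate page exposes the translated content through frame or iframe tags with a double-quoted src attribute, which may well not match what Google serves today.

```shell
# find_frame_urls is a hypothetical helper: it reads HTML on stdin and
# prints the src value of every <frame> or <iframe> tag, one per line.
find_frame_urls() {
    # Pick out frame/iframe tags up to the end of their src attribute...
    grep -o -i '<i\{0,1\}frame[^>]*src="[^"]*"' |
        # ...keep only the attribute value, and undo &amp; escaping.
        sed -e 's/.*[sS][rR][cC]="\([^"]*\)"/\1/' -e 's/&amp;/\&/g'
}

# Usage (hits the network; the translate URL form is an assumption):
# wget -q -O - -U Mozilla \
#   'http://translate.google.com/translate?sl=es&tl=en&u=http://cuartonparque.com/' \
#   | find_frame_urls
```

You may need to repeat this on the frame URL it prints, since the frames can be nested.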

My pain is your gain. Here is the wget command that downloads the translated version of the website:

 wget -rpkv -e "robots=off" -U Mozilla --html-extension 'http://translate.googleusercontent.com/translate_c?hl=en&sl=es&tl=en&u=http://cuartonparque.com/&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ' 2>&1 | tee -a cuartonparque.com.En.wget.log

wget creates a translate.googleusercontent.com directory with a browsable website, localized from Spanish to English with a horrific URL for the index.html page:

 file:///Users/xyz/Downloads/Web%20Sites/cuartonparque.com.En.Google.Trans/translate.googleusercontent.com/translate_c%3Fhl=en&sl=es&tl=en&u=http:%252F%252Fcuartonparque.com%252F&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ.html
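If that URL is too painful to bookmark, one option is to drop a friendlier index.html symlink next to the start page. The sketch below is not something I ran as part of the original download. link_index is a hypothetical helper, and it assumes the saved pages sit in one directory with names beginning translate_c, with the start page identifiable by a fragment of its name.

```shell
# link_index is a hypothetical helper used only for this sketch.
# $1 = download directory, $2 = text that identifies the start page's filename.
link_index() {
    dir="$1"
    marker="$2"
    for f in "$dir"/translate_c*"$marker"*.html; do
        # If the glob matched nothing it stays literal, so skip it.
        [ -e "$f" ] || continue
        # Relative symlink, so the directory can be moved around intact.
        ln -sf "$(basename "$f")" "$dir/index.html"
        echo "index.html -> $(basename "$f")"
        return 0
    done
    echo "no matching page found in $dir" >&2
    return 1
}

# Usage, assuming the on-disk name contains the site root followed by '&':
# link_index translate.googleusercontent.com 'cuartonparque.com%2F&'
```

The file:// URL above is the percent-encoded form of the real filename, so on disk the name contains a literal ? and %2F. That is why the marker in the usage line is written with %2F and not %252F.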

A quick browse around on the downloaded version suggests everything came through, nicely translated to English with wholly-formed pages. Enjoy!

Why this Blog?

I write a lot. I write to get my head around a subject. I write about technology I figure out and use. I write "how we're going to do things" in email and documents to provide advice, guidance, policy and leadership for IT. At some point I realized that most of what I wrote was not proprietary and that I was repeating myself as new people joined the team and similar situations came up again. So while my postings are mostly just common sense, writing them does help me figure things out, gives me a stock set of thoughts and "how-to" for future reference, and maybe even someone else might find value in them as well.

Views expressed on this website are my own and may or may not reflect the views of my employer.

I reserve rights to all content appearing in this blog if I'm the one that wrote it.