03 January 2010

Using wget and Google to download and translate websites

There is a website for the neighborhood I live in that is all in Spanish (cuartonparque.com).  So that's useful if your Spanish is good, which mine isn't.  Google's translate function is great, but I wanted an archive of the site both in Spanish and English in case the site disappeared or was substantially altered.

wget is a great command-line *nix utility for recursively downloading a website, provided the links are statically constructed.  I use wget on OS X (install Xcode and MacPorts to enable installation of wget if you don't have it).
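A quick way to see whether wget is already available is sketched below.  The function name check_wget is just an illustration, and the install hint assumes MacPorts is already set up:

```shell
#!/bin/sh
# Hypothetical helper: report whether wget is on the PATH and,
# if it is not, print the MacPorts command that installs it.
check_wget() {
    if command -v wget >/dev/null 2>&1; then
        echo "wget found: $(command -v wget)"
    else
        echo "wget not found; with MacPorts installed, try: sudo port install wget"
        return 1
    fi
}

check_wget || true
```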

For cuartonparque.com, the wget command is straightforward and well documented.  The site uses simple static links and only has a few levels of linking. To download the site, I used:

wget -rpkv -e "robots=off" 'http://cuartonparque.com' 2>&1 | tee cuartonparque.com.wget.log

This command creates a cuartonparque.com directory with a browsable website.
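For anyone who finds the bundled short flags cryptic, here is the same download spelled out with wget's long option names.  It is only a sketch: the function name mirror_site and its two arguments are made up for illustration, and the function should behave the same as the one-liner above.

```shell
#!/bin/sh
# Hypothetical wrapper: mirror_site URL LOGFILE
# Long-option spelling of: wget -rpkv -e "robots=off" URL
mirror_site() {
    url=$1   # site to download
    log=$2   # log file to write
    # --recursive          (-r) follow links and download what they point to
    # --page-requisites    (-p) also fetch images, CSS, etc. needed to render pages
    # --convert-links      (-k) rewrite links so the local copy is browsable
    # --verbose            (-v) detailed progress output
    # --execute robots=off (-e) ignore robots.txt exclusions
    wget --recursive --page-requisites --convert-links --verbose \
         --execute robots=off "$url" 2>&1 | tee "$log"
}

# Usage:
#   mirror_site 'http://cuartonparque.com' cuartonparque.com.wget.log
```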

Downloading a translate.google.com version of the site was trickier.  Although various googled pages helped a bit, I couldn't find an example that actually worked.  After some hacking about, I uncovered the required tricks to make this work:
  • Google appears to only process requests from browsers it's familiar with (use -U Mozilla)
  • Google uses frames and changes its domain name a bit as it translates (find out the final URL of interest by digging around in the page source)
  • Safari really likes a .html extension on files it opens (use --html-extension)
My pain is your gain.  Here is the wget command that downloads the translated version of the website:

wget -rpkv -e "robots=off" -U Mozilla --html-extension 'http://translate.googleusercontent.com/translate_c?hl=en&sl=es&tl=en&u=http://cuartonparque.com/&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ' 2>&1 | tee -a cuartonparque.com.En.wget.log

wget creates a translate.googleusercontent.com directory with a browsable website, localized from Spanish to English with a horrific URL for the index.html page:

file:///Users/xyz/Downloads/Web%20Sites/cuartonparque.com.En.Google.Trans/translate.googleusercontent.com/translate_c%3Fhl=en&sl=es&tl=en&u=http:%252F%252Fcuartonparque.com%252F&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ.html
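One way to avoid typing that URL is to drop a friendlier link next to the start page.  This is a sketch under assumptions: link_start_page and start.html are names I made up, and the pattern assumes wget saved the page with the "/" characters of the query string written as "%2F", which is what the file:// URL above suggests.

```shell
#!/bin/sh
# Hypothetical helper: link_start_page DIR SITE_HOST
# Creates DIR/start.html as a symlink to the translated start page.
link_start_page() {
    dir=$1    # e.g. translate.googleusercontent.com
    host=$2   # e.g. cuartonparque.com
    # Match only the page whose u= parameter is the site root
    # (assumed file-name layout, inferred from the file:// URL above).
    for f in "$dir"/translate_c*"u=http:%2F%2F${host}%2F&"*.html; do
        [ -e "$f" ] || continue
        ln -sf "$(basename "$f")" "$dir/start.html"
        echo "start.html -> $(basename "$f")"
        return 0
    done
    echo "no translated start page found in $dir" >&2
    return 1
}

# Usage:
#   link_start_page translate.googleusercontent.com cuartonparque.com
```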

A quick browse around the downloaded version suggests everything came through, nicely translated to English with fully formed pages.  Enjoy!
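Since both commands tee their output into a log, the log can back up the eyeball check.  The sketch below assumes wget's usual verbose wording, "saved [n/n]" for each stored file and "ERROR nnn" for failed HTTP requests; the function name summarize_wget_log is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: summarize_wget_log LOGFILE
# Counts saved files and HTTP errors recorded in a wget log.
summarize_wget_log() {
    log=$1
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the count without signalling failure.
    saved=$(grep -c "saved \[" "$log" || true)
    errors=$(grep -c "ERROR [0-9][0-9][0-9]" "$log" || true)
    echo "saved=$saved errors=$errors"
}

# Usage:
#   summarize_wget_log cuartonparque.com.En.wget.log
```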

02 January 2010

DBA Evolution and Specialization

Introduction.  DBAs are an expensive resource.  Part of the art of IT management is to manage IT costs against revenue.  Here are the growth phases of businesses I've seen and how the DBA function evolves and specializes against changing revenues, risk tolerance, and work load.

Phase 1 - Startup, limited cash, keep the burn rate down.  When a business first starts out and can afford only a few technical resources, chances are you'll have an all-rounder technologist also acting as a DBA.  As a little more money becomes available or a risk factor hits (typically a crippling systems, security, or change management failure), an all-rounder systems admin may be hired who picks up some of the DBA responsibilities, typically at the edges of the database (e.g., backups, storage).

Phase 2A - Revenues increasing, risk tolerance slightly decreasing.  Revenues and risk management warrant hiring a full-time dedicated DBA.  The DBA starts reviewing developer-driven database changes and eventually the software developers get shut out of production databases to better control change.  The DBA takes over backups from the all-rounder systems admin.  Confidence in DB recovery goes up.  The DBA takes over ownership of the fundamental data model and supplements the report generation otherwise being performed by the software developers.  The DBA can't ever go on vacation as they are on the support escalation call-out and are business critical.  Your one DBA is an all-rounder and supports all databases throughout the business - some well, some not so well.

Phase 2B - Marginally net profitable, and risk tolerance decreasing to protect profits.  The DBA is becoming overloaded and would really like a stress-free vacation.  You've probably increased your number of schemas and instances of your main database and a few other database packages have appeared.  DBAs are expensive, so rather than hiring DBAs to cover a 24x7 rota, you hire maybe one more DBA to deal with overall growth of requests.  You have to outsource database services to provide escalated issue support and to support non-primary database packages.  Your 1-2 DBAs are all-rounders with respect to the database packages you support.

Phase 3 - Business successful, profits increasing and scalability is challenging.  Now you have to take the plunge and split up your all-rounder DBAs' responsibilities and ownership.  Generally, there are three areas of DBA specialization:

  1. Specific product(s) orientation, business-facing.  These DBAs understand the data and business model by product (a product being composed of one or more inter-related applications and schemas).  Products may be in-house developed or from third parties but either way require IT depth of expertise to meet business objectives.  These DBAs understand how data ties together and the business logic as it relates to data manipulation.  They communicate well and are more business than technically oriented.  They can produce reports against the data model, and these reports are consistent with each other without super-detailed report requirements definition, thanks to the DBA's understanding of the business and its product data model and business logic.
  2. Specific product(s) orientation, developer-facing.  These DBAs are more technical.  They work closely with the software developers of in-house developed products.  They are application developers in that they write code within the database platform itself, such as stored procedures.  They take developer database code, DML, and DDL and review, clean up, and optimize it before it enters production.  They manage change of the application within the database and write the scripts to migrate/upgrade the application's DB-based data structures.  They troubleshoot application problems that span the database and an application layer.  Depending on the systems architecture, they may also have a passing familiarity with application persistence engines like Java's Hibernate to be able to troubleshoot issues between Oracle and the application persistence layer.
  3. Specific DB platform(s), operationally oriented, systems-facing.  These DBAs are even more technical.  They work closely with the systems administration team at the "edges" of the database platform.  These DBAs aren't so much concerned with the application or data structures within the database, as they think of the database as a complex "black box".  They are responsible for backups, performance tuning of the DB platform as a whole, OS touch points (e.g., shared memory), storage (e.g., filesystem type, performance, allocation and placement), clustering for fault tolerance and/or scaling, change control around the database (e.g., config files), root cause analysis of DB faults (e.g., Oracle's ORA errors), upgrade and patching of the DB software, and maintaining the DB operational runbook.  They are skilled at troubleshooting the database as a whole, in and around its operating context and in the space between clients and the DB itself.

To further scale the DB function, each of the three above can be split out by product/application and by DB platform.

Hopefully one of those all-rounder DBAs you have from Phase 2 has team management aspirations and can step into an overall DBA leadership and management role as you'll need roadmap and backlog management for all the DB platforms and products/apps you have in Phase 3.

Phase 4 - No idea!  The above 3 phases should support an IT team size in my industry into the upper 100s.  I look forward to the opportunity to see and maybe help shape something bigger in the future!

Conclusion and Recommendation.  Use the above phases to figure out where your organization is and manage your DBA capacity against revenue, budget, and risk tolerance accordingly.  DBAs are expensive, but also one of the most business critical functions IT provides.