03 January 2010

Using wget and Google to download and translate websites

There is a website for the neighborhood I live in that is all in Spanish (cuartonparque.com).  So that's useful if your Spanish is good, which mine isn't.  Google's translate function is great, but I wanted an archive of the site both in Spanish and English in case the site disappeared or was substantially altered.

wget is a great command-line *nix utility for recursively downloading a website, provided the links are statically constructed.  I use wget on OS X (if you don't have it, install Xcode and MacPorts, then install wget through MacPorts).
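Before going further, it may help to confirm that wget is actually on your PATH. Here is a minimal sketch; require_tool is a hypothetical helper name made up for this example, and the install hint assumes MacPorts is already set up:

```shell
#!/bin/sh
# Report whether a command-line tool is on the PATH.
# require_tool is a hypothetical helper, not part of wget or MacPorts.
require_tool() {
    tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool found at $(command -v "$tool")"
        return 0
    fi
    # Assumption: MacPorts is installed, so "port" is available.
    echo "$tool not found; with MacPorts installed, try: sudo port install $tool" >&2
    return 1
}

# Usage:
#   require_tool wget
```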

For cuartonparque.com, the wget command is straightforward and well documented.  The site uses simple static links and only has a few levels of linking. To download the site, I used:

wget -rpkv -e "robots=off" 'http://cuartonparque.com' 2>&1 | tee cuartonparque.com.wget.log

This command creates a cuartonparque.com directory with a browsable website.
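If the bundled short flags are hard to read, the same call can be spelled with wget's long option names. The sketch below wraps it in a function; mirror_site and the WGET override variable are hypothetical names introduced here, not anything wget itself defines:

```shell
#!/bin/sh
# Long-option spelling of: wget -rpkv -e "robots=off" URL 2>&1 | tee LOG
# mirror_site and $WGET are hypothetical names used for illustration only.
mirror_site() {
    site_url=$1
    log_file=$2
    # --recursive        (-r) follow links and download what they point to
    # --page-requisites  (-p) also fetch images, CSS, etc. needed to render pages
    # --convert-links    (-k) rewrite links so the local copy is browsable
    # --verbose          (-v) print detailed progress
    # -e robots=off      ignore robots.txt exclusions
    "${WGET:-wget}" \
        --recursive \
        --page-requisites \
        --convert-links \
        --verbose \
        -e robots=off \
        "$site_url" 2>&1 | tee "$log_file"
}

# Usage:
#   mirror_site 'http://cuartonparque.com' cuartonparque.com.wget.log
```

wget writes its progress messages to stderr, so the 2>&1 is what lets tee capture them in the log while still showing them on screen.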

Downloading a translate.google.com version of the site was trickier.  Although various googled pages helped a bit, I couldn't find an example that actually worked.  After some hacking about, I uncovered the tricks required to make this work:
  • Google appears to process requests only from browsers it's familiar with (use -U Mozilla)
  • Google uses frames and changes its domain name a bit as it translates (find the final URL of interest by digging around in the page source)
  • Safari really likes a .html extension on files it opens (use --html-extension; wget 1.12 and later call this option --adjust-extension, though the old name is still accepted)
My pain is your gain.  Here is the wget command that downloads the translated version of the website:

wget -rpkv -e "robots=off" -U Mozilla --html-extension 'http://translate.googleusercontent.com/translate_c?hl=en&sl=es&tl=en&u=http://cuartonparque.com/&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ' 2>&1 | tee -a cuartonparque.com.En.wget.log
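The long URL in that command is easier to follow when it is assembled from its parts. This is only a sketch: translate_url is a hypothetical helper, the parameter meanings are read off the URL above rather than from any Google documentation, and the usg value is an opaque token that has to be copied from the page source (it is not computed here). I am also assuming hl simply matches the target language, as it does in the URL above, and there is no guarantee Google still accepts this URL form.

```shell
#!/bin/sh
# Assemble the translate.googleusercontent.com URL from its pieces.
# translate_url is a hypothetical helper name used for illustration only.
#   $1 source language (sl), e.g. es
#   $2 target language (tl), also used for hl (an assumption)
#   $3 URL of the site to translate (u)
#   $4 usg token copied from the translated page's source
translate_url() {
    src_lang=$1
    dst_lang=$2
    site_url=$3
    usg_token=$4
    printf 'http://translate.googleusercontent.com/translate_c?hl=%s&sl=%s&tl=%s&u=%s&rurl=translate.google.com&twu=1&usg=%s\n' \
        "$dst_lang" "$src_lang" "$dst_lang" "$site_url" "$usg_token"
}

# Usage (the single quotes keep the shell from interpreting & and ?):
#   url=$(translate_url es en 'http://cuartonparque.com/' 'ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ')
#   wget -rpkv -e "robots=off" -U Mozilla --html-extension "$url"
```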

wget creates a translate.googleusercontent.com directory with a browsable website, translated from Spanish to English, with a horrific URL for the index.html page:

file:///Users/xyz/Downloads/Web%20Sites/cuartonparque.com.En.Google.Trans/translate.googleusercontent.com/translate_c%3Fhl=en&sl=es&tl=en&u=http:%252F%252Fcuartonparque.com%252F&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ.html
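One way to avoid typing that URL is to drop a short symbolic link next to the long-named file. The sketch below is an optional convenience, not something wget does for you; make_start_alias and the default alias name index.html are hypothetical choices. Because the link sits in the same directory as the real page, the relative links that wget rewrote should still resolve.

```shell
#!/bin/sh
# Create a short alias beside the long-named start page saved by wget.
# make_start_alias is a hypothetical helper name used for illustration only.
#   $1 path to the saved start page
#   $2 alias file name (optional, defaults to index.html)
make_start_alias() {
    page=$1
    alias_name=${2:-index.html}
    page_dir=$(dirname "$page")
    # Link to the bare file name so the alias keeps working
    # if the whole directory is moved somewhere else.
    ln -s "$(basename "$page")" "$page_dir/$alias_name"
}

# Usage (quote the path; it contains ?, & and % characters):
#   make_start_alias 'translate.googleusercontent.com/translate_c?hl=en&sl=es&tl=en&u=http:%2F%2Fcuartonparque.com%2F&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ.html'
```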

A quick browse around the downloaded version suggests everything came through, nicely translated to English with fully formed pages.  Enjoy!
