wget is a great *nix command-line utility for recursively downloading a website, provided its links are statically constructed. I use wget on OS X (if you don't have it, install Xcode and MacPorts, then install wget through MacPorts).
For cuartonparque.com, the wget command is straightforward and well documented. The site uses simple static links and has only a few levels of linking. To download the site, I used:
wget -rpkv -e "robots=off" 'http://cuartonparque.com' 2>&1 | tee cuartonparque.com.wget.log
This command creates a cuartonparque.com directory with a browsable website.
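For reference, the bundled short flags -rpkv can be spelled out in long form. The sketch below wraps the same command in a small function. The names mirror_site and DRY_RUN are my own illustration, not part of wget. With DRY_RUN=1 it only prints the command, which is handy for checking the flags before hitting the network:

```shell
# mirror_site and DRY_RUN are illustrative names, not part of wget.
mirror_site() {
  url=$1
  # Long-form equivalents of -r -p -k -v, plus the robots override.
  set -- wget --recursive --page-requisites --convert-links --verbose \
    -e robots=off "$url"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"   # print the command instead of running it
  else
    "$@"                 # run wget with the arguments built above
  fi
}
```

Calling DRY_RUN=1 mirror_site 'http://cuartonparque.com' shows the expanded command. Without DRY_RUN it runs wget, and the output can be piped through 2>&1 | tee as above to keep a log.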
Downloading a translate.google.com version of the site was trickier. Although various googled pages helped a bit, I couldn't find an example that actually worked. After some hacking about, I uncovered the tricks required to make this work:
- Google appears to process requests only from browsers it's familiar with (use -U Mozilla)
- Google uses frames and changes its domain name a bit as it translates (find the final URL of interest by digging around in the page source)
- Safari really likes a .html extension on files it opens (use --html-extension)
wget -rpkv -e "robots=off" -U Mozilla --html-extension 'http://translate.googleusercontent.com/translate_c?hl=en&sl=es&tl=en&u=http://cuartonparque.com/&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ' 2>&1 | tee -a cuartonparque.com.En.wget.log
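The digging around in the page source for the framed URL can be partly automated. This is only a rough sketch: list_frame_urls is a hypothetical helper, and it assumes the frame tags carry a double-quoted lowercase src attribute, which may not match what Google actually serves. It reads HTML on stdin and prints each frame's src, with &amp; decoded back to &:

```shell
# list_frame_urls is a hypothetical helper, not a wget feature.
# Assumes <frame ...> or <iframe ...> tags with a double-quoted src attribute.
list_frame_urls() {
  grep -Eo '<i?frame[^>]*src="[^"]*"' |   # isolate each frame tag up to its src
    sed -E 's/.*src="([^"]*)"/\1/' |      # keep only the src value
    sed 's/&amp;/\&/g'                    # undo HTML escaping of ampersands
}
```

For example, after saving the translated page's source to a file (saved_page.html is a made-up name), list_frame_urls < saved_page.html lists the candidate URLs to hand to wget.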
wget creates a translate.googleusercontent.com directory with a browsable website, translated from Spanish to English, but with a horrific URL for the index.html page:
file:///Users/xyz/Downloads/Web%20Sites/cuartonparque.com.En.Google.Trans/translate.googleusercontent.com/translate_c%3Fhl=en&sl=es&tl=en&u=http:%252F%252Fcuartonparque.com%252F&rurl=translate.google.com&twu=1&usg=ALkJrhjabXZlzJpBCZeWpsmLaKss09lCuQ.html
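One possible way to tame that URL is to symlink the start page to a friendlier name inside the same directory, so the relative links wget rewrote should still resolve. I haven't tried this against the download above. link_index is a hypothetical helper, and it assumes the on-disk file name is the URL-decoded form of the path shown (a literal ? and %2F):

```shell
# link_index is a hypothetical helper: it symlinks the first translate_c page
# whose name contains the given substring to index.html in the same directory.
link_index() {
  dir=$1 needle=$2
  for f in "$dir"/translate_c*"$needle"*.html; do
    [ -e "$f" ] || continue                      # glob matched nothing
    ln -sf "$(basename "$f")" "$dir/index.html"  # relative link, same directory
    return 0
  done
  echo "no page matching $needle in $dir" >&2
  return 1
}
```

Something like link_index translate.googleusercontent.com 'u=http:%2F%2Fcuartonparque.com%2F&' would then let the browser open translate.googleusercontent.com/index.html instead.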
A quick browse around the downloaded version suggests everything came through, nicely translated to English with fully formed pages. Enjoy!