Thursday, December 16, 2004

Unix - wget

I am using wget to make archives of my blogs. I have a Perl script that calls wget; crontab runs this script roughly once a week to archive the blogs. I keep the last 4 archives and delete the fifth (oldest) one.
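The rotation itself is nothing fancy; a rough sketch of the idea in Perl (the archive directory and blog name below are made-up examples, not the real ones) would look something like this:

use strict;
use warnings;
use File::Path;                                     # for rmtree

my $archive_dir = '/home/username/blog_archives';   # example path
my $blog        = 'myblog';                         # example blog name
my $keep        = 4;                                # keep the last 4 archives

# delete the oldest copy, then shift 3 -> 4, 2 -> 3, 1 -> 2
rmtree("$archive_dir/$blog.$keep") if -d "$archive_dir/$blog.$keep";
for my $i (reverse 1 .. $keep - 1) {
    rename "$archive_dir/$blog.$i", "$archive_dir/$blog." . ($i + 1)
        if -d "$archive_dir/$blog.$i";
}
# the fresh wget run then downloads into $archive_dir/$blog.1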

I am using wget 1.8 on a Unix machine. This is the command I use:

/usr/local/bin/wget -a ${blog}.log -Dphotos1.blogger.com,blogspot.com -r -l1 -p -H -k -E http://$blogs{$blog}

$blog is the variable holding the name of the individual blog, and %blogs is the hash that maps each blog name to the URL of that blog.
-a is used to append wget's messages to the logfile ${blog}.log
-D restricts host spanning to the listed domains (here photos1.blogger.com and blogspot.com)
-r is to retrieve recursively
-l1 is the level of recursion (one level deep)
-p is to download all page requisites (images, stylesheets, etc.)
-H is to enable spanning (jumping) to other hosts/domains
-k is to convert the links to point to the local copies
-E is to save HTML pages with an .html extension (so *.cgi kind of file names end up with .html at the end)
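On the Perl side it is basically a loop over the hash that builds this command for each blog and runs it. A rough sketch (the hash entries here are just examples, not my real blogs):

use strict;
use warnings;

my %blogs = (
    myblog  => 'myblog.blogspot.com',       # example entries only
    another => 'another.blogspot.com',
);

my $wget = '/usr/local/bin/wget';           # absolute path, see the perl note below

for my $blog (keys %blogs) {
    my $cmd = "$wget -a ${blog}.log "
            . "-Dphotos1.blogger.com,blogspot.com "
            . "-r -l1 -p -H -k -E http://$blogs{$blog}";
    system($cmd) == 0
        or warn "wget failed for $blog: exit status $?\n";
}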

Some of my pictures live somewhere on photos1.blogger.com, so I wanted to include that domain together with blogspot.com, where my blog is hosted, so the pictures would be downloaded as well. But I couldn't manage it: whenever I changed the arguments so that the pictures were fetched, wget got nothing other than index.html from the blog itself. In the end I just gave up... Will look into it later.

perl:

In the Perl code, use absolute paths for the directories and for wget itself. Otherwise it causes problems with crontab, because cron runs the job with a minimal environment (a different PATH and working directory than your login shell).
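Concretely, near the top of the script I mean something like this (the paths here are just placeholders):

my $wget        = '/usr/local/bin/wget';            # not just "wget"
my $archive_dir = '/home/username/blog_archives';   # not a relative path
chdir $archive_dir
    or die "cannot chdir to $archive_dir: $!\n";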

crontab:

0 3 1,8,15,22 * * abs_position/arc_blogs.pl >/dev/null 2>/dev/null

The Perl script is called at 3am on the 1st, 8th, 15th, and 22nd day of every month.
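Reading the crontab line field by field:

0          - minute (on the hour)
3          - hour (3am)
1,8,15,22  - days of the month
*          - every month
*          - any day of the week

The rest of the line is the command: the script itself, with both stdout and stderr sent to /dev/null.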
