MyFamily Archive

Last Updated: 16-May-2008 08:08:57
         
       $Log: how.php $
       Revision 1.1  2008/05/16 13:08:57  tc
       Initial revision


        

This is a short description of how I tackled archiving a MyFamily.com site.

The basic approach is fairly simple:

  • grab all the pages and images at the original site,
  • store these files to some archive, and then
  • for each static page in the archive, replace every reference to the original site with the corresponding reference to the archive site.
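As an illustration only, the third step can be sketched as below. This is not the conversion program itself: it assumes the mapping is already held in a hash (the real mapping lives in the database), and it rewrites double-quoted href/src attributes with a regular expression rather than an HTML parser. The names %hMap and SubstituteUris and the example.com addresses are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical mapping from original uri to archive name (illustration only).
my %hMap =
 ('http://www.example.com/isapi.dll?c=site&htx=main' => '/page/0001.html',
  'http://www.example.com/images/logo.gif'           => '/images/logo.gif');

# Replace the value of every double-quoted href/src attribute that has a mapping;
# attributes without a mapping are left untouched in this sketch.
sub SubstituteUris
   {my ($sHtml, $pMap) = @_;

    $sHtml =~ s{\b(href|src)="([^"]*)"}
               {my $sNew = exists ($pMap->{$2}) ? $pMap->{$2} : $2;
                $1 . '="' . $sNew . '"'}gei;
    return $sHtml;}

my $sPage = '<a href="http://www.example.com/isapi.dll?c=site&htx=main">'
          . '<img src="http://www.example.com/images/logo.gif"></a>';

print SubstituteUris ($sPage, \%hMap), "\n";
```

A regular expression is enough to show the idea, but it would miss single-quoted or unquoted attributes, which is one reason a real HTML parser is the safer choice.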

The archive itself is an almost flat store of static files. The mapping from URIs on the original MyFamily site to names in the archive is maintained in a database:

CREATE DATABASE `MyFamily`;
USE `MyFamily`;

CREATE TABLE `Uris`
(`id`				INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
 `sUri`				VARCHAR(767) NOT NULL UNIQUE KEY,
 `sAttribute`			VARCHAR(255) NOT NULL,
 `sTag`				VARCHAR(255) NOT NULL,
 `sContext`			VARCHAR(255) NOT NULL,
 `idUriSubstitute`		INT UNSIGNED NOT NULL);
CREATE TABLE `UriSubstitutes`
(`id`				INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
 `sUri`				VARCHAR(767) NOT NULL UNIQUE KEY,
 `vCollected`			BOOLEAN NOT NULL,
 `vConverted`			BOOLEAN NOT NULL,
 `sMD5Original`			CHAR(22) DEFAULT NULL,
 `iLengthOriginal`		INT UNSIGNED DEFAULT NULL,
 `sMD5Converted`		CHAR(22) DEFAULT NULL,
 `iLengthConverted`		INT UNSIGNED DEFAULT NULL);
   

The UriSubstitutes table is seeded with two special substitutes:

  • "" which is used to indicate an unassigned mapping, and
  • "-" which is used to indicate that no mapping exists.

INSERT INTO `UriSubstitutes` (`id`, `sUri`, `vCollected`, `vConverted`)
 VALUES (1, '', 0, 0),
	(2, '-', 0, 0);
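
A program working with these tables needs to tell the two seeded rows apart from ordinary substitutes. The sketch below shows one possible way to do that, assuming the ids 1 and 2 from the INSERT above. The constant names and the SubstituteState function are hypothetical and are not taken from the actual programs.

```perl
use strict;
use warnings;

# ids of the two seeded rows in `UriSubstitutes` (from the INSERT above)
use constant ID_UNASSIGNED => 1;    # sUri = ''
use constant ID_NO_MAPPING => 2;    # sUri = '-'

# Classify a Uris.idUriSubstitute value (hypothetical helper).
sub SubstituteState
   {my ($idUriSubstitute) = @_;

    return 'unassigned' if ($idUriSubstitute == ID_UNASSIGNED);
    return 'no mapping' if ($idUriSubstitute == ID_NO_MAPPING);
    return 'mapped';}

foreach my $id (1, 2, 17)
   {print "$id: ", SubstituteState ($id), "\n";}
```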
   

The Uris table is seeded with the front page of the MyFamily site, mapped to the unassigned substitute.

INSERT INTO `Uris` (`sUri`, `sAttribute`, `sTag`, `sContext`, `idUriSubstitute`)
 VALUES ('http://www.myfamily.com/isapi.dll?c=site&htx=main&MemberID=______&SiteID=QAR',
	 'href', 'a', 'Login Page', 1);
   

The database is populated and maintained by a number of Perl programs including:

  • Step00.0_AssignSubstitutes.pl
    For each currently unassigned uri, a name is created for the file archive. Names for images, stylesheets and scripts are derived from the original uri. All other files are simply given modified sequence numbers as names, for instance, "/page/0123.html".
  • Step01.0_Synchronize.pl
    This procedure keeps the file archive in sync with the database; for any substitute that does not exist in the file archive, this program fetches that page from the original MyFamily site and places it in the archive.
  • Step02.0_CullDuplicates.pl
    This culls duplicates in the file archive based on matching MD5 signatures.
  • Step02.1_CullDuplicates.pl
    This culls duplicates in the file archive based on matching HTML characteristics.
  • Step03.0_CollectUris.pl
    Files in the archive are HTML-parsed; any uri encountered that does not yet exist in the database is added and mapped to the unassigned substitute.
  • Step10.0_ConvertFiles.pl
    Each file in the archive is HTML-parsed; every uri that has a substitute is replaced by it. If no mapping exists, the link is disabled.
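
The naming rule of the first step might look roughly like the sketch below. It is an assumption-laden illustration, not the code of Step00.0_AssignSubstitutes.pl: the list of file extensions, the zero-padded four-digit numbering and the names AssignName and $iSequence are all hypothetical, and how the sequence numbers are actually "modified" is not shown here.

```perl
use strict;
use warnings;

my $iSequence = 0;      # running page number (assumed scheme)

# Return an archive name for an original uri (hypothetical helper).
sub AssignName
   {my ($sUri) = @_;

    # Images, stylesheets and scripts keep a name derived from the uri path.
    if ($sUri =~ m{^https?://[^/]+(/[^?#]*\.(?:gif|jpe?g|png|css|js))(?:[?#]|$)}i)
       {return $1;}

    # Everything else gets a zero-padded sequence number.
    return sprintf ("/page/%04d.html", ++$iSequence);}

print AssignName ('http://www.example.com/images/logo.gif'), "\n";
print AssignName ('http://www.example.com/isapi.dll?c=site&htx=main'), "\n";
```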

The archive is built by doing a number of iterations of (Assign, Synchronize, Cull x2, Collect) and then finally Convert.
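
The MD5-based culling in that loop can be sketched as follows. This is only an illustration under stated assumptions: the file contents are held in a hash instead of being read from the archive, and the FindDuplicates function is hypothetical. Digest::MD5's md5_base64 is used because it returns a 22-character signature, which fits the CHAR(22) columns in the schema; whether the actual programs use it is an assumption.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

# Takes a hash ref of name => content and returns a hash ref of
# duplicate name => name of the first file seen with the same MD5 signature.
sub FindDuplicates
   {my ($pFiles) = @_;
    my %hFirst;
    my %hDuplicate;

    foreach my $sName (sort (keys (%{$pFiles})))
       {my $sMD5 = md5_base64 ($pFiles->{$sName});    # 22 characters

        if (exists ($hFirst {$sMD5}))
           {$hDuplicate {$sName} = $hFirst {$sMD5};}
        else
           {$hFirst {$sMD5} = $sName;};}

    return \%hDuplicate;}

my $pDuplicates = FindDuplicates
 ({'/page/0001.html' => '<html>same</html>',
   '/page/0002.html' => '<html>same</html>',
   '/page/0003.html' => '<html>other</html>'});

foreach my $sName (sort (keys (%{$pDuplicates})))
   {print "$sName duplicates $pDuplicates->{$sName}\n";}
```

Matching signatures only catch byte-identical files, which is presumably why a second culling step based on HTML characteristics is also needed.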

In the course of this project, a number of problems were encountered. While my own mistakes contributed their fair share of time consumption, by far the largest time sinks were associated with the MyFamily site itself. Of note were these:

  • large numbers of duplicate and almost-duplicate pages
    some query parameters made little or no difference to the fetched page; of the more than 2000 pages fetched, only about 400 turned out to be unique,
  • malformed HTML
    by far the worst problems encountered were overlapping structures. For instance, something like this:
    <form>
     <table>
    </form>
     </table>
    	
    where a form overlaps a table. The solution adopted was to preprocess the HTML into a properly nested HTML stream, which could then be fed to the Perl HTML parser (HTML::TreeBuilder).
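
One way such a preprocessing pass could work is a simple tag stack, sketched below. This is not the preprocessor that was actually used, whose details are not given here. The Rebalance function is hypothetical, and void elements such as <br>, as well as script content, are not handled.

```perl
use strict;
use warnings;

# Rebalance overlapping tags with a stack (hypothetical helper):
#  - a close tag that matches an open tag deeper in the stack first closes
#    everything that was opened after it,
#  - a close tag with no matching open tag is dropped.
sub Rebalance
   {my ($sHtml) = @_;
    my @aOpen;
    my $sOut = "";

    foreach my $sToken (split (/(<[^>]*>)/, $sHtml))
       {if ($sToken =~ m{^</\s*(\w+)})
           {my $sTag = lc ($1);

            if (grep {$_ eq $sTag} @aOpen)
               {my $sTop;
                do
                   {$sTop = pop (@aOpen);
                    $sOut .= "</$sTop>";}
                until ($sTop eq $sTag);}}
        elsif ($sToken =~ m{^<\s*(\w+)[^>]*(?<!/)>$})
           {push (@aOpen, lc ($1));
            $sOut .= $sToken;}
        else
           {$sOut .= $sToken;}}

    # Close anything still open at the end of the input.
    while (@aOpen)
       {$sOut .= "</" . pop (@aOpen) . ">";}

    return $sOut;}

print Rebalance ("<form><table></form></table>"), "\n";
```

For the overlapping form and table shown above, the table is closed before the form and the stray closing table tag is dropped.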

Here is one program that demonstrates usage of the database.

#!/usr/bin/perl -w

# $Id: how.php 1.1 2008/05/16 13:08:57 tc Exp tc $

use strict;
use lib "lib";
use DBI;
use Common;
use tc;

# --------------------------------------------- main ---------------------------------------------
my $hDB;
my $pRows;
my $pFields;
my $id;
my $sUri;
my $sAttribute;
my $sTag;
my $sContext;
my $idUriSubstitute;
my $sUriSubstitute;
my $vCollected;
my $vConverted;
my $sMD5Original;
my $iLengthOriginal;
my $sMD5Converted;
my $iLengthConverted;
my $i;

# Errors are reported explicitly below, so DBI's own error and warning output is switched off.
$hDB = DBI->connect ("dbi:mysql:MyFamily", "localuser", "", {PrintError => 0, PrintWarn => 0})
 || die ("Failed to open MyFamily Database - $DBI::errstr");
$hDB->begin_work () || die ("Begin_Work failed - " . $DBI::errstr);

# Fetch every uri together with the substitute it is mapped to.
$pRows = $hDB->selectall_arrayref
 (  'SELECT `Uris`.`id`, `Uris`.`sUri`, `sAttribute`, `sTag`, `sContext`, `idUriSubstitute`, '
  .	   '`UriSubstitutes`.`sUri`, `vCollected`, `vConverted`, `sMD5Original`, `iLengthOriginal`, '
  .	   '`sMD5Converted`, `iLengthConverted` '
  .  'FROM `UriSubstitutes` INNER JOIN `Uris` '
  .   'ON (`idUriSubstitute` = `UriSubstitutes`.`id`)')
  || die ("Collection Select failed - $DBI::errstr");

foreach $pFields (@{$pRows})
   {# Unpack the row into named variables; this documents the column order.
    ($id, $sUri, $sAttribute, $sTag, $sContext, $idUriSubstitute, $sUriSubstitute, $vCollected, $vConverted,
     $sMD5Original, $iLengthOriginal, $sMD5Converted, $iLengthConverted) = @{$pFields};

    # Show SQL NULLs (undef in Perl) as the text "NULL".
    for ($i = 0; $i < scalar (@{$pFields}); $i++)
       {if (!defined (${$pFields} [$i]))
	   {${$pFields} [$i] = "NULL";};};

    print "\n", join (" - ", @{$pFields});};

print "\n\nTotal records: " . scalar (@{$pRows}) . "\n";

$hDB->commit () || die ("Commit failed - " . $DBI::errstr);
$hDB->disconnect || die ("Disconnect failed - " . $DBI::errstr);
   
Contents copyright © 1999-2017  Terrance R. Cassidy, Merrimack, New Hampshire, USA - all rights reserved.
