ETL into WordPress: lessons learned

I had a chance this weekend to do a little work on importing a large site (4,000 or so articles and pages) into WordPress. It was an interesting bit of work, with a certain amount of learning required on my part – which translated into some flailing around to establish the toolset.
Lesson 1: ALWAYS use a database in preference to anything else when you can.

I wasted a couple of hours trying to clean up the data for CSV import using any of a number of WordPress plugins. Unfortunately, CSV import is half-assed at best – more like quarter-assed – and any cleanup in Excel is excruciatingly slow. Some of the data came out with mismatched quotes, leaving me with aberrant entries in the spreadsheet that caused Excel to throw an out-of-memory error and refuse to process them when I tried to delete the bad rows, or even cells from those bad rows.

Even attempting to work with the CSV data using Text::CSV in Perl was problematic, because the site export data (from phpMyAdmin) was fundamentally broken. I chalk that up partially to the charset problems we’ll talk about later.

I loaded up the database using MAMP, which worked perfectly well, and was able to use Perl DBI to pull the pages and posts out without a hitch, even the ones with weirdo character set problems.

Lesson 2: address character set problems first

I had a number of problems with the XMLRPC interface to WordPress (which otherwise is great, see below) when the data contained improperly encoded non-ASCII characters. I was eventually forced to write code to swap the strings into hex, find the bad 3- and 4-character runs, and replace them with the appropriate Latin-1 substitutes. (Note that these don’t quite match that table – I had to look for the ‘e2ac’ or ‘c3’ delimiter characters in the input to figure out where the bad characters were.) Once I hit on this idea, it worked very well.

Lesson 3: build in checkpointing from the start for large import jobs

The various problems ended up causing me to repeatedly wipe the WordPress posts database and restart the import, which wasted a lot of time. I did not count that toward the overall time needed to complete the job when I charged my client; if I had, it would have been more like 20-24 hours instead of 6. Fortunately the imports were, until a failure occurred, a start-it-and-forget-it process. It was necessary to wipe the database between tries because WordPress otherwise very carefully preserves all the previous versions, and cleaning them out is even slower.

I hit on the expedient of recording the row ID of an item each time one successfully imported, and dumping that list out in a Perl END block. If the program fell over and exited due to a charset problem, I got a list of the rows that had processed OK, which I could then add to an ignore list. Subsequent runs could simply exclude those records to get me straight to the stuff I hadn’t done yet and to avoid duplicate entries. I had previously tried just logging the bad ones and going back to redo those, but it turned out to be easier to exclude than include.

Lesson 4: WordPress::API and WordPress XMLRPC are *great*.

I was able to find the WordPress::API module on CPAN, which provides a nice object-oriented wrapper around WordPress XMLRPC. With that, I was able to programmatically add posts and pages about as fast as I could pull them out of the local database.

Lesson 5: XMLRPC just doesn’t support some stuff

You can’t add users or authors via XMLRPC, sadly. In the future, the better thing to do would probably be to log directly in to the server you’re configuring, load the old data into the database, and use the PHP API calls directly to create users and authors, as well as to directly load the data into WordPress. I decided not to embark on this, this time around, because I’m faster and more able in Perl than I am in PHP, and it seemed quicker to go with what I knew than to try to teach myself a new programming language and solve the problem simultaneously.
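Pulling the lessons together, here’s a minimal sketch of the shape such an importer can take. Everything specific in it is a stand-in: the database, table, and column names, the MAMP credentials, the example.com endpoint, and the two hex mappings are all hypothetical, and it calls WordPress::XMLRPC directly rather than going through the WordPress::API wrapper. The checkpointing END block and the hex-based charset repair are the parts that matter:

    use strict;
    use warnings;
    use DBI;
    use WordPress::XMLRPC;

    # Stand-in connection details: MAMP's bundled MySQL defaults to
    # port 8889 with root/root; 'old_site' and 'articles' are made up.
    my $dbh = DBI->connect(
        'DBI:mysql:database=old_site;host=127.0.0.1;port=8889',
        'root', 'root', { RaiseError => 1 },
    );

    my $wp = WordPress::XMLRPC->new({
        username => 'admin',
        password => 'secret',
        proxy    => 'http://example.com/xmlrpc.php',
    });

    # Lesson 3: rows that made it in on earlier runs, pasted in by
    # hand from the list the END block below prints.
    my %already_done = map { $_ => 1 } qw();

    my @imported;    # row IDs that import successfully on this run

    END {
        # Runs even if we die partway through, so a charset failure
        # still leaves a checkpoint list for the next run's ignore list.
        print STDERR "Imported OK: @imported\n" if @imported;
    }

    # Lesson 2: repair bad byte runs by round-tripping through hex.
    # These two mappings are examples only (UTF-8 right single quote
    # to an apostrophe, UTF-8 e-acute to Latin-1 e-acute).
    sub fix_charset {
        my ($text) = @_;
        my $hex = unpack 'H*', $text;
        $hex =~ s/e28099/27/g;
        $hex =~ s/c3a9/e9/g;
        return pack 'H*', $hex;
    }

    # Lesson 1: pull the rows straight out of the database via DBI.
    my $sth = $dbh->prepare('SELECT id, title, body FROM articles');
    $sth->execute;
    while (my $row = $sth->fetchrow_hashref) {
        next if $already_done{ $row->{id} };
        $wp->newPost(
            {
                title       => fix_charset($row->{title}),
                description => fix_charset($row->{body}),
            },
            1,    # publish immediately
        );
        push @imported, $row->{id};
    }

Excluding already-done rows up front, rather than retrying known-bad ones, is what makes each rerun start right where the last one fell over.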
Overall

I’d call this mostly successful. The data made it into the WordPress installation, and I have an XML dump from WordPress that will let me restore it at will. All of the data ended up where it was supposed to go, and it all looks complete. I have a stash of techniques and sample code to work with if I need to do it again.

-
Mojolicious Revolutions
3rd in my series of talks at SVPerl about Mojolicious; this one reviews using the server-side features for building bespoke mock servers, and adds a quick overview of the Mojo client features, which I had missed until last week. Color me corrected.
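To give a flavor of the two sides covered, here’s a minimal sketch (not from the talk itself): a tiny Mojolicious::Lite mock server, with the matching Mojo::UserAgent fetch shown in comments. The /status route and its payload are invented for the example:

    # mock-server.pl - a bespoke mock server in a few lines
    use Mojolicious::Lite;

    get '/status' => sub {
        my $c = shift;
        $c->render(json => { ok => 1, version => '0.01' });
    };

    app->start;    # run with: perl mock-server.pl daemon

    # The client side is just as short:
    #   use Mojo::UserAgent;
    #   my $ua  = Mojo::UserAgent->new;
    #   my $res = $ua->get('http://127.0.0.1:3000/status')->res;
    #   print $res->json->{ok};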
-
CrashPlan folder date recovery
The situation: a friend had a MacBook Air whose motherboard went poof. Fortunately she had backups (almost up-to-date) in CrashPlan, so she did a restore of her home directory, which worked fine in that she had her files, but not so fine in that all the folder last-changed dates now ran from the current date to a couple days previous (it takes a couple days to recover ~60GB of data).
This was a problem for her, because she partly uses the last-changed date on her folders to help her keep organized. “When was the last time I did anything on project X?” (I should note: she uses Microsoft Word and a couple different photo library managers, so git or the equivalent doesn’t work well for her workflow. She is considering git or the like now for her future text-based work…)
A check with CrashPlan let us know that they did not track folder update dates and couldn’t restore them. We therefore needed to come up with a way to re-establish as best we could what the dates were before the crash.
My original thought was simply to start at the bottom and recursively restore the folder last-updated dates using touch -t, taking the most-recently-updated file in the folder as the folder’s last-updated date. Some research (read: trying it) and thought turned up the following:
- Updating a file updated the folder’s last-updated date.
- Updating a folder did not update the containing folder’s last-updated date.
This meant that we couldn’t precisely guarantee that the folder’s last-updated date would accurately reflect the last update of its contents. We decided in the end that the best strategy for her was to “bubble up” the last-updated dates by checking both files and folders contained in a subject folder. This way, if a file deep in the hierarchy is updated, but the files and folders above it have not been, the file’s last-updated date is applied to its containing folder, and subsequently is applied also to each containing folder (since we’re checking both files and folders, and there’s always a folder that has the last-updated date that corresponds to the one on the deeply-nested file). This seemed like the better choice for her as she had no other records of what had been worked on when, and runs a very nested set of folders.
If you were running a flatter hierarchy, only updating the folders to the last-updated date of the files they contain might be a better choice. Since I was writing a script to do this anyway, it seemed reasonable to implement it so that you could choose to bubble up or not as you liked, and also to selectively bubble up (or not) in a single directory.
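The core move is small enough to show on its own: find the newest modification time among a folder’s files, format it the way touch -t wants it ([[CC]YY]MMDDhhmm[.SS]), and shell out. A stripped-down sketch of just that step (the full script below adds the recursion and the bubble-up options):

    use strict;
    use warnings;
    use Date::Format;

    # Set one folder's mtime to that of its newest contained file.
    my $dir = shift @ARGV;
    die "usage: $0 directory\n" unless defined $dir;

    opendir my $dh, $dir or die "Can't read $dir: $!\n";
    my @files = grep { -f "$dir/$_" } readdir $dh;
    closedir $dh;

    # mtime is field 9 of stat(); sort descending and take the newest.
    my ($newest) = sort { $b <=> $a } map { (stat "$dir/$_")[9] } @files;

    if (defined $newest) {
        # touch -t expects [[CC]YY]MMDDhhmm[.SS].
        my $stamp = time2str('%Y%m%d%H%M.%S', $newest);
        system 'touch', '-t', $stamp, $dir;
    }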
This was the genesis of date-fixer.pl. Here’s the script. A more detailed example of why neither approach to restoring the folder dates is perfect is contained in the POD.
use strict;
use warnings;
use 5.010;

=head1 NAME

date-fixer.pl - update folder dates to match newest contained file

=head1 SYNOPSIS

    date-fixer.pl --directory top_dir_to_fix [--commit] [--verbose]
                  [--includefolders] [--single]

=head1 DESCRIPTION

date-fixer.pl is meant to be used after you've used something like CrashPlan
to restore your files. The restore process will put the files back with
their proper dates, but the folders containing those files will be updated
to the current date (the last time any operation was done in the folder -
specifically, putting the files back).

date-fixer.pl's default operation is to tell you what it would do; if you
want it to actually do anything, you need to add the --commit argument to
force it to actually execute the commands that change the folder dates.

If you supply the --verbose argument, date-fixer.pl will print all the
commands it is about to execute (and, if you didn't specify --includefolders,
warn you about younger contained folders - see below). You can capture these
from STDOUT and further process them if you like.

=head2 Younger contained folders and --includefolders

Consider the following:

    folder1 (created January 2010 - date is April 2011)
        veryoldfile1 (updated March 2011)
        oldfile2 (updated April 2011)
        folder2 (created June 2012 - date is July 2012)
            newfile (updated July 2012)

If we update folder1 to only match the files within it, we won't catch that
folder2's date could actually be much more recent than that of either of the
files directly contained by folder1. However, if we use contained folder
dates as well as contained file dates to calculate the "last updated" date
of the current folder, we may make the date of the current folder
considerably more recent than it may actually have been.

Example: veryoldfile1 and oldfile2 were updated in March and April 2011.
Folder2 was created in June 2012, and newfile was added to it in July 2012.
The creation of folder2 updated the last-updated date of folder1 to June
2012; the addition of newfile updated folder2's last-updated date to July
2012 -- but the last-updated date of folder1 did not change - it remained
June 2012.

If we restore all the files and try to determine the "right" dates to set
the folder update dates to, we discover that there is no unambiguous way to
decide what the "right" dates are. If we use the file dates alone, we'll
miss that folder2 was created in June (causing folder1 to update to June);
if we use both file and folder dates, we update folder1 to July 2012, which
is not accurate either.

date-fixer.pl takes a cautious middle road, defaulting to only using the
files within a folder to update that folder's last-modified date. If you
prefer to ensure that the newest date buried in a folder hierarchy always
"bubbles up" to the top, add the --includefolders option to the command.
date-fixer.pl will, in verbose mode, print a warning for every folder that
contains a folder younger than itself; you may choose to go back and adjust
the dates on those folders with

    date-fixer.pl --directory fixthisone --includefolders --single

This will, for this one folder, adjust the folder's last-updated date to the
most recent date of any of the items contained in it.

=head1 USAGE

To fix all the dates in a directory and all directories below it, "bubbling
up" dates from later files:

    date-fixer.pl --directory dir --commit --includefolders

To fix the dates in just one directory, based only on the files in it and
ignoring the dates on any directories it contains:

    date-fixer.pl --directory dir --commit --single

To see in detail what date-fixer.pl is doing while recursively fixing dates,
"bubbling up" folder dates:

    date-fixer.pl --directory dir --commit --verbose --includefolders

=head1 NOTES

"Why didn't you use File::Find?"

I conceived the code as a simple recursion; it seemed much easier to go
ahead and read the directories myself than to go through the mental exercise
of transforming the treewalk into an iteration such as I would need to use
File::Find instead.

=head1 AUTHOR

Joe McMahon, mcmahon@cpan.org

=head1 LICENSE

This code is licensed under the same terms as Perl itself.

=cut

use Getopt::Long;
use Date::Format;

my ($commit, $start_dir, $verbose, $includefolders, $single);
GetOptions(
    'commit'         => \$commit,
    'directory=s'    => \$start_dir,
    'verbose|v'      => \$verbose,
    'includefolders' => \$includefolders,
    'single'         => \$single,
);

$start_dir or die "Must specify --directory\n";

set_date_from_contained_files($start_dir);

sub set_date_from_contained_files {
    my ($directory) = @_;
    return unless defined $directory;

    opendir my $dirhandle, $directory
        or die "Can't read $directory: $!\n";
    my @contents = grep { $_ ne '.' && $_ ne '..' } readdir $dirhandle;
    closedir $dirhandle;

    # Split the entries into directories and files.
    my @dirs = grep { -d "$directory/$_" } @contents;
    my %dirmap;
    @dirmap{@dirs} = ();
    my @files = grep { !exists $dirmap{$_} } @contents;

    # Recursively apply the same update criteria unless --single is on.
    unless ($single) {
        set_date_from_contained_files("$directory/$_") for @dirs;
    }

    my $most_recent_date;
    if (!$includefolders) {
        $most_recent_date = most_recent_date($directory, @files);
        my $most_recent_folder = most_recent_date($directory, @dirs);
        # Warn when a contained folder is younger than the date we're
        # about to apply, since we're deliberately ignoring it.
        warn "Folders in $directory are more recent ($most_recent_folder) than the most-recent file ($most_recent_date)\n"
            if $verbose
               and defined $most_recent_folder
               and (!defined $most_recent_date
                    or $most_recent_folder gt $most_recent_date);
    }
    else {
        $most_recent_date = most_recent_date($directory, @files, @dirs);
    }

    if (defined $most_recent_date) {
        my @command = (qw(touch -t), $most_recent_date, $directory);
        print "@command\n" if $verbose;
        system @command if $commit;
    }
    else {
        warn "$directory unchanged because it is empty\n" if $verbose;
    }
}

# Return the newest mtime among the named items, formatted for touch -t.
sub most_recent_date {
    my ($directory, @items) = @_;
    my @dates     = map { (stat "$directory/$_")[9] } @items;
    my @formatted = map { time2str('%Y%m%d%H%M.%S', $_) } @dates;
    my @ordered   = sort { $b cmp $a } @formatted;
    return $ordered[0];
}