Scraping Patchstorage

I lost an important VCVRack patch a couple days before Mountain Skies 2019. It was based on a patch I’d gotten from patchstorage.com, but I couldn’t remember which patch it was. I tried paging through the patches on the infinite scroll, but it wasn’t helping me much. I knew the patch had Clocked and the Impromptu 16-step sequencer, but I couldn’t remember anything else about it after seriously altering it for my needs.

I decided that automating the search was the only way I was going to find the base patch in time to recreate my performance patch, so I hammered out the following short Perl script to download the patches:

use strict;
use warnings;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

$|++;

my $base_url = "https://patchstorage.com/platform/vcv-rack/page/";
my $mech = WWW::Mechanize->new(autocheck=>0);
WWW::Mechanize::TreeBuilder->meta->apply($mech);
use constant SLEEP_TIME => 2;

my $seq = 1;
my $working = 1;
while ($working) {
  print "page $seq\n";
  $mech->get($base_url.$seq);
  sleep(SLEEP_TIME);
  # Grab every anchor on the listing page, then keep only the hrefs
  # that point at patch posts (everything else is site navigation).
  my @patch_pages = $mech->look_down('_tag', 'a');
  my @patch_links = grep {
    defined $_ and
    $_ ne '' and
    !m[/upload-a-patch/] and
    !m[/login/] and
    !m[/new-tutorial/] and
    !m[/explore/] and
    !m[/registration/] and
    !m[/new-question/] and
    !m[/platform/] and
    !m[/tag/] and
    !m[/author/] and
    !m[/wp-content/] and
    !m[/category/] and
    !/\#$/ and
    !/\#respond/ and
    !/\#comments/ and
    !/mailto:/ and
    !m[/privacy-policy] and
    !/discord/ and
    !m[https://vcvrack] and
    !/javascript:/ and
    !/action=lostpassword/ and
    !m[patchstorage\.com/$]
  } map {$_->attr('href')} @patch_pages;
  # De-duplicate the links by using them as hash keys.
  my %links;
  @links{@patch_links} = ();
  @patch_links = keys %links;
  print scalar @patch_links, " links found\n";
  # A listing page with no patch links means we have run out of pages.
  unless (@patch_links) {
    $working = 0;
    next;
  }
  for my $link (@patch_links) {
    next unless $link;
    print $link;
    my @parts = split /\//, $link;
    my $patch_name = $parts[-1];
    # Skip anything already saved so the script can be restarted.
    if (-f "/Users/jmcmahon/Downloads/$patch_name") {
      print "...skipped\n";
      next;
    }
    print "\n";
    $mech->get($link);
    sleep(SLEEP_TIME);
    my @patches = $mech->look_down('id', "DownloadPatch");
    for my $patch (@patches) {
      my $p_link = $patch->attr('href');
      next unless $p_link;
      print "$patch_name...";
      $mech->get($p_link);
      sleep(SLEEP_TIME);
      # autocheck is off, so don't save an error page as if it were a patch.
      unless ($mech->success) {
        print "failed\n";
        next;
      }
      open my $fh, ">", "/Users/jmcmahon/Downloads/$patch_name" or die "Can't open $patch_name: $!";
      print $fh $mech->content;
      close $fh;
      print "saved\n";
    }
  }
  $seq++;
}

Notable items here:

  • The infinite scroll is actually a chunk of JavaScript wrapped around a standard WordPress page setup, so I can “page” back through the patches for Rack by incrementing the page number in the URL and pulling the links to the actual patch posts off each page.
  • That giant grep and map cleans up the links I get off the individual pages to just the ones that are actually links to patches.
  • I have a couple checks in there for “have I already downloaded this?” to allow me to restart the script if it dies partway through the process.
  • The script kills itself off once it gets a page with no links on it. I haven’t actually gotten that far yet, but I think it should work.
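
To make the filtering and stopping logic easier to see on its own, here is a minimal sketch that needs only core Perl and no network access. It is not the script above: the names `$not_a_patch` and `patch_links` and the sample URLs are made up for illustration, and the single deny-list regex only approximates the long chain of tests in the real grep.

```perl
use strict;
use warnings;

# Hypothetical deny-list: an href matching this is site chrome, not a patch post.
my $not_a_patch = qr{
    /(?:upload-a-patch|login|registration|explore|platform|tag|author|category|wp-content)/
  | \#(?:respond|comments)?$
  | ^(?:mailto|javascript):
  | patchstorage\.com/$
}x;

# Return the unique hrefs that survive the deny-list, in first-seen order.
sub patch_links {
    my @hrefs = @_;
    my %seen;
    return grep {
        defined($_) && length($_) && $_ !~ $not_a_patch && !$seen{$_}++
    } @hrefs;
}

# Made-up hrefs standing in for the anchors scraped from one listing page.
my @page = (
    'https://patchstorage.com/login/',
    'https://patchstorage.com/some-patch/',
    'https://patchstorage.com/some-patch/',
    'https://patchstorage.com/some-patch/#comments',
    undef,
    'https://patchstorage.com/',
);

my @links = patch_links(@page);
print scalar(@links), " links found\n";

# A listing page that yields no patch links is the signal to stop paging.
print "done paging\n" unless @links;
```

Folding the de-duplication into the same grep with a `%seen` hash keeps the links in page order, which the hash-slice trick in the script does not; for a bulk download the order doesn't matter much either way.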

Patchstorage folks: I apologize for scraping the site, but this is for my own use only; I’m not republishing. If I weren’t desperate to retrieve the patch for Friday I would have just left it alone.
