Scraping Patchstorage

I lost an important VCV Rack patch a couple of days before Mountain Skies 2019. It was based on a patch I’d gotten from patchstorage.com, but I couldn’t remember which one. I tried paging through the patches on the infinite scroll, but it wasn’t helping me much. I knew the patch had Clocked and the Impromptu 16-step sequencer, but after seriously altering it for my needs I couldn’t remember anything else about it.

I decided that automating the search was the only way I was going to find the base patch again in time to recreate my performance patch. I hammered out the following short Perl script to download the patches:

use strict;
use warnings;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

$|++;

my $base_url = "https://patchstorage.com/platform/vcv-rack/page/";
my $mech = WWW::Mechanize->new(autocheck=>0);
WWW::Mechanize::TreeBuilder->meta->apply($mech);
use constant SLEEP_TIME => 2;

my $seq = 1;
my $working = 1;
while ($working) {
  print "page $seq\n";
  $mech->get($base_url.$seq);
  sleep(SLEEP_TIME);
  my @patch_pages = $mech->look_down('_tag', 'a');
  # Keep only hrefs that look like links to patch posts; drop navigation,
  # taxonomy, login, and other site-chrome links.
  my @patch_links = grep {
    defined $_ and
    $_ ne '' and
    !m[/upload-a-patch/] and
    !m[/login/] and
    !m[/new-tutorial/] and
    !m[/explore/] and
    !m[/registration/] and
    !m[/new-question/] and
    !m[/platform/] and
    !m[/tag/] and
    !m[/author/] and
    !m[/wp-content/] and
    !m[/category/] and
    !/\#$/ and
    !/\#respond/ and
    !/\#comments/ and
    !/mailto:/ and
    !/\/privacy-policy/ and
    !/discord/ and
    !/https:\/\/vcvrack/ and
    !/javascript:/ and
    !/action=lostpassword/ and
    !/patchstorage\.com\/$/
  } map {$_->attr('href')} @patch_pages;
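  # De-duplicate: the same post is often linked more than once on a page.
  my %links;
  @links{@patch_links} = ();
  @patch_links = keys %links;
  print scalar @patch_links, " links found\n";

  # A page with no patch links means we've run off the end of the listing.
  unless (@patch_links) {
    $working = 0;
    next;
  }

  for my $link (@patch_links) {
    next unless $link;
    print $link;
    my @parts = split /\//, $link;
    my $patch_name = $parts[-1];
    # Skip anything already saved, so the script can be restarted safely.
    if (-f "/Users/jmcmahon/Downloads/$patch_name") {
      print "...skipped\n";
      next;
    }
    print "\n";
    $mech->get($link);
    sleep(SLEEP_TIME);
    my @patches = $mech->look_down('id', "DownloadPatch");
    for my $patch (@patches) {
      my $p_link = $patch->attr('href');
      next unless $p_link;
      print "$patch_name...";
      $mech->get($p_link);
      sleep(SLEEP_TIME);
      # autocheck is off, so don't save an error page as if it were a patch.
      unless ($mech->success) {
        print "failed\n";
        next;
      }
      open my $fh, ">", "/Users/jmcmahon/Downloads/$patch_name" or die "Can't open $patch_name: $!";
      print $fh $mech->content;
      close $fh;
      print "saved\n";
    }
  }
  $seq++;
}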

Notable items here:

The infinite scroll is actually a chunk of JavaScript wrapped around a standard WordPress page setup, so I can “page” back through the patches for Rack by incrementing the page number and pulling off the links to the actual posts with the patches in them.
That giant grep and map clean up the links I get off the individual pages, leaving just the ones that are actually links to patches.
I have a couple checks in there for “have I already downloaded this?” to allow me to restart the script if it dies partway through the process.
The script is meant to stop once it gets a page with no patch links on it. I haven’t actually gotten that far yet, but I think it should work.
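
For what it's worth, the link cleanup could also be written as a table of patterns instead of one long chain of conditions. This is only a sketch, not what I actually ran: the @skip list mirrors the patterns in the script, the helper names is_patch_link and unique_patch_links are made up for illustration, and the sample URLs are hypothetical. It needs nothing beyond core Perl.

```perl
use strict;
use warnings;

# Any href matching one of these is site chrome, not a patch post.
# (Same patterns as the grep in the script, just kept in a list.)
my @skip = (
    qr{/upload-a-patch/}, qr{/login/},      qr{/new-tutorial/},
    qr{/explore/},        qr{/registration/}, qr{/new-question/},
    qr{/platform/},       qr{/tag/},        qr{/author/},
    qr{/wp-content/},     qr{/category/},   qr{/privacy-policy},
    qr/\#$/,              qr/\#respond/,    qr/\#comments/,
    qr/mailto:/,          qr/javascript:/,  qr/discord/,
    qr{https://vcvrack},  qr/action=lostpassword/,
    qr/patchstorage\.com\/$/,
);

# True if the href is defined, non-empty, and matches none of @skip.
sub is_patch_link {
    my ($href) = @_;
    return 0 unless defined $href && length $href;
    for my $re (@skip) {
        return 0 if $href =~ $re;
    }
    return 1;
}

# Filter and de-duplicate in one pass, keeping first-seen order.
sub unique_patch_links {
    my %seen;
    return grep { is_patch_link($_) && !$seen{$_}++ } @_;
}

# Hypothetical hrefs of the kind a listing page might contain.
my @hrefs = (
    'https://patchstorage.com/some-patch/',
    'https://patchstorage.com/some-patch/',
    'https://patchstorage.com/login/',
    'https://patchstorage.com/tag/drone/',
    'https://patchstorage.com/',
    '#',
    undef,
    '',
);

print "$_\n" for unique_patch_links(@hrefs);
```

Keeping the patterns in a list makes it easier to add a new one when the site grows another navigation link, and the %seen hash keeps the links in page order, which the hash-slice trick in the script does not.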

Patchstorage folks: I apologize for scraping the site, but this is for my own use only; I’m not republishing. If I weren’t desperate to retrieve the patch for Friday, I would have just left it alone.
