This bug is…32 years old?

Last month, while working in the Perl debugger, I came across this strange behavior:

  DB<4> l 1.2
1.2:	print "this is a placeholder";

Well, that doesn’t look right.

  DB<5> l
2.2 	
3.2 	sub lorem {
4.2:	  for(my $i = 1; $i < 100; $i++) {
5.2:	    print <<EOF;
6.2 	    Lorem
7.2 	    ipsum
8.2 	    dolor
9.2 	    sit
10.2 	    amet,
11.2 	EOF

Okay, that really doesn’t look right.

  DB<5> l 10.0.0.5
10.0.0.5 	    amet,

…I think there’s definitely something wrong here.

 l $17$
Scalar found where operator expected (Missing operator before "$;"?) at (eval 153)[/Users/joemcmahon/perl5/perlbrew/perls/perl-5.38.0/lib/5.38.0/perl5db.pl:742] line 1, near "$17$;"

Oh, no. No no no.

So I dug in to the debugger code, and found this line:

2805 elsif (my ($s, $e) = $spec =~ /^(?:(-?[\d\$\.]+)(?:[-,]([\d\$\.]+))?)?/ ) {

This is the code that matches the arguments for the debugger’s l command for either l start,stop or l start-stop. The problem is that there are a lot of things that this matches that absolutely do not make any sense, like

l ......
l -.
l $
l -.$.
l -12
1 3.14
l 127.0.0.1
l 17$

Worse, as we saw above, some of them kind of work. Others…just don’t do anything, but are accepted.

What causes this?

First question: why does this work at all? The basic reason is that Perl is all about implicit conversions. Several of those expressions evaluate to 0 in a numeric context, so the debugger effectively executes a l 0. This doesn’t list anything, because the program source in the debugger’s magic source code array starts at line 1. But Perl arrays start at 0, and therefore index 0 in the magic source array exists; its just empty. So the debugger happily lists nothing.

For the floating point numbers and IP addresses, Perl captures the list command argument as a string, and then uses it in a loop that uses the “line number” as an index into that same magic source code array, incrementing it by 1 each time. The floating-point-looking strings convert to an actual float, which can be incremented by 1, and the index into the array is automatically truncated to an int when we try to use it. So the debugger finds the lines in the array, and prints the current value of the index it’s using as the line number, resulting in floating-point line numbers.

So why is this complicated regex here at all? Why aren’t we being a lot more strict about this?

Digging into the commit history

I sat down with the perl5 repo from Github and went spelunking. Going back through the commits shows that exact pattern’s been there for over 30 years – since Perl 3. It’s just been cargo-culted forward since then.

Let’s establish some history before we do anything else.

(Aside: this was an interesting exercise, and I learned a bit more about navigating a very long and complex commit history with git. The big thing to keep in mind is that you can navigate much more easily through time with git reset --hard than with git checkout. I still ended up hitting a breaking commit of the OS/2 Makefile in there that prevented me from jumping all the way back to HEAD, and…actually kind of breaks the repo. There’s probably an easy way to repair it, but I ended up deleting the repo altogether and recloning to fix it.)

Here’s a summary of the evolution of the debugger and its l command over time, all the way back to the original creation of the debugger. I got this by repeatedly using git blame to find the commit that added a change to the line with the regex, and then git reset -hard commit^ to jump back in time, repeating until we get so far back that there’s no debugger anymore.

CommitVersion and notesRange-match regex
a559c25918bperl 1.0 patch 8

This is the original perldb commit. Parse loop is a for(;;) starting at line 99.

l min-max List lines, lines 131-135
l line List line, lines 137-140
l List the whole program, lines 141-146

Pattern is very simple, but doesn’t outlaw non-digits in the range.
/^l (.*)[-,](.*)/
13281fa4f8perl 2.0 patch 1, still perldb at the top level of the source directory

parse loop is still for(;;) starting at line 100

No changes in the l command support; lines have moved down 1. Pattern for range is the same.
Same
a687059cbafperl 3.0, moved to lib/perldb.pl

Command parse while() loop at line 64.

Command parsing depends heavily on fallthrough to work.

Lines 145-157 handle l subroutine_name with @sub, falling through to range code to do the command.
Lines 166-168 handle bare l: set up l start-end and fall through
Lines 169-173 handle l start+number: set up corresponding l start-end and fall through
Lines 174-168 actually do the l start-end

First appearance of the complex match including $ and .. Only . gets special handling (current line). No leading - accepted yet, and no handling of $, though the match accepts it.
/^l\s*(([\d\$.]+)(-,)?)?/
fe14fcc35f7perl 4.0.00

Command parse now a while loop with CMD label at line 90; command use next to restart the REPL.

No changes to the supported commands from the 3.0 debugger. Patterns now use \b to delimit the command from its arguments, but the actual commands remain the same; processing still uses fallthrough as before.

Lines 201-219 handle l subroutine: find in @sub, stack l start-end, fall through.
Lines 231-235 handle l nnnn+mmm: compute end for l start-end, fall through.
Lines 236-248 handle l range

Complex match is extended to match all of the arguments. $ and . accepted.
/^l\b\s*(([\d\$.]+)([-,]([\d\$\.]+))?)?/
d338d6fe1dfperl 5.0, “Ilya’s new debugger”, lib/perl5db.pl

l min+incr List incr+1 lines starting at min.
l min-max List lines min through max.
l line List single line.
l subname List first window of lines from subroutine.
l List next window of lines.

Lines 377-398 handle l subname by looking up the subroutine in the debugger’s magical structures, stacking an l start-end and falling through.
The w and - commands are implemented as special cases of l, stacking an l range command and falling through.
The l start+n command is also implemented with a stacked l range fallthrough.
Lines 415-432 actually implement the l range command. . is still specially processed as current line number, but $ is still ignored.

Complex match is unchanged.
Same
54d04a52ebeperl 5.0 beta 3

Same commands as last change.

l command code does not change in any significant way. The handling of the end of the range changes slightly, but there’s no new special case support.

First appearance of the leading - in the complex pattern. Nothing added to specifically support it.
/^l\b\s*((-?[\d\$\.]+)([-,]([\d\$\.]+)?)?/
492652be590perl 5.7.2

Major reorg of commands. Fallthrough code in mainline CMD loop moves to cmd_l.

LIne 1687 l – lists whole file (first occurrence)
Line 1688 l $name for subref list command.
Line 1696 implements l function_name
Line 1724 implements bare l command
Line 1728 implements l line+nnnn
Line 1734 retains the complex match for l start-end and l start,end. . is still special-cased. No change to the regex.
/^((-?[\d\$\.]+)([-,]([\d\$\.]+))?)?/
eda6e075b0cperl 5.9.X

Big documentation patch, no functional changes. Code moves a bit, but doesn’t change at all.
Same
d12a485175bperl 5.9.X

Retracts erroneous doc patch. No functional changes.
Same
69893cffeb6perl 5.9.x

Rearranges things a bit to minimize delta between 5.8 and 5.9 debuggers, but no functional changes.
Same
e22ea7cc8dbperl 5.9.x

Revert previous patch.
Same
613bf352ae2perl 5.10.0

“Extract _cmd_l_range“.

Beginning of the modern debugger architecture. Regex is unchanged, but the code is reorganized like this:
1) The regexes have been updated to remove the leading command character.
2) cmd_l parses the different commands, and each is handed off to a specific implementation routine. The complex regex routes to _cmd_l_range.
3) . is processed as current line in _cmd_l_range. $ is still unprocessed.
/^(?:(-?[\d\$\]+)(?:[-,]([\d\$\.]+))?)?/
be43a6d374eMinor changes to remove extra comments and streamline names. No other changes.Same
401da5225b7More mostly-cosmetic changes. No significant logic changes.Same
b7a96fc9f53More code rearrangement and streamlining of cmd_l. No logic changes.Same
Change history for the debugger l command.

So there are essentially four eras in the evolution of the debugger:

  • Era 1, the initial version in Perl 1 and Perl 2, which used an explicit simple range match.
  • Era 2, the appearance of the complex match in Perl 3.0.
  • Era 2.5, the transition to Perl 5.
  • Era 3, the extraction of commands in Perl 5.7.2 to separate subs.
  • Era 4, the modern (and testable) command architecture in Perl 5.10.0.

From looking over all these changes, it seems to me that what happened was that the pattern was intended to be a catchall for all the range-ish commands, but it wasn’t explicitly broken up so that it was clear exactly what those commands were supposed to be. My best guess is that Larry intended to implement l $var somehow in Perl 3; I believe assigning globs sort of worked like references at that point, and it would probably have been possible to expose the sub slot of the glob to the debugger…but this never happened.

Since the handling was all in one block with fallthroughs, creating the “match all the possible things” regex makes more sense in that context; I’m again guessing that the code was just going to ignore nonsense the way it does now, and that no one was expected to be silly enough to type floating point numbers or IP addresses at it.

The implicit conversion to integer when these were used as array indexes into the magic line array meant that it worked “well enough”. The increment of the variable containing the current line number cleaned it up enough that really dumb stuff like IP addresses and strings of periods were handled by implicit conversion to integers when looking for lines and didn’t cause enough of a problem to require that it be fixed.

I suppose you could be lazy and call it an “Easter egg”, but having all the weird things accepted feels sloppy.

What’s the “right thing”?

My best guess at what’s actually supposed to have been implemented sometime between Perl 3 and now is

  • l – : list the whole file. This does work, but probably should be called out as a specific case.
  • It’s likely that l -nnnn (list from here to line nnnn) was intended to work, since the match is there for it, but doesn’t.
  • It would seem logical that l nnnn- (list from line nnnn to EOF) should also be wanted and work, but also doesn’t.
  • Variations on using . as one or the other of the arguments: l nnnn,., (list from some previous line to the current one) and l .,nnnn (from here to some later line) probably ought to work. l .+5 should also work. Either these don’t work at all, or don’t work as one would expect.
  • Technically, l .,. should be legal and work, and do the same thing as l ., but admittedly it’s silly and I think we’d be forgiven if it didn’t. (It currently doesn’t.)
  • Bare $ is a mystery. You can enter it, and it matches, but it does nothing. There’s nothing in the code or comments that points to an intended use. Perhaps it was meant to be a magic “last line” substitute, or maybe it’s an artifact of “we should be able to match stuff in a scalat at some point” which ended up implemented as a completely separate case. Since bare $ has never worked, I think we’re OK dropping it.
  • Several other things match and the debugger tries to process them, but they’re nonsensical. l $$$$$$ is effectively l. l $fred$ matches but causes the debugger to throw an error; it should probably be rejected altogether.
  • There is no “I don’t like what you entered” clause. Either something weird happens, nothing happens, or there’s an error.

The problem is that complex regex right now matches far more than it ought to. We really need to establish what the acceptable forms are, and break that regex up so that it doesn’t match nonsense and cmd_l explicitly rejects nonsense. We’re only getting along because of Perl’s forgiving nature — but that means that peculiar things are accepted and (sort of) work.

Next steps

I’d suggest that we decide what the valid inputs are, and what they should do, and that the giant catchall regex should go away in favor of something much more like Larry’s original Perl 1 version.

I’m going to put a patch together in the short term that will simply not match most of the weird options, and add a “list range specification invalid” catchall else in _cmd_l_range or _cmd_l. We can decide if we want to add support for some of the more esoteric options (like l .,nnnn and so on) later; just making sure that we don’t assign nonsense to the start of the range that keeps propagating through further list commands seems like enough of a fix for now.

Reply