Building ‘use English;’ into the Perl core

Perl has a list of “stuff we really want to add to the language that needs someone to code it in C,” called PPCs. PPC 14 adds English aliases to Perl’s controversial “punctuation” variables (like $", $?, $., etc.), and I’ve decided to try taking this one on.

I know some of the internals stuff from a long-ago class at a Perl conference, and from Jarkko’s chapter on the internals in Advanced Perl Programming, but this is the first time I’ve actually dived into serious C programming other than a fling with Objective-C back in the day…and I kind of like it.

C absolutely is just enough tool to get the job done, and I’ve actually kind of missed that. Most of the work I was doing at Zip toward the end of my time there was all Scala, and Scala is a nice language but it’s…heavyweight. Takes an age to build and test. Even with a fairly big recompile of the whole interpreter, an edit-build-test cycle is pretty fast in C.

The working experience is a lot like Go, just with way fewer guardrails.

The coding experience, however, is very Zen: a series of enlightenments is necessary to proceed. I have looked at the Perl code a little before, but this project is much more complex than anything I’ve tried before in C. It’s really a process of reading the code, reasoning about it, taking a shot at something, discovering it was more complex than you thought, and continuing until the light dawns.

First enlightenment

I started off looking at the XS code in English::Name to see if I could out-and-out steal it. Unfortunately not, but it did start giving me some hints as to what I could do.

(At this point, I’m going to start talking about Perl internals, and this may become a lot less clear. Sorry about that.)

Each variable in Perl is represented by a “glob” held in a symbol table hash. A glob is a data structure that can hold all the different types of thing a given name can be — this is why, in Perl, you can have $foo and @foo and %foo all at once, because one glob (also known when working on the internals as a GV – “glob value”, I believe) can hold pointers to each kind of variable.
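As a rough mental model only (the real GV structure in the Perl source is considerably more involved), the one-name-many-slots idea can be sketched in standalone C. Every type, field, and function name below is hypothetical and invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Perl's scalar, array and hash values. */
typedef struct { double num; } toy_scalar;
typedef struct { size_t len; } toy_array;
typedef struct { size_t keys; } toy_hash;

/* One "glob" per name, with a separate slot for each kind of variable. */
typedef struct {
    const char *name;
    toy_scalar *scalar;  /* $foo */
    toy_array  *array;   /* @foo */
    toy_hash   *hash;    /* %foo */
} toy_glob;

/* Point the scalar slot of one glob at the scalar held by another,
 * so that both names see the same underlying value. */
static void toy_alias_scalar(toy_glob *alias, const toy_glob *target)
{
    assert(alias != NULL && target != NULL);
    alias->scalar = target->scalar;
}
```

Because each slot is just a pointer, aliasing one kind of variable in this model leaves the other slots of the same name untouched.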

I started out wondering if I could just alias the names directly when I saw them in Perl. For some of the read-only special scalar variables, you can do this by overwriting the SV (scalar value) slot in the GV with a pointer to the aliased variable’s SV.

The gv.c file contains the code that works with global variables, and the function S_gv_magicalize contains a big switch statement that parses the incoming variable names and then uses the sv_magic function to install hooks that are called when the variable is accessed (read or written). So the easiest, dumbest option is to try just sharing the SV that is created for the variable I want to alias with the new name.

The code in S_gv_magicalize is essentially one big old switch statement; it uses a function called memEQs to check the incoming name against variable name strings to see if we should process the variables. The new variables I want to add all look like ${^SOMETHING}; this lets us look English-like, but looks different so we remember that this is a special variable. The code that parses the names converts the letter prefixed with a caret into a control character, so (say) ^C becomes \003; ^SOMETHING would be \023OMETHING, so that’s the string we plug in to memEQs:

if (memEQs(name, len, "\023OMETHING")) ...
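The caret-to-control-character step can be tried out in isolation. This is a minimal sketch, not the parser's actual code: toy_caret_to_ctrl is a hypothetical helper, and it assumes the usual ASCII convention that the control character is the uppercase letter XOR 64.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Rewrite a caret name such as "^SOMETHING" into its internal form,
 * where the first letter is replaced by the matching control character
 * (letter XOR 64, so 'C' -> \003 and 'S' -> \023).
 * Returns the length written to out (excluding the NUL), or 0 on error. */
static size_t toy_caret_to_ctrl(const char *caret_name, char *out, size_t out_size)
{
    assert(caret_name != NULL && out != NULL);

    size_t len = strlen(caret_name);
    /* The output is one byte shorter than the input, plus a NUL. */
    if (len < 2 || caret_name[0] != '^' || len > out_size)
        return 0;

    out[0] = (char)(toupper((unsigned char)caret_name[1]) ^ 64);
    memcpy(out + 1, caret_name + 2, len - 2);
    out[len - 1] = '\0';
    return len - 1;
}
```

Feeding it "^SOMETHING" produces the nine-byte string that the memEQs call above compares against.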

Good, so we have a way to match the variables we’re interested in; now we just need to figure out how to alias the SVs. Poking around, I figured out that if I could find the target variable in the main:: symbol table, I could use a few of the macros that the Perl source provides to find the SV pointer in the old variable, and then assign it to the SV slot in the new GV I was creating. I realized I’d be doing this a lot, so I wrote a preprocessor macro of my own to do this. This was a bit tricky, because I needed to not just substitute a string into the get_sv call, but actually concatenate it into the string. Some Googling found me the C preprocessor’s # (stringizing) operator, which does the trick. Here’s that macro:

#define SValias(vname) GvSV(gv) = newSVsv(get_sv("main::"#vname, 0))

This tells Perl to look in the main:: symbol table for the variable whose name I’ve concatenated into the fully-qualified name and extract the contents of the SV slot for that variable. I then call newSVsv (build a new SV I can use out of this SV) and then assign it to the SV slot in the brand new GV that I’m building.
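The preprocessor trick is easy to check on its own, away from the Perl headers. In this small sketch, TOY_MAIN_NAME and toy_is_main_qualified are hypothetical names made up for the illustration; only the # operator and the merging of adjacent string literals are standard C.

```c
#include <assert.h>
#include <string.h>

/* #vname turns the macro argument into a string literal; the compiler
 * then merges the two adjacent string literals into a single one. */
#define TOY_MAIN_NAME(vname) "main::" #vname

/* Hypothetical helper: does this name start with the "main::" prefix? */
static int toy_is_main_qualified(const char *name)
{
    assert(name != NULL);
    return strncmp(name, "main::", 6) == 0;
}
```

With these definitions, TOY_MAIN_NAME(foo) expands to the single literal "main::foo" at compile time, so nothing is concatenated at run time.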

Easy-peasy, add the aliases for all the variables…and this worked for a certain portion of the variables, but didn’t work at all for others. There also didn’t seem to be rhyme nor reason why this should work for some but not others.

Second enlightenment

I dived back into the code, and read it all through again. There were a lot of goto magicalize statements; (almost — of course, almost, why make it easy?) every special variable ends up jumping to this label, which calls

 sv_magic(GvSVn(gv), MUTABLE_SV(gv), PERL_MAGIC_sv, name, len);

Well. What does that do? Going over to mg.c, where this function is defined, it takes an SV from the GV and the GV itself, both of which will be modified; the remaining three parameters give the kind of magic to add and the name and length of the variable. Those are already set when we get here in gv.c, so my understanding at this point (yes, another enlightenment is needed!) was, “okay, we have a GV, and we’re passing a name and length, so this must be keying off the name to assign the right magic. Obviously if I can pass the GV I have but a different name and len, then the Right Thing will happen in mg.c and this will work perfectly.”

So I tried a couple other variations to try to get remagicking the variable to work.

  1. Adding a block of code right below the sv_magic call to try to reassign the magic. This didn’t work; the call got made, but the variable did not have any magic.
  2. Passing a hardcoded alternate name and length to sv_magic. Also had no detectable effect.
  3. Refactoring the code in mg.c so that I could create a new function that would allow me to pass a second name and len, so that I could do the reassignment inside mg.c instead. This also didn’t work, but not because the concept was wrong; I simply could not get the code to compile, because something in the macros was convinced that I should pass one more argument to the call to the refactored code, even though I wasn’t changing the calling sequence at all.

I spent about a half-hour trying different variations of function calls and naming, and decided that was long enough; I needed to look again and see what was going on deeper down…and maybe find a way that was more compatible with the code already there.

(Note: I did not want to change the calling sequence for sv_magic, or change its return value, because this would have been a change to the Perl API, potentially breaking lots of XS code, and potentially propagating lots of changes all over the Perl codebase itself.)

Third enlightenment

I went back to mg.c again and instead of looking at the code that applied the magic, I went to look at the code that implemented it instead. Reading through all of mg.c, and rereading gv.c, I found that the magic was implemented two different ways.

  • Some variables were set up directly in gv.c, in S_gv_magicalize. These were the variables that I’d been successful in aliasing with the SValias macro; they were read-only, and hard-linked to unchanging data.
  • The rest were set up in mg.c; they were detected as magic in gv.c, in S_gv_magicalize, which then jumped to the sv_magic call to pass the actual assignment of the magic to the SV.

In mg.c, there are two different functions, Perl_magic_get and Perl_magic_set, which handle the magic for getting and setting the SV. (There are a bunch more Perl_magic functions, and it’s definitely possible I’ll need to learn more about those, but my current knowledge seems to indicate that these two are enough to do the implementation of the English variables.) We do the same kind of matching against names to decide what magic applies to the variable, and then execute the appropriate code to make the magic happen. This made sense based on what I knew already, and confirmed that the attempts to set a different name for the sv_magic call were not wrong; I just didn’t manage to implement something that did it properly.
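To picture why name-keyed get hooks matter, here is a standalone toy model in C. It is a sketch only: toy_sv, toy_magic_get, toy_copy and toy_last_status are all hypothetical names, not anything from mg.c, and it assumes that a plain copy of a scalar carries the current value but none of the hooks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interpreter state that a magical variable reflects. */
static int toy_last_status = 0;

/* A toy scalar: a stored value plus an optional name used to look up magic. */
typedef struct {
    int         value;
    const char *magic_name;   /* NULL means "no magic, just use value" */
    size_t      magic_len;
} toy_sv;

/* Toy counterpart of a "get" hook: refresh the value from live state,
 * choosing the behaviour by matching on the variable's name. */
static int toy_magic_get(toy_sv *sv)
{
    assert(sv != NULL);
    if (sv->magic_name != NULL
        && sv->magic_len == 1
        && memcmp(sv->magic_name, "?", 1) == 0)
        sv->value = toy_last_status;
    return sv->value;
}

/* Copying only duplicates the current value; the copy carries no magic. */
static toy_sv toy_copy(const toy_sv *src)
{
    assert(src != NULL);
    toy_sv copy = { src->value, NULL, 0 };
    return copy;
}
```

In this model a copy taken early keeps reporting the old value after toy_last_status changes, while the original keeps tracking it. That would be harmless for a value that never changes and wrong for anything live, which fits the split described above.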

Given this, I decided to try implementing the English variations on two different variables: one a simple fixed read-only one implemented only in gv.c, and a second read-write one implemented in the Perl_magic_get and Perl_magic_set functions in mg.c to see if I’d actually understood the code.

I also chose to go with the paradigm I’d seen throughout these big case statements: do the cases in alphabetical order, and use goto to jump to existing code that already implemented the feature. These gotos are always forward jumps, so they’re not quite so bad, but writing hard branches in code again certainly took me back a ways.
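The shape of that pattern, stripped of everything Perl-specific, looks roughly like this. It is a sketch under stated assumptions: toy_classify, TOY_NAME_IS and the toy_kind enum are hypothetical, with TOY_NAME_IS standing in for the kind of length-plus-bytes comparison that memEQs performs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_kind { TOY_UNKNOWN, TOY_OPEN, TOY_OLD_VERSION };

/* Compare (name, len) against a string literal, length included. */
#define TOY_NAME_IS(lit) \
    (len == sizeof(lit) - 1 && memcmp(name, lit, sizeof(lit) - 1) == 0)

/* Classify an internal variable name, switching on its first byte. */
static enum toy_kind toy_classify(const char *name, size_t len)
{
    enum toy_kind kind = TOY_UNKNOWN;

    assert(name != NULL);
    if (len == 0)
        return kind;

    switch (name[0]) {
    case '\017':                 /* names starting with ^O */
        if (TOY_NAME_IS("\017LD_PERL_VERSION"))
            goto old_version;    /* forward jump into the ']' case */
        if (TOY_NAME_IS("\017PEN"))
            kind = TOY_OPEN;
        break;
    case ']':
        if (len != 1)
            break;
      old_version:
        kind = TOY_OLD_VERSION;  /* shared by both spellings */
        break;
    default:
        break;
    }
    return kind;
}
```

The label sits inside the later case, so the new spelling reuses the existing code without duplicating it, and the jump only ever goes forward.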

Magic variable in gv.c alone: $] aliasing to ${^OLD_PERL_VERSION}

$] provides the older floating-point representation of the Perl interpreter’s version. Looking at gv.c, there’s a block of code that looks like this:

         case ']':               /* $] */
         {

             SV * const sv = GvSV(gv);
             if (!sv_derived_from(PL_patchlevel, "version"))
                 upg_version(PL_patchlevel, TRUE);
             GvSV(gv) = vnumify(PL_patchlevel);
             SvREADONLY_on(GvSV(gv));
             SvREFCNT_dec(sv);
         }
         break;

We fetch the SV already in the variable’s slot; if PL_patchlevel is not already a version object, we upgrade it to one. We then turn it into a number, stash that in the GV, make it read-only, and finally decrement the refcount of the SV that was originally in the slot, since the GV no longer points to it.

To implement ${^OLD_PERL_VERSION}, we need to catch it, and then do a goto to this code. Here’s the patch:

diff --git a/gv.c b/gv.c
index 93fc37da63..6c00b050db 100644
--- a/gv.c
+++ b/gv.c
@@ -2231,7 +2231,9 @@ S_gv_magicalize(pTHX_ GV *gv, HV *stash, const char *name, STRLEN len,
                     goto storeparen;
                 }
                 break;
-            case '\017':        /* ${^OPEN} */
+            case '\017':        /* ${^OPEN}, ${^OLD_PERL_VERSION} */
+                if(memEQs(name, len, "\017LD_PERL_VERSION"))
+                    goto old_perl_version;
                 if (memEQs(name, len, "\017PEN"))
                     goto magicalize;
                 break;
@@ -2430,7 +2432,9 @@ S_gv_magicalize(pTHX_ GV *gv, HV *stash, const char *name, STRLEN len,
             sv_setpvs(GvSVn(gv),"\034");
             break;
         case ']':            /* $] */
+          old_perl_version:
         {
+
             SV * const sv = GvSV(gv);
             if (!sv_derived_from(PL_patchlevel, "version"))
                 upg_version(PL_patchlevel, TRUE);

It’s very straightforward; just reuse the code we have for $] for ${^OLD_PERL_VERSION} with a goto to that code. Tests show it works as expected:

And running it:



