ETL into WordPress: lessons learned

I had a chance this weekend to do a little work on importing a large (4000 or so articles and pages) site into WordPress. It was an interesting bit of work, with a certain amount of learning required on my part – which translated into some flailing around to establish the toolset.

Lesson 1: ALWAYS use a database in preference to anything else when you can. 
I wasted a couple hours trying to clean up the data for CSV import using any of a number of WordPress plugins. Unfortunately, CSV import is half-assed at best – more like about quarter-assed, and any cleanup in Excel is excruciatingly slow.
Some of the data came out with mismatched quotes, leaving me with aberrant entries in the spreadsheet; Excel threw an out-of-memory error and refused to process them when I tried to delete the bad rows, or even individual cells from those rows.
Even attempting to work with the CSV data using Text::CSV in Perl was problematic because the site export data (from phpMyAdmin) was fundamentally broken. I chalk that partially up to the charset problems we’ll talk about later.
I loaded up the database using MAMP, which worked perfectly well, and was able to use Perl DBI to pull the pages and posts out without a hitch, even the ones with weirdo character set problems.
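The DBI side of this is almost anticlimactic. A minimal sketch of the pull – assuming MAMP's default port and credentials, and a hypothetical `articles` table, since the real schema was just whatever phpMyAdmin exported – looks like this:

```perl
use strict;
use warnings;

# MAMP's bundled MySQL listens on port 8889 by default, with
# user/password "root"/"root".
sub mamp_dsn {
    my ($db) = @_;
    return "dbi:mysql:database=$db;host=127.0.0.1;port=8889";
}

# Pull every article; the table and column names are illustrative.
sub fetch_articles {
    my ($dbh) = @_;
    my $sth = $dbh->prepare(
        'SELECT id, title, body, post_type FROM articles ORDER BY id');
    $sth->execute;
    my @rows;
    while ( my $row = $sth->fetchrow_hashref ) {
        push @rows, $row;   # rows arrive intact, weirdo characters and all
    }
    return \@rows;
}

# Guarded so the sketch still runs when DBI or the database isn't around.
my $articles = eval {
    require DBI;
    my $dbh = DBI->connect( mamp_dsn('old_site'), 'root', 'root',
        { RaiseError => 1, mysql_enable_utf8 => 1 } );
    fetch_articles($dbh);
};
warn "no local database: $@" if $@;
```

Once the rows are in Perl data structures, all the cleanup that was agonizing in Excel becomes ordinary string munging.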
Lesson 2: address character set problems first
I had a number of problems with the XMLRPC interface to WordPress (which otherwise is great, see below) when the data contained improperly encoded non-ASCII characters. I was eventually forced to write code to swap the strings into hex, find the bad three- and four-character runs, and replace them with the appropriate Latin-1 substitutes. (Note that these don’t quite match the standard tables – I had to look for the ‘e2ac’ or ‘c3’ delimiter characters in the input to figure out where the bad characters were.) Once I hit on this idea, it worked very well.
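A sketch of the hex-swap trick, with a handful of illustrative mappings (the real substitution table was built up by eyeballing hex dumps of the rows that failed, so these particular entries are examples, not the actual list):

```perl
use strict;
use warnings;

# Each bad run of bytes gets swapped for a plain Latin-1 stand-in.
# These mappings are illustrative; the real table came from staring
# at hex dumps of the failing rows.
my %substitute = (
    'e28099' => "'",       # right single quote
    'e2809c' => '"',       # left double quote
    'e2809d' => '"',       # right double quote
    'e28093' => '-',       # en dash
    'c3a9'   => "\xe9",    # e-acute, as Latin-1
);

sub repair {
    my ($text) = @_;
    # Swap the string into hex so the bad runs are easy to spot...
    my $hex = unpack 'H*', $text;
    # ...replace each known run (the distinctive lead bytes made this
    # safe enough in practice)...
    for my $run ( keys %substitute ) {
        my $sub_hex = unpack 'H*', $substitute{$run};
        $hex =~ s/$run/$sub_hex/g;
    }
    # ...and swap back.
    return pack 'H*', $hex;
}
```

Crude, but effective: anything the table doesn't know about passes through untouched, so the repair can be grown one bad run at a time.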
Lesson 3: build in checkpointing from the start for large import jobs
The various problems ended up causing me to repeatedly wipe the WordPress posts database and restart the import, which wasted a lot of time. I did not count that toward the overall time needed to complete the job when I charged my client; if I had, it would have been more like 20-24 hours instead of 6. Fortunately the imports were, until a failure occurred, a start-it-and-forget-it process. It was necessary to wipe the database between tries because WordPress otherwise very carefully preserves all the previous versions, and cleaning them out is even slower.
I hit on the expedient of recording the row ID of an item each time one successfully imported and dumping that list out in a Perl END block. If the program fell over and exited due to a charset problem, I got a list of the rows that had processed OK, which I could then add to an ignore list. Subsequent runs could simply exclude those records to get me straight to the stuff I hadn’t done yet and to avoid duplicate entries.
I had previously tried just logging the bad ones and going back to redo those, but it turned out to be easier to exclude than include.
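The checkpointing scheme above can be sketched in a few lines. This demo uses a per-run file name and an eval-and-skip loop so it runs to completion; the real import used one fixed checkpoint file and simply let the die propagate, relying on the END block to dump the successes on the way down. The `import_row` stand-in and the sample rows are, of course, hypothetical:

```perl
use strict;
use warnings;

# Per-run checkpoint file name for this demo; the real import used a
# single fixed file so later runs could pick up where earlier ones left off.
my $done_file = "imported-ids.$$.txt";

my %already_done;            # row IDs that made it in on earlier runs
if ( open my $in, '<', $done_file ) {
    chomp( my @ids = <$in> );
    @already_done{@ids} = ();
}

my @succeeded;

# Even if the program falls over mid-run, END still fires, so the
# checkpoint file always reflects what actually got imported.
END {
    if (@succeeded) {
        if ( open my $out, '>>', $done_file ) {
            print {$out} "$_\n" for @succeeded;
            close $out;
        }
    }
}

# import_row() stands in for the real XMLRPC call; in this sketch it
# simulates a charset failure on one record.
sub import_row {
    my ($row) = @_;
    die "charset problem in row $row->{id}\n" if $row->{bad};
}

my @rows = ( { id => 1 }, { id => 2, bad => 1 }, { id => 3 } );

for my $row (@rows) {
    next if exists $already_done{ $row->{id} };   # exclude, don't redo
    eval { import_row($row); 1 } or next;         # skip failures this run
    push @succeeded, $row->{id};                  # record only on success
}
```

The key property is that the "done" list is written even on abnormal exit, so every run makes monotonic progress.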
Lesson 4: WordPress::API and WordPress XMLRPC are *great*.
I was able to find the WordPress::API module on CPAN, which provides a nice object-oriented wrapper around WordPress XMLRPC. With that, I was able to programmatically add posts and pages about as fast as I could pull them out of the local database.
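The per-post code ends up pleasantly short. This is a hedged sketch from memory of WordPress::API's interface – a Post object with title/description accessors and a save() call – so check the CPAN docs before trusting the details; the URL and credentials are placeholders:

```perl
use strict;
use warnings;

# Hedged sketch: the method names here are from my recollection of the
# WordPress::API POD; verify against CPAN. The proxy URL, username,
# and password are placeholders.
sub post_article {
    my ($article) = @_;
    require WordPress::API::Post;
    my $post = WordPress::API::Post->new({
        proxy    => 'http://example.com/xmlrpc.php',
        username => 'admin',
        password => 'secret',
    });
    $post->title( $article->{title} );
    $post->description( $article->{body} );
    $post->save;    # one XMLRPC round trip per post
    return $post;
}
```

One call per row out of the local database, and the posts land on the far end about as fast as the wire allows.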
Lesson 5: XMLRPC just doesn’t support some stuff
You can’t add users or authors via XMLRPC, sadly. In the future, the better thing to do would probably be to log directly in to the server you’re configuring, load the old data into the database, and use the PHP API calls directly to create users and authors as well as to load the data straight into WordPress. I decided not to embark on that this time: I’m faster and more able in Perl than I am in PHP, and it seemed wiser not to try to teach myself a new programming language and solve the problem at the same time.
Overall
I’d call this mostly successful. The data made it into the WordPress installation, and I have an XML dump from WordPress that will let me restore it at will. All of the data ended up where it was supposed to go, and it all looks complete. I have a stash of techniques and sample code to work with if I need to do it again.

Bluetooth, LineIn, Soundflower: talking over Skype and playing music

Someone who wants to teach dance classes online asked me if there was a reasonable way (i.e., without spending a lot of money) to set up a Skype link that can be used for both music and a wireless microphone setup.

The plan is to put something together that allows her to

  • Get far enough away from the camera that she can be seen head to toe (being able to see the footwork is important) and with a wide enough angle that she doesn’t have to dance unnaturally in one spot.
  • Send iTunes output and her voice over the line at the same time to one or more people, in sync with the music.
  • Have some kind of a wireless mic to be able to communicate to her students without shouting.
  • Be able to hear her students talk back without their hearing their own voices delayed, or her hearing her own voice delayed.

This turns out to be more complicated than it might seem. The iSight camera doesn’t work very well for this; its field of view is quite narrow, and on top of that it’s very difficult to adjust it so that it points properly. This was relatively easy to solve: a Logitech HD Pro 920 works fine for both the wide-angle and head-to-toe issues; it can be mounted on a tripod (it has the necessary threading to mount on a standard photo tripod), and after an upgrade to a more powerful laptop – her 2008 MacBook Air was just not cutting it! – the video issue was solved.

The audio issue was thornier. Originally, I hit up Sweetwater Sound for a real wireless mic setup; after realizing this was going to be well north of $300 once I got the mic, the base station, and the computer interface to actually hook it all up, and that it would mean a lot of extra hardware to deal with as well, I decided I’d better scout around for a better option.

I was stuck until the instructor suggested a Bluetooth headset instead. It’s a reasonable, good-enough audio input channel at 8 kHz – she wants to talk across it, not record studio-quality audio, so a little bit tinny is OK – and it’s definitely wireless. After a bit of investigation, I settled on the Jawbone ERA as the most-likely-workable option. The ERA is light, small, fits tightly (important for a dancer), and is the current best headset suggestion from Wirecutter, which I have learned to trust on stuff like this. It’s easy to connect a Bluetooth headset to OS X (getting it to talk properly to the software’s a different issue, see below). This takes a lot of hardware complication out of the way. Skype supports Bluetooth, so I thought I’d solved the problem.

Unfortunately, an audio test with the music and voice both going through the Bluetooth mic showed me I’d have to get more creative; the music was either inaudible or distorted (that 8 kHz sampling rate made it sound hideous, when you could hear it at all). It needed to be audible and undistorted if a student on the far end was going to be able to dance along with it.

A lot of Googling finally led me to thisevilempire’s blog entry on how to play system audio in Skype calls on OS X. This got me part of the way: I had, according to tests with the Skype Audio Tester “number”, gotten the audio to play nicely across the link, but I was getting a half-second delay of my voice back on the same channel, which made it hard to talk continuously. Not good enough for an instructor.

More searching found a post on Lockergnome spelling out how to transmit clean audio, overlay voice, and hear the returned call without an echo. Here’s how:

  1. Install Soundflower and LineIn, both free.
  2. Make sure the Bluetooth headset is on.
  3. Open the Sound preference pane in System Preferences.
  4. Set the
    1. Jawbone ERA as the input device
    2. Soundflower (64ch) as the output.
  5. Duplicate LineIn in the Applications folder, and rename both copies: one to “LineIn Bluetooth” and the other to “LineIn System”. The names aren’t important; this is just so you can tell them apart.
  6. Launch both copies of LineIn. You’ll need to drag one window aside to reveal them both; they initially launch in exactly the same spot.
  7. Choose the “LineIn Bluetooth” instance in the Dock, and set
    1. Input to “ERA by Jawbone”
    2. Output to Soundflower (2ch).
    3. Click the “Pass thru” button.
  8. Select the other instance, “LineIn System”, and set
    1. Input to Soundflower (64ch).
    2. Output to Soundflower (2ch).
    3. Click the “Pass thru” button.
  9. Run Soundflowerbed (installed in the Applications folder by the Soundflower install). In the menu bar, click on the little flower icon, and
    1. Select “None” under Soundflower (2ch).
    2. Select “Built-in Output” under Soundflower (64ch).
  10. Run Skype, and open its preferences.
    1. Select “Soundflower 2ch” in its Microphone pulldown, and leave everything else alone.
    2. If you have an alternate camera attached, switch the Camera pulldown to the appropriate camera.

You should now be able to make a Skype call, and play music from iTunes, DVD Player, or YouTube over the wire at full fidelity, and talk at the same time. You should hear the far end’s voice on your speakers, along with the music you’re sending across (undelayed).

Try to keep the headset away from the speakers to minimize the chances of feedback.

It’s not all that difficult; the tricky bit is rerouting the audio internally via the two LineIn instances and Soundflower, and getting that routing exactly right is the real work.

I’ve tested this with the Skype test call and it seems to have worked; the big test will be the full-up video camera plus the streaming audio. We’ll give that a shot soon and I’ll follow up on whether the Bluetooth mic is good enough, or if a better mic is needed.

Update: Undoing the process!

It’s necessary to restore the normal audio routing after the call; you can do this with System Preferences.

  1. Open System Preferences and select Sound.
  2. Set Input to Internal Microphone. If you’re wearing the ERA, it will make a little descending bleep to let you know it’s been disconnected.
  3. Set Output to Internal Speakers.
  4. Quit both copies of LineIn.
  5. Check the Soundflowerbed menu; it should have both Soundflower (2ch) and Soundflower (64ch) pointing to None. Quit Soundflowerbed.
  6. Turn off the Bluetooth headset; put it on its charger for a while.
  7. Quit Skype.

You should be all set.

Pure majority rule considered harmful

I’ve been discussing an issue on Perlmonks over the past couple days; specifically the potential for abuse of the anonymous posting feature. I’ve seen numerous threads go by discussing this, most of which have focused on restricting the anonymous user. Since the anonymous user’s current feature set seems to be a noli me tangere, I proposed an alternative solution similar to Twitter’s blocking feature. One of the site maintainers very cordially explained why my proposal was not going to be adopted, and in general I’d just let this drop – but I received another comment that I can’t just let pass without comment. To quote:

I’m saying “This isn’t a problem for the overwhelming majority, therefore it is not a problem.”

I’d like to take a second and talk about this particular argument against change, and why it is problematic. This is not about Perlmonks. This is not about any particular user. This is about a habit of thought that can be costly both on a job-related and personal level.

Software engineering is of necessity conservative. It’s impossible to do everything that everyone wants, therefore we have to find reasons to choose some things and not others. And as long as the reasons are honest and based on fact and good reasoning, then they are good reasons. They may not make everyone happy (impossible to do everything), but they do not make anyone feel as if their needs are not being carefully considered. But, because we’re all human, sometimes we take our emotional reactions to a proposal and try to justify those with a “reason” that “proves” our emotional reaction is right.

In this case, what is said here is something I’ve seen in many places, not just at Perlmonks: the assumption that unless the majority of the people concerned have a problem, there’s no good reason to change; the minority must put up with things as they are or leave. Secondarily, if there is no “perfect” solution (read: a solution that I like), then doing nothing is better than changing.

There is a difference between respectfully acknowledging that a problem exists, and taking the time to lay out why there are no good solutions within the existing framework, including the current proposal, as the maintainer did – and with which I’m satisfied – and saying “everyone else is happy with things as they are”, end of conversation.

The argument that the majority is perfectly happy with the status quo says several things by implication: the complainer should shut up and go along; the complainer is strange and different and there’s something wrong with them; they do not matter enough for us to address this.

Again, what I’m talking about is not about Perlmonks.

As software engineers, we tend to lean on our problem-solving skills, inventiveness, and intelligence. We use them every day, and they fix our problems and are valuable (they are why we get paid). This means we tend to take them not only to other projects, but into our personal lives. What I would want you to think about is whether you have accepted that stating “everyone else is happy with things as they are” is a part of your problem-solving toolkit. The idea that “the majority doesn’t have a problem with this” can morph into “I see myself as a member of the majority, so my opinions must be the majority’s opinions; since the majority being happy is sufficient to declare a problem solved, asserting my opinion is sufficient – the majority rule applies because I represent the majority”.

This shift can be poisonous to personal relationships, and embodies a potential for the destruction of other projects – it becomes all too easy to say the stakeholders are being “too picky” or “unrealistic”, or to assume that a romantic partner or friend should always think the same way you do because “most people like this” or “everybody wants this” or “nobody needs this” – when in actuality you like it or want it or don’t need it. The other person may like, need, or want it very much – and you’ve just said by implication that to you they’re “nobody” – that they don’t count. No matter how close a working or personal relationship is, this will sooner or later break it.

Making sure you’re acknowledging that what others feel, want, and need is as valid as what you feel, want, and need will go a long way toward dismantling these implicit assumptions that you are justified in telling them how they feel and what should matter to them.

youtube-dl: it just works

I was having trouble watching the Théâtre du Châtelet performance of Einstein on the Beach at home; my connection was stuttering and buffering, which makes listening to highly-pulsed minimalist music extremely unrewarding. Nothing like a hitch in the middle of the stream to throw you out of the zone that Glass is trying to establish. (This is a brilliant staging of this opera and you should go watch it Right Now.)

So I started casting around for a way to download the video and watch it at my convenience. (Public note: I would never redistribute the recording; this is solely to allow me to timeshift the recording such that I can watch it continuously.) I looked at the page and thought, “yeah, I could work this out, but isn’t there a better way?” I searched for a downloader for the site in question, and found it mentioned in a comment in the GitHub pages for youtube-dl.

I wasn’t 100% certain that this would work, but a quick perusal seemed to indicate that it was a nicely sophisticated Python script that ought to be able to do the job. I checked it out and tried a run; it needed a few things installed, most importantly ffmpeg. At this point I started getting a little excited, as I knew ffmpeg should technically be quite nicely able to do any re-encoding etc. that the stream might need.

A quick brew install later, I had ffmpeg, and I asked for the download (this is where we’d gotten to while I’ve been writing this post):

$ youtube_dl/__main__.py http://culturebox.francetvinfo.fr/einstein-on-the-beach-au-theatre-du-chatelet-146813
 [culturebox.francetvinfo.fr] einstein-on-the-beach-au-theatre-du-chatelet-146813: Downloading webpage
 [culturebox.francetvinfo.fr] EV_6785: Downloading XML config
 [download] Destination: Einstein on the beach au Théâtre du Châtelet-EV_6785.mp4
 ffmpeg version 1.2.1 Copyright (c) 2000-2013 the FFmpeg developers
 built on Jan 12 2014 20:50:55 with Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
 configuration: --prefix=/usr/local/Cellar/ffmpeg/1.2.1 --enable-shared --enable-pthreads --enable-gpl --enable-version3 --enable-nonfree --enable-hardcoded-tables --enable-avresample --enable-vda --cc=cc --host-cflags= --host-ldflags= --enable-libx264 --enable-libfaac --enable-libmp3lame --enable-libxvid
 libavutil 52. 18.100 / 52. 18.100
 libavcodec 54. 92.100 / 54. 92.100
 libavformat 54. 63.104 / 54. 63.104
 libavdevice 54. 3.103 / 54. 3.103
 libavfilter 3. 42.103 / 3. 42.103
 libswscale 2. 2.100 / 2. 2.100
 libswresample 0. 17.102 / 0. 17.102
 libpostproc 52. 2.100 / 52. 2.100
 [h264 @ 0x7ffb5181ac00] non-existing SPS 0 referenced in buffering period
 [h264 @ 0x7ffb5181ac00] non-existing SPS 15 referenced in buffering period
 [h264 @ 0x7ffb5181ac00] non-existing SPS 0 referenced in buffering period
 [h264 @ 0x7ffb5181ac00] non-existing SPS 15 referenced in buffering period
 [mpegts @ 0x7ffb52deb000] max_analyze_duration 5000000 reached at 5013333 microseconds
 [mpegts @ 0x7ffb52deb000] Could not find codec parameters for stream 2 (Unknown: none ([21][0][0][0] / 0x0015)): unknown codec
 Consider increasing the value for the 'analyzeduration' and 'probesize' options
 [mpegts @ 0x7ffb52deb000] Estimating duration from bitrate, this may be inaccurate
 [h264 @ 0x7ffb51f9aa00] non-existing SPS 0 referenced in buffering period
 [h264 @ 0x7ffb51f9aa00] non-existing SPS 15 referenced in buffering period
 [hls,applehttp @ 0x7ffb51815c00] max_analyze_duration 5000000 reached at 5013333 microseconds
 [hls,applehttp @ 0x7ffb51815c00] Could not find codec parameters for stream 2 (Unknown: none ([21][0][0][0] / 0x0015)): unknown codec
 Consider increasing the value for the 'analyzeduration' and 'probesize' options
 Input #0, hls,applehttp, from 'http://ftvodhdsecz-f.akamaihd.net/i/streaming-adaptatif/evt/pf-culture/2014/01/6785-1389114600-1-,320x176-304,512x288-576,704x400-832,1280x720-2176,k.mp4.csmil/index_2_av.m3u8':
 Duration: 04:36:34.00, start: 0.100667, bitrate: 0 kb/s
 Program 0
 Metadata:
 variant_bitrate : 0
 Stream #0:0: Video: h264 (Main) ([27][0][0][0] / 0x001B), yuv420p, 704x396, 12.50 fps, 25 tbr, 90k tbn, 50 tbc
 Stream #0:1: Audio: aac ([15][0][0][0] / 0x000F), 48000 Hz, stereo, fltp, 102 kb/s
 Stream #0:2: Unknown: none ([21][0][0][0] / 0x0015)
 Output #0, mp4, to 'Einstein on the beach au Théâtre du Châtelet-EV_6785.mp4.part':
 Metadata:
 encoder : Lavf54.63.104
 Stream #0:0: Video: h264 ([33][0][0][0] / 0x0021), yuv420p, 704x396, q=2-31, 12.50 fps, 90k tbn, 90k tbc
 Stream #0:1: Audio: aac ([64][0][0][0] / 0x0040), 48000 Hz, stereo, 102 kb/s
 Stream mapping:
 Stream #0:0 -> #0:0 (copy)
 Stream #0:1 -> #0:1 (copy)
 Press [q] to stop, [?] for help
 frame=254997 fps=352 q=-1.0 size= 1072839kB time=02:49:59.87 bitrate= 861.6kbits/s

Son of a gun. It works.

I’m waiting for the download to complete to be sure I got the whole video, but I am pretty certain this is going to work. Way better than playing screen-capture games. We’ll see how it looks when we’re all done, but I’m quite pleased to have it at all. The download appears to be happening at about 10x realtime, so I should have it all in about 24 minutes, give or take (it’s a four-hour, or 240 minute, presentation).

Update: Sadly, this does not work for PBS videos, but you can actually buy those; I can live with that.

Test::Routine slides

This is my Test::Routine slide deck for the presentation I ended up doing from memory at the last SVPerl.org meeting. I remembered almost all of it except for the Moose trigger and modifier demos – but since I didn’t have any written yet, we didn’t miss those either!


The Node.js “he”/”they” Change: Analysis of a Social Bug

The Node.js foofaraw – concerning a fix meant to remove a “he” and switch it to a “they” – has gone all the way from a one-word patch to a monstrously-long comment chain on the patch and a core contributor resigning from the project.

The controversy continues a week later, with opinions ranging from “good riddance” to “how terrible that people would make a good programmer quit the project”. I’d like to step back and try to do what good programmers do when something fails in a spectacular way: look at what the situation was, what happened, and try to determine not only a cause but a way to prevent the issue in the future.

Rather than spend a lot of time on the deep analysis first, I’m going to go straight to my conclusion, and then illustrate why I think it’s true.

The social bug

The problem was neither completely a software problem nor completely a social problem; it was caused by multiple confusions of software criteria for social ones (and vice versa), and of the essence of software with its representation, followed by a failure to see that cohesion was necessary to help correct a community-wide problem.

Node.js is both a software project and a social group. There is code: an agreed-upon, human-intelligible means of communicating information about a set of designs and procedures to other humans, such that the chosen representation of that information can be turned into a different representation that can be executed by a computer. This is shared among the people who are working on it, and all of the people working on it submit proposed changes to a set of core committers who decide what goes in and what doesn’t based on their technical expertise, the quality of the submissions, and the overall goals of the project. So far so good.

Software, however, is not only the expression of algorithms and design, but an expression of the community’s standards, especially when it is a public project. Because we are not computers ourselves, that communication will by necessity include desires, impulses, preconceived ideas, and all those other messy things that go along with being human. In some places the community of readers and writers will share nearly all the same ideas and goals; in others they will have large differences.

So it’s possible, even likely, that “good” software – it executes properly, meets its design goals, produces proper results – may communicate a personal or social message that raises a problem for members of the group on a personal level. This is a social bug.

Fixing a social bug

Fixing a social bug requires a very different set of talents and procedures than software debugging does. Among these are careful listening and a willingness to take enough time to reach an agreement, or at least an understanding; a willingness to accept that bad judgement and errors in solving a social bug can cause problems far worse than the original bug; and that sometimes the only tools that can fix them are personal responsibility and acceptance, with ensuing personal costs.

“Too small a change”

The Node.js failure occurred because Ben evaluated a social bug patch as a software patch. The specific change was a one-word change to a comment – a change to a comment is one of the clear signs that this was a human issue instead of a software one. Second, the change was gender-related. Most software developers during the current era are aware that a gender-related question is almost certainly going to be a social issue instead of a software one. Not seeing this and switching to a different problem-solving paradigm was the first error.

Causes for this first error are quite obscure. The very quick escalation of the problem caused by the lack of followup communication (see below) led to it being difficult to see what the proximate cause of the error was. It is possible that the initial evaluation of the change as insignificant was triggered by a cursory look at the patch: (paraphrasing) “one word in a comment? this isn’t worth it”, but we can’t say for sure.

The first error could have been avoided in a couple ways. If Ben had spotted this as a social issue immediately and had deployed social problem-solving immediately, it’s possible that this problem could have been resolved in a couple minutes. Possibly a lack of experience or training in dealing with social issues is the base reason for this particular failure; training, either formal or informal, in dealing with social issues is recommended to provide a base to work from.

“Works for me”

The second error occurred when other users filed “votes” for this social bug; they were attempting to communicate that the social problem was a problem for them as well, and these reports were seemingly ignored – there was no response for some time – or brushed off with a statement that the patch was not significant enough.

This failure can be summed up as a “works for me” closure for a social bug, which, in an open source project, will more likely exacerbate the problem than fix it. Closing a social bug as “works for me” communicates to the person reporting it that the responder disregards the fact that the reporter is not the same as the responder, and that the item complained about is not “working” for the reporter; otherwise it would not have been reported! “Works for me” for a social bug communicates “you’re taking this too seriously” or “this doesn’t mean anything, you should ignore it”.

The solution to this situation is to engage the reporters. Talk to them, find out their reasons for reporting the bug, take their input seriously. It may not make sense immediately, but it is critical to be seen as open, willing to listen, and accepting. You may need to say “I’m sorry, I had no idea this was the case.” Apologizing at this point is far easier than doing so after arguing against the reporters’ feelings and thoughts. Only after listening should you take any action. You should offer to listen in private so that persons who might feel at risk in speaking in public can feel safe in speaking to you. You may be on the receiving end of some anger and frustration; do your best to accept it as a communication of those feelings rather than responding to its face value. You do not have to be a doormat; you may ask for less emotionally-loaded communication, but only after acknowledging the sender has a right to those feelings and that you understand that they feel upset/angry/frustrated. Your job is to take all this in and return understanding.

Setting up a private conversation would have been ideal; a second-best would be to have said, “I can see this is more important to people than I thought; I understand this, but I’m still of the opinion this change by itself is smaller than we normally prefer to commit. Can we come up with a solution that expands the scope of the patch – maybe do an audit and clean it all up – and I’ll gladly commit that – or is there another possibility? Let’s talk about this – write me at XXXX@YYYY.ZZZ”.

“Consider yourself chided”

At this point, Isaac attempted to simply solve the social bug by merging the fix; unfortunately Ben apparently continued to view this as a software issue, and reverted the patch with comments about procedures and “chiding” Isaac, who was trying to head off the social train wreck. This sent the message (whether justified or not) that Ben had an agenda and was actively engaged in retaining the social bug, thereby escalating the bug from a small issue to a community-wide one of “what kind of message do the responsible members of the community want to send about this issue?”.

Several problems occurred here. A secondary social issue, no doubt amplified by the Joyent/Strongloop rivalry connected with Node.js, was aired in public instead of sorted out in private. The appearance of dissension among the core committers sent a bad social message – that the basic values of the community were indeed in conflict. This led to the airing of less and less productive attitudes and attacks.

Other persons at Ben and Isaac’s respective employers have explained that the issue was caused by Ben’s not understanding that the use of a gendered pronoun was so loaded. Perhaps this is true; given the amount of discussion of this issue over the past year or so, it seems unlikely. However, a number of people attempted to communicate that this really was an important issue. As far as can be seen, Ben did not engage with them when they tried to communicate this really was a big deal and that he should pay attention. It is always a failure in a social bug situation to appear to not care.

At that point, many different factions within the community – who, before the bug was worsened into one of community principles, had not even noticed the patch – became involved. By this point the discussion had already spread to Twitter, pulling in other persons for whom this was indeed a social bug that mattered to them, myself included. It also pulled in a number of persons who were coming to the “defense” of the committer, further increasing the appearance of dissension in the ranks, and leading to YouTube levels of argument. In retrospect, joining the discussion was not productive, and I should not have done so. Trying private communications first would have been the right call; if there were no other way to communicate, trying to talk to Ben directly might have been acceptable; replying to people arguing with me was definitely not, and I should not have allowed myself to do so. (Again, my apologies to Isaac, who was trying to tamp down the social problem; I’m sorry to have made it harder on you.)

Many of the most rancorous discussions came out of trying to pretend that the software was an entity divorced from its human representation, and therefore social bug reports about the code were inane, hypocritical, or the result of ulterior motives (“white knight” was bandied around with vigor). Unfortunately there was no one at the upper levels of the Node.js informal hierarchy with the ability to choke off the argument (GitHub does not have a means of limiting discussion on a patch), and the core committers as a group were unable to, unwilling to, or simply did not think of establishing a united front and announcing a social bug solution. Isaac deployed a number of good social bug patches (language usage standards, acceptance of the patch, a definite statement that Node.js was committed to being inclusive), but the solidarity of the group had been damaged.

Solutions for this? When a social situation is spiraling out of control, the first task is to restore a consensus. It may be necessary to impose a cool-down period; discussion of the topic is barred in the public forum but encouraged privately. If a cool-down cannot be imposed (as in this case, where commenting could not be blocked), then the putative leaders must establish their own working consensus and reiterate it until it is clear that there is a consensus for now; that observations and complaints will be listened to and all points of view will be considered; that it is clear that there is a problem and that it does need to be fixed; and that the current decision is not necessarily the permanent last word on the subject, but it is the current decision of the leadership of the project, and that it is the end of the public discussion for now. Concerned parties are encouraged to talk to the leadership to help shape policy in this area.

Resignation

Ben has resigned from the project. I am sorry, as he has been a valued participant and has contributed a lot of code. This is the “everybody loses” solution to dissension: one person or another quits or is forced out.

In a hypothetical “everybody wins” version, the people who had the argument are required to resolve it – privately – and to come to an agreement. This may require one or all of the participants to apologize: to each other, to the community, perhaps to others outside it. The agreement is then presented jointly by those who were arguing.

Any further discussion of the topic is cut off by the person on the “opposite” side: in this hypothetical instance, if someone was defending the initial refusal to commit, it would be Ben’s responsibility to step in and say, “we’ve resolved this, and we don’t need to discuss it further here. If you need to talk to us about it, write me at XXXXX@YYYYY.ZZZ.” If someone was saying, “Well, Isaac was right to override,” then it would be Isaac’s responsibility to do the same. If someone simply insists on discussing feminism, or language, or someone’s motivations, any one of the participants should say “speaking for all of us, we’re done with this now; this is the policy. If you don’t like the policy, send your objections and suggested fixes to XXXXX@YYYYY.ZZZ.”

“Asshole”

During this period, various official entities published blog posts supporting one committer (the Joyent “asshole”/”fire” post) or the other (the Strongloop “second language” post); none of these did much except make one set of people happy and another unhappy.

The Joyent posting chose loaded language (e.g., “asshole”) to describe behavior; worse, “asshole” was not used in a way that made it clear that someone can act like an asshole, but that this does not necessarily mean that they are permanently and unreservedly an asshole. Certain behavior on the first committer’s part was socially inept and appeared condescending and somewhat hostile to an outside observer.

The only real solution, difficult as it is, when someone calls you an asshole is to stop and re-evaluate your behavior to understand why they are saying it. If your re-evaluation of your actions causes you to realize you were wrong, then you need to say so. Even if your evaluation says you were right, something has caused the name-caller a problem, and for the continued social good health of the project, you need to figure out what it is. This will probably entail talking to someone who is good and mad at you, and it will probably be very uncomfortable. You may have to take timeouts from the conversation. You will probably have to apologize. You will almost certainly have to change your actions and probably your ideas, unless a neutral observer (not someone “on your side”) agrees that the name-caller really is off in na-na land.

Conclusions

It is, yes, a shame when knowledge leaves a project, or when someone loses their enthusiasm for it and gives up on it. It is not a shame that people were willing to stick their necks out and say, “I think that this decision does not reflect well on the project”, especially when some of those people have a lot to lose because of it. (I’ve been in a conversation where someone has actually offered the opinion that if a person using a particular ID is being verbally harassed at that ID, the right solution is for them to abandon that ID and move to another. Apparently the harassers shouldn’t have to do so.)

Persons who have a high profile in a public shared project do need to be willing to listen; to say they are sorry; to say thank you to someone who points out a mistake, no matter the language in which this is done. If you have inflicted a social bug’s results on someone, you don’t get to decide what reaction is appropriate; you don’t get to decide how many people are allowed to react; you don’t get to decide how someone is allowed to speak to you about it. You only get to decide whether or not to say something like “Holy crap. I didn’t realize. Thanks for telling me. I’m sorry about this.” If you decide not to, you may be acting like an asshole. If you always decide not to, you may be an asshole, for the purposes of people who observe this and then give up trying to interact with you.

When did world computing power pass the equivalent of one iPhone?

[This was originally asked on Quora, but the result of figuring this out was interesting enough that I thought I'd make it a blog post.]

It’s a very interesting question, because there are so many differing kinds of computing capability in that one device: the parallel processing power of the GPU (slinging bits to the display) and the straight-ahead FLOPS of the ARM processor.

Let’s try some back of the envelope calculations and comparisons.

The iPhone 5’s A6 processor is a dual-core, triple-GPU device. The first multiprocessor computer was the Burroughs D825 (defense-only, of course).

Burroughs D825 multiprocessor mainframe

A D825 had 1 to 4 processors. At ~0.070 s per operation, divide, the slowest operation, ran at ~14 FLOPS; add, the fastest, at 166 FLOPS; and multiply at ~25 FLOPS. Let’s assume adds are 10x more frequent than multiplies and divides to come up with an average speed of 35 FLOPS per processor, so 70 FLOPS for a 2-processor D825, handwaving CPU synchronization, etc.

Let’s take the worst number from the Geekbench stats via AnandTech for the iPhone 5’s processor: 322 MFLOPS doing a dot product, a pure-math operation reasonably similar to the calculations being done at the time in 1962. Note that’s MFLOPS. Millions. To meet the worst performance of the iPhone 5 with the most optimistic estimate of a 2-processor Burroughs D825’s performance, you’d need 4.6 million of them.

I can state confidently that there were not that many Burroughs D825s available in 1962, so there’s a hard lower bound at 1962. The top-end supercomputer at that point was probably the IBM 7090, at 0.1 MFLOPS.

IBM 7090

We’d still have needed 3200 of those. In 1960, there were about 6000 computers in total (per IBM statistics; 4000 of those were IBM machines), and very few in the 7090 range. Throwing in all other computers worldwide, let’s say we double that number for 1962 – we’re still way behind the iPhone.

Let’s move forward. The CDC 7600, in 1969, averaged 10 MFLOPS with hand-compiled code, and could peak at 35 MFLOPS.

CDC-7600

Let’s go with the 10 MFLOPS: to equal a single iPhone 5, you’d need 32 of them. Putting aside the once-a-day (and sometimes 4-5x a day) multi-hour breakdowns, it’s within the realm of possibility that the CDCs in existence at that time could, on their own, equal or beat an iPhone 5 (assuming they were actually running). So all the computing in the world probably easily equalled or surpassed an iPhone 5 in straight compute ability by that point, making 1969 the top end of our range.

So without a lot of complicated research, we can narrow it down to somewhere in the seven-ish years between 1962 and 1969, closer to the end than the start. (As a note, the Cray-1 didn’t make the scene till 1975, with a performance of 80 MFLOPS, a quarter of an iPhone; in 1982, the Cray X-MP hit 800 MFLOPS, or 2.5 iPhones.)
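The envelope arithmetic above can be collected in a few lines. A sketch, using only the post’s own figures (the D825 number assumes the ~35 FLOPS-per-processor estimate from earlier):

```python
# Back-of-the-envelope ratios; all FLOPS figures are the post's own
# estimates, not measured benchmarks.
IPHONE5 = 322e6  # worst Geekbench dot-product figure for the iPhone 5

machines = {
    "Burroughs D825, 2 CPUs": 70,      # 2 x ~35 FLOPS per processor
    "IBM 7090":               0.1e6,
    "CDC 7600":               10e6,
    "Cray-1 (1975)":          80e6,
    "Cray X-MP (1982)":       800e6,
}

for name, flops in machines.items():
    # How many of each machine it takes to match one iPhone 5.
    print(f"{name}: {IPHONE5 / flops:,.1f} machines per iPhone 5")
```

This reproduces the numbers in the text: 4.6 million D825s, ~3200 IBM 7090s, 32 CDC 7600s, a Cray-1 at a quarter of an iPhone, and an X-MP at about 2.5 iPhones.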

And we haven’t talked about the GPUs, which are massively parallel processors the likes of which were uncommon until the 1980s. Even the top-end graphics machines of the 1962-1969 era couldn’t equal the performance of the iPhone’s GPU given weeks or months to work on rendering – and there were no output devices with the pixel density of the iPhone’s display capable of responding in real time. But on the basis of raw compute power, somewhere after the Beatles and before the moon landing. Making a finer estimate, I’d guess somewhere in late 1966, so let’s call it somewhere around the last Gemini mission, or Doctor Who’s first regeneration.

On rereading the question I saw that the asker wanted the numbers for an iPhone 4 instead of a 5. Given the amount of handwaving I’m doing anyway, I’d say we’re still talking about close to the same period but a bit later. Without actual numbers as to the computers in use at the time, which I don’t think I can dig up without much more work than I’m willing to do for free, it’s difficult to be any closer than a couple years plus or minus. Definitely before Laugh-In (1968), definitely after the miniskirt (1964).

iPhone 5s update: the 5s is about 1.75 times faster than the 5, so that puts us at a rough 530 MFLOPS. The computing power estimate becomes much harder at this point, as minicomputers start up about 1969 (the PDP-11 and the Data General Nova). The Nova sold 50,000 units, equivalencing out to about 130 MFLOPS; total PDP-11s sold “during the 1970s” was 170,000, for a total of 11 GFLOPS (based on the 11/40 as my guess as to the most-often-sold machine); divide that by ten and then take half of that for a rough estimate, and the PDP-11s by themselves equivalence to one 5s. So I’ll say that the moon landing was probably about the equivalence point for the 5s, but the numbers are much shakier than they are for the 4 or 5, so call it around the first message sent over ARPANet at the end of October 1969. (Side note: this means that the average small startup in Silicon Valley today – 20 or so people – is carrying about the equivalent power of all the PDP-11s sold during the 1970s in their pockets and purses.)
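The 5s arithmetic can be checked the same way. A sketch using the post’s figures (note that 1.75 x 322 MFLOPS lands nearer 560 than 530 MFLOPS, which is well within the handwaving here):

```python
# iPhone 5s vs. the 1970s PDP-11 fleet, using the post's estimates.
IPHONE5S = 322e6 * 1.75      # ~563 MFLOPS; the post rounds to ~530

PDP11_TOTAL = 11e9           # all PDP-11s sold "during the 1970s"
# Divide by ten, then halve, for the post's rough in-service estimate:
pdp11_rough = PDP11_TOTAL / 10 / 2

print(f"PDP-11 fleet estimate: {pdp11_rough / 1e6:.0f} MFLOPS")
print(f"one iPhone 5s:         {IPHONE5S / 1e6:.0f} MFLOPS")
```

The two numbers come out within a few percent of each other, which is what puts the 5s equivalence point around 1969.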

Past this, world computing power is too hard to track without a whole lot of research, so take this as the likely last point where I can feel comfortable making an estimate.

Intro to Perl Testing at SVPerl

A nice evening at SVPerl – we talked about the basic concepts of testing, and walked through some examples of using Test::Simple, Test::More, and Test::Exception to write tests. We did a fair amount of demo that’s not included in the slides – we’ll have to start recording these sometime – but you should be able to get the gist of the talk from the slides.

CrashPlan folder date recovery

The situation: a friend had a MacBook Air whose motherboard went poof. Fortunately she had backups (almost up-to-date) in CrashPlan, so she did a restore of her home directory, which worked fine in that she had her files, but not so fine in that all the folder last-changed dates now ran from the current date to a couple days previous (it takes a couple days to recover ~60GB of data).

This was a problem for her, because she partly uses the last-changed date on her folders to help her keep organized. “When was the last time I did anything on project X?” (I should note: she uses Microsoft Word and a couple different photo library managers, so git or the equivalent doesn’t work well for her workflow. She is considering git or the like now for her future text-based work…)

A check with CrashPlan let us know that they did not track folder update dates and couldn’t restore them. We therefore needed to come up with a way to re-establish as best we could what the dates were before the crash.

Our original thought was simply to start at the bottom and recursively restore the folder last-updated dates using touch -t, taking the most-recently-updated file in the folder as the folder’s last-updated date. Some research and thought turned up the following:

  • Updating a file updated the folder’s last-updated date.
  • Updating a folder did not update the containing folder’s last-updated date.
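You can verify the second observation on your own filesystem. A minimal sketch (POSIX semantics: a directory’s mtime changes when entries are added or removed in it, not when something changes deeper in the tree; the paths here are throwaway temp-dir examples):

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as top:
    inner = os.path.join(top, "inner")
    os.mkdir(inner)

    top_before = os.stat(top).st_mtime_ns
    inner_before = os.stat(inner).st_mtime_ns
    time.sleep(1.1)  # step past coarse filesystem timestamp granularity

    # Adding an entry to `inner` touches inner's mtime...
    open(os.path.join(inner, "newfile"), "w").close()

    inner_changed = os.stat(inner).st_mtime_ns > inner_before
    # ...but does not touch the mtime of the folder containing `inner`.
    top_changed = os.stat(top).st_mtime_ns != top_before

print("inner folder updated:", inner_changed)
print("containing folder updated:", top_changed)
```

On a typical Linux or macOS filesystem this prints True for the inner folder and False for its parent, which is exactly why a deeply nested update never bubbles up on its own.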

This meant that we couldn’t precisely guarantee that the folder’s last-updated date would accurately reflect the last update of its contents. We decided in the end that the best strategy for her was to “bubble up” the last-updated dates by checking both files and folders contained in a subject folder. This way, if a file deep in the hierarchy is updated, but the files and folders above it have not been, the file’s last-updated date is applied to its containing folder, and subsequently is applied also to each containing folder (since we’re checking both files and folders, and there’s always a folder that has the last-updated date that corresponds to the one on the deeply-nested file). This seemed like the better choice for her as she had no other records of what had been worked on when, and runs a very nested set of folders.

If you were running a flatter hierarchy, only updating the folders to the last-updated date of the files might be a better choice.  Since I was writing a script to do this anyway, it seemed reasonable to go ahead and implement it so that you could choose to bubble up or not as you liked, and to also allow you to selectively bubble-up or not in a single directory.

This was the genesis of date-fixer.pl. Here’s the script. A more detailed example of why neither approach to restoring the folder dates is perfect is contained in the POD.

use strict;
use warnings;
use 5.010;
 
=head1 NAME
 
date-fixer.pl - update folder dates to match newest contained file
 
=head1 SYNOPSIS
 
date-fixer.pl --directory top_dir_to_fix
             [--commit]
             [--verbose]
             [--includefolders]
             [--single]
 
=head1 DESCRIPTION
 
date-fixer.pl is meant to be used after you've used something like CrashPlan
to restore your files. The restore process will put the files back with their
proper dates, but the folders containing those files will be updated to the
current date (the last time any operation was done in this folder -
specifically, putting the files back).
 
date-fixer.pl's default operation is to tell you what it would do; if you want
it to actually do anything, you need to add the --commit argument to force it
to actually execute the commands that change the folder dates.
 
If you supply the --verbose argument, date-fixer.pl will print all the commands
it is about to execute (and if you didn't specify --includefolders, warn you
about younger contained folders - see below). You can capture these from STDOUT
and further process them if you like.
 
=head2 Younger contained folders and --includefolders
 
Consider the following:
 
    folder1           (created January 2010 - date is April 2011)
        veryoldfile 1 (updated March 2011)
        oldfile2      (updated April 2011)
        folder2       (created June 2012 - date is July 2012)
            newfile   (updated July 2012)
 
If we update folder1 to only match the files within it, we won't catch that
folder2's date could actually be much more recent than that of either of the
files directly contained by folder1. However, if we use contained folder dates
as well as contained file dates to calculate the "last updated" date of the
current folder, we may make the date of the current folder considerably more
recent than it may actually have been.
 
Example: veryoldfile1 and oldfile2 were updated in March and April 2011.
Folder2 was created in June 2012, and newfile was added to it in July 2012.
The creation of folder2 updates the last-updated date of folder1 to June 2012;
the addition of newfile updates folder2's last-updated date to that date --
but the last-updated date of folder1 does not change - it remains June 2012.
 
If we restore all the files and try to determine the "right" dates to set the
folder update dates to, we discover that there is no unambiguous way to decide
what the "right" dates are. If we use the file dates, alone, we'll miss that
folder2 was created in June (causing folder1 to update to June); if we use
both file and folder dates, we update folder1 to July 2012, which is not
accurate either.
 
date-fixer.pl takes a cautious middle road, defaulting to only using the files
within a folder to update that folder's last-modified date. If you prefer to
ensure that the newest date buried in a folder hierarchy always "bubbles up"
to the top, add the --includefolders option to the command.
 
date-fixer will, in verbose mode, print a warning for every folder that
contains a folder younger than itself; you may choose to go back and adjust
the dates on those folders with
 
date-fixer.pl --directory fixthisone --includefolders --single
 
This will, for this one folder, adjust the folder's last-updated date to the
most recent date of any of the items contained in it.
 
=head1 USAGE
 
To fix all the dates in a directory and all directories below it, "bubbling
up" dates from later files:
 
    date-fixer.pl --directory dir --commit --includefolders
 
To fix the dates in just one directory based on only the files in it and
ignoring the dates on any directories it contains:
 
    date-fixer.pl --directory dir --commit --single
 
To see in detail what date-fixer is doing while recursively fixing dates,
"bubbling up" folder dates:
 
    date-fixer.pl --directory dir --commit --verbose --includefolders
 
=head1 NOTES
 
"Why didn't you use File::Find?"
 
I conceived the code as a simple recursion; it seemed much easier to go ahead and read the directories
myself than to go through the mental exercise of transforming the treewalk into an iteration such as I
would need to use File::Find instead.
 
=head1 AUTHOR
 
Joe McMahon, mcmahon@cpan.org
 
=head1 LICENSE
 
This code is licensed under the same terms as Perl itself.
 
=cut
 
use Getopt::Long;
use Date::Format;
 
my($commit, $start_dir, $verbose, $includefolders, $single);
GetOptions(
    'commit' => \$commit,
    'directory=s' => \$start_dir,
    'verbose|v' => \$verbose,
    'includefolders' => \$includefolders,
    'single' => \$single,
);
 
$start_dir or die "Must specify --directory\n";
 
set_date_from_contained_files($start_dir);
 
sub set_date_from_contained_files {
    my($directory) = @_;
    return unless defined $directory;
 
    opendir my $dirhandle, $directory
        or die "Can't read $directory: $!\n";
    # readdir in list context returns all the entries at once; relying on
    # readdir to set $_ in a while condition requires Perl 5.12+.
    my @contents = readdir $dirhandle;
    closedir $dirhandle;
 
    # Exclude only the literal '.' and '..' entries; the obvious regex
    # /\.$|\.\.$/ would also throw away any name ending in a dot.
    @contents = grep { $_ ne '.' and $_ ne '..' } @contents;
    my @dirs = grep { -d "$directory/$_" } @contents;
 
    my %dirmap;
    @dirmap{@dirs} = ();

    my @files = grep { !exists $dirmap{$_} } @contents;
 
    # Recursively apply the same update criteria unless --single is on.
    unless ($single) {
        foreach my $dir (@dirs) {
            set_date_from_contained_files("$directory/$dir");
        }
    }
 
    my $most_recent_date;
    if (! $includefolders) {
         $most_recent_date = most_recent_date($directory, @files);
         my $most_recent_folder = most_recent_date($directory, @dirs);
         # Warn (in verbose mode, per the POD) only when a contained folder
         # really is newer than the newest file in this folder.
         if ($verbose and defined $most_recent_folder
             and (!defined $most_recent_date
                  or $most_recent_folder gt $most_recent_date)) {
             my $file_date = $most_recent_date // 'none';
             warn "Folders in $directory are more recent ($most_recent_folder) than the most-recent file ($file_date)\n";
         }
    }
    else {
         $most_recent_date = most_recent_date($directory, @files, @dirs);
    }
 
    if (defined $most_recent_date) {
        # List-form system() bypasses the shell, so the path needs no quoting.
        my @command = (qw(touch -t), $most_recent_date, $directory);
        print "@command\n" if $verbose;
        system @command if $commit;
    }
    else {
        warn "$directory unchanged because it is empty\n" if $verbose;
    }
}
 
sub most_recent_date {
    my ($directory, @items) = @_;
    my @dates     = map { (stat "$directory/$_")[9] } @items;
    my @formatted = map { time2str("%Y%m%d%H%M.%S", $_) } @dates;
    # Sort descending with cmp (a comparator must return -1/0/1; `lt`
    # returns a boolean) so the newest timestamp comes first. Returns
    # undef when @items is empty.
    my @ordered   = sort { $b cmp $a } @formatted;
    return $ordered[0];
}