Code speaks

Enlightening perl's documentation (you too can help!)

2010-04-09T12:50:00.000+02:00

(This entry was first posted here)

Enlightened perl programmers often complain about how outdated most online perl manuals are. Truth is, some of the official documentation is quite outdated too. Perl ships with a lot of documentation, and some of it is old and badly in need of maintenance.

For example, a quick ack through the documentation showed about 250 cases of open used with a glob as its first argument (e.g. open FOO;), even one in perlstyle! I think everyone agrees that in 2010 that's no longer a good example. I don't think it has a place outside of perlfunc and perlopentut. The same argument goes for two argument open and use vars. I'm sure there are plenty of other issues that I can't think of right now.
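
To make the contrast concrete, here is a minimal sketch of the style the documentation could show instead. The scratch file and the variable names are only there for illustration; they are not taken from any of the manual pages mentioned above.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';

# Create a scratch file so the example has something to read.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "first line\n";
close $tmp or die "Can't close $path: $!";

# Old style, as still seen in the docs:   open FOO, $path;
# Modern style: lexical filehandle, explicit mode, checked result.
open my $in, '<', $path or die "Can't open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Old style:   use vars '$VERSION';
# Modern style: declare the package variable with our.
our $VERSION = '0.01';

print "read: $line\n";
```

The three-argument form keeps the mode separate from the filename, so a filename that happens to start with > or | can't change what open does.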

That's not hard to fix, in fact you could almost write a program for it. It may be a lot of work though to fix all of them, but not so much to fix them one by one.

Some things are harder. Some things are a lot more work. Take the docs on object oriented programming, for example: perlbot, perltoot and perltooc haven't seen major updates since 1995, 1997 and 1999 respectively. perlboot and perlobj seem to have had the most love over the past decade, but can still use some more attention. Surely we all agree a lot has happened in object oriented perl in the meantime. We don't directly assign to @ISA anymore in our modules, do we?
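
As a small sketch of what "not assigning to @ISA" can look like today, the following uses the parent pragma. The Animal and Dog classes are made up for this example, and -norequire is only needed because both classes live in one file.

```perl
use strict;
use warnings;

package Animal;   # hypothetical base class

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub speak {
    my ($self) = @_;
    return sprintf '%s says %s', $self->{name}, $self->sound;
}

package Dog;      # hypothetical subclass

# Instead of:  our @ISA = ('Animal');
use parent -norequire, 'Animal';

sub sound { 'Woof' }

package main;

my $message = Dog->new(name => 'Rex')->speak;
print "$message\n";
```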

But there's much more to improve in the documentation.

Why does perltrap list traps for perl4, awk and sed but not for modern languages like python, php or ruby?
Why doesn't perlipc use IO::* instead of low level functions?
Why isn't perlmodinstall Build.PL aware?

I suspect many other pieces of documentation that I haven't taken a good look at could also use a spring (autumn for inhabitants of the southern hemisphere) cleaning.

The good part of the story? You can help. This work can be done not only incrementally, but also distributed.

Edited to add: So how to get started? perlrepository has detailed information on how to submit patches. The easiest way is to fork perl at github and, when you're done, mail perl-5-porters about it.

DynaLoader considered harmful

2010-03-28T12:45:00.000+02:00

(This entry was first posted here)

DynaLoader is a portable high-level interface around your OS's dynamic library loading. It's the code that's loading your XS modules. It's actually doing a pretty good job at that. You may wonder then why I consider its use harmful.

If all you want to do is load the XS part of your module, it's the wrong tool for the job. Most of all because it has a truly awful interface. It requires you to inherit your module from it. It's common knowledge that public inheritance from an implementation detail is a really bad idea. It breaks not only encapsulation rather badly, but also violates separation of concerns.

This wouldn't be as bad as it is if DynaLoader didn't use AutoLoader. Because of this, when you call an undefined method on an instance of a class that derives from DynaLoader, you don't get this error:

Can't locate object method "undefined_method" via package "Foo"

But this rather cryptic error:

Can't locate auto/Foo/undefined_m.al in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.10.0 /usr/local/share/perl/5.10.0 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl .)

No way a perl novice will understand what's going on there!

The worst part may be that this interface buys us very little in practice. The inheritance is used only once in the DynaLoader code, to call the dl_load_flags function. Surely there has to be a better way to pass on that one bit of data!

One solution to this is to simply encapsulate the module loading in a different module. This is a working approach, but then you're reimplementing a module that has been in the perl core for a decade now: XSLoader. It's not perfect, but it will cover 98% of all XS users' needs with significantly fewer disadvantages.

Honestly, there are valid uses of DynaLoader, but standard XS modules just aren't one of them. Use XSLoader, or if that doesn't suit you, write a patch for it or a better wrapper and put it on CPAN*, but don't use DynaLoader directly.

* I might even do that myself.

Threads in Perl, Erlang style

2010-03-24T23:41:00.001+02:00

Adam Kennedy's recent post on threads in Padre reminded me to post about an experiment of mine. Last year I learned some Erlang. I really liked their model of multi-threading: many threads that share no data at all and communicate through message queues. A lot of other things were really annoying though, especially their crappy support for strings and lack of libraries in general. I kept thinking I want perl with these threads, so I started implementing it. And thus threads::lite was born.

The main difference between threads::lite and threads.pm is that t::l starts an entirely new interpreter instead of cloning the existing one. If you've loaded a lot of modules, that can be significantly quicker and leaner than cloning. As an optimization, it supports cloning itself right after module loading, so you can quickly start a large number of identical threads. Threads can be monitored, so that on thread death they send an exit code to their listeners.

Every thread has its own message queue. Every thread can send messages to any thread whose thread-id it knows. Any kind of data that can be serialized using Storable can be sent, including coderefs.

Receiving messages is done on basis of pattern matching (based on smart matching). This can range from a simple null-pattern (matches anything, so returns the front message on the queue) to complex tables of patterns.

I've started building some high level abstractions on top of it, most notably an implementation of a parallel map and grep. I'd like to do a map-reduce, but that's for a later stage.

This is very much an experimental module. I'm still not sure perl is really suitable for true erlang style threading: I suspect perl interpreters are too heavy to have 100000 of them in one process, but it would be interesting to try…

Prototypes: the good, the bad, and the ugly

2009-08-26T22:43:00.000+02:00

This is a reply to chromatic's post The Problem with Prototypes, but his blog didn't allow me to post it there so I post it here.

Pretty much everyone agrees that there are good prototypes (such as block and sometimes reference prototypes) and bad ones (scalar most of all). Few discuss the third class: the ugly glob prototype.

perlsub describes them as such:

A * allows the subroutine to accept a bareword, constant, scalar expression, typeglob, or a reference to a typeglob in that slot. The value will be available to the subroutine either as a simple scalar, or (in the latter two cases) as a reference to the typeglob.

In other words, they are the same as scalar prototypes, except that they also accept globs and barewords. This is mainly used to pass filehandles, like this:

sub foo (*) {...}
foo STDIN;

but in fact it can be used to pass any bareword to a function, as it leaves the interpretation of it to the function.

It's tempting to call this bad, but it offers some API possibilities that would otherwise not be possible, hence I would call it ugly rather than bad per se.
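
A minimal sketch of the filehandle case follows, using Symbol::qualify_to_ref as perlsub suggests for turning whatever arrives in a * slot into a glob reference. The function name first_line and the handle name DATA_IN are invented for this example.

```perl
use strict;
use warnings;
use Symbol 'qualify_to_ref';

# The * prototype lets callers pass a bareword, a typeglob
# or a reference to a typeglob in this slot.
sub first_line (*) {
    # Barewords arrive as plain strings, so resolve them
    # relative to the caller's package.
    my $fh   = qualify_to_ref(shift, caller);
    my $line = readline $fh;
    chomp $line if defined $line;
    return $line;
}

# An in-memory file keeps the example self-contained.
my $text = "alpha\nbeta\n";
open(DATA_IN, '<', \$text) or die "Can't open in-memory file: $!";
my $first = first_line(DATA_IN);   # bareword, allowed even under strict
close DATA_IN;

print "first line: $first\n";
```

Without the prototype, the bareword DATA_IN in the call would be rejected by strict subs; with it, the function decides what the name means.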

Casting magic against segfaults

2009-02-09T14:01:00.003+02:00

The problem

For years, there has been the Sys::Mmap module, however, it has a few issues. For example, let's take this piece of code:

use Sys::Mmap;
open my $fd, '+>', 'filename';
mmap my $var, -s $fd, PROT_READ|PROT_WRITE, MAP_SHARED, $fd, 0;
$var = 'Foobar';
munmap $var;

First of all, it's simply not user-friendly. mmap takes 6 arguments in a weird order, and uses weird constants. Also munmap shouldn't be necessary: variables should dispose of themselves when they go out of scope.

But more importantly, this program does not do what you think it does, though the only hint of that is an Invalid argument exception when doing munmap. During the assignment, the link between the mapping and the variable is lost, so nothing is written to the file. Worse yet, this can even lead to a segfault in some circumstances.

Ouch!

Tying things up?

The documentation clearly says that you shouldn't do this (or anything else that changes the length of the variable), but IMHO this hole shouldn't be left open in the first place, if only because it is extremely counterintuitive (and thus a maintenance nightmare). Modules should fail more gracefully than this.

Sys::Mmap offers a tied interface as compensation, but this didn't work out. The tied interface indeed is safe, but it creates another problem.

Every time it is read, it copies the whole file into the variable. Every time the variable is modified, it writes the whole new value to the file, even if the change only affects a single byte.

Ouch!

Obviously, that doesn't scale at all. One user of the module reported a 10-20 times slowdown of his program after converting to ties. That's not a workable solution.
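
The cost is easy to see with a toy tied scalar. This is a generic sketch of how tie behaves, not Sys::Mmap's actual implementation; the CountingScalar class and its counters are hypothetical.

```perl
use strict;
use warnings;

package CountingScalar;   # hypothetical class, for illustration only

sub TIESCALAR {
    my ($class, $value) = @_;
    return bless { value => $value, fetches => 0, stores => 0, last_store => 0 }, $class;
}

# Called on every read: the complete value is handed back each time.
sub FETCH {
    my ($self) = @_;
    $self->{fetches}++;
    return $self->{value};
}

# Called on every write: the complete new value arrives,
# however small the change was.
sub STORE {
    my ($self, $new) = @_;
    $self->{stores}++;
    $self->{last_store} = length $new;
    $self->{value}      = $new;
    return;
}

package main;

tie my $var, 'CountingScalar', 'x' x 1024;
substr($var, 0, 1, 'y');            # change a single byte
my $stats = tied $var;

printf "fetches: %d, stores: %d, bytes in last store: %d\n",
    @{$stats}{qw(fetches stores last_store)};
```

Changing one byte still moves the whole 1024-byte value through FETCH and STORE, which is the same pattern that hurts when the value is an entire file.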

The solution

Perl has a powerful but rarely used feature called magic. (Its rare use by module authors is indicated by the fact that the prototypes of the magic virtual table as documented in perlguts aren't even complete: they lack pTHX_'s.) It is used by the perl core to implement magic variables such as $! and ties (surprise, surprise). It offers 8 hooks into different stages of handling a variable, the three most important being fetching (svt_get), storing (svt_set) and destruction (svt_free).

In my case, I didn't need get magic, but I did use set and free magic. Freeing the variable is not that interesting (simply unmapping the variable), but setting it is. This function is called just after every write to the variable:

static int mmap_write(pTHX_ SV* var, MAGIC* magic) {
    struct mmap_info* info = (struct mmap_info*) magic->mg_ptr;
    if (SvPVX(var) != info->address) {
        if (ckWARN(WARN_SUBSTR))
            warn("Writing directly to a memory mapped file is not recommended");
        Copy(SvPVX(var), info->address, MIN(SvLEN(var) - 1, info->length), char);
        SvPV_free(var);
        reset_var(var, info);
    }
    return 0;
}

It checks whether the variable is still linked to the map. If it isn't, it copies the new value into the map, frees the old value and restores the link. As copying is potentially expensive, it will issue a warning if warnings (or actually, 'substr' warnings) are in effect.

There is no perfect solution to this problem, but getting a friendly warning is undeniably better than getting a segmentation fault or data loss.

Anyway, you can find Sys::Mmap::Simple here. It offers more goodies, such as portability to Windows, a greatly simplified interface, and built-in thread synchronization.

Elegance in minimalism

2009-01-29T17:25:00.009+02:00

Some time ago, I read this journal entry by Aristotle. I liked it and suspected it could be easily implemented in XS. It turned out to be the most elegant piece of XS I've ever written.

void
induce(block, var)
    SV* block;
    SV* var;
    PROTOTYPE: &$
    PPCODE:
        SAVESPTR(DEFSV);
        DEFSV = sv_mortalcopy(var);
        while (SvOK(DEFSV)) {
            PUSHMARK(SP);
            call_sv(block, G_ARRAY);
            SPAGAIN;
        }

I assume most readers don't know much C, let alone the perl API or XS, so I'll explain it piece by piece.

void induce(block, var) SV* block; SV* var; PROTOTYPE: &$
This declares the xsub. It has two parameters, both scalar values. The function has the prototype &$. So far, few surprises.

PPCODE:
This declares that a piece of code follows. Unlike CODE blocks, PPCODE blocks pop the arguments off the stack at the start. This turns out to be important later on.

SAVESPTR(DEFSV); DEFSV = sv_mortalcopy(var);
These lines localize $_ and initialize it to var.

while (SvOK(DEFSV)) {
This line is equivalent to while (defined($_)).

Now comes the interesting part:
PUSHMARK(SP); call_sv(block, G_ARRAY); SPAGAIN;
To understand what this block does, you have to know how perl works inside.

If you've read perlxs, you may notice this function does not push any values on the stack. A careless reader might conclude that this function doesn't return anything: they couldn't be more wrong!

If you've read perlcall, you would notice a lot more is missing. For starters, the function calls SPAGAIN (pretty much equivalent to saying I accept the values you return to me), but it doesn't do anything with them.
Also, you may notice that both ENTER/LEAVE (needed to delocalize $_) and SAVETMPS/FREETMPS (needed to get rid of temporary values) are missing. The function that calls the xsub automatically surrounds it by an ENTER/LEAVE pair, so that one isn't necessary. The lack of SAVETMPS/FREETMPS however is not only deliberate but also essential.

The loop calls the block without arguments (PUSHMARK & call_sv). The xsub accepts the return values on the stack and leaves them there. This sequence is repeated. This way it assembles the induced values on the stack. PPCODE removing the arguments at the start prevents it from returning those as the first two return values. It also adds a trailer that causes all elements that have been pushed on the stack to be recognized as return values of this function. That's why a SAVETMPS/FREETMPS pair would break this code: the values must live on after the code returns.

That's the elegance of this function. It doesn't even touch its return values, it delegates everything to the block. All the things that are missing make it do exactly what it should do.
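
For readers who want to see the behaviour without the XS, here is a rough pure-Perl equivalent. It is my own sketch of the semantics described above (call the block with $_ set until $_ becomes undefined, collect everything it returns), not the code from the original journal entry, and it collects results in an array instead of leaving them on the stack.

```perl
use strict;
use warnings;

sub induce (&$) {
    my ($block, $seed) = @_;
    local $_ = $seed;                    # localize $_ and start from the seed
    my @results;
    # Call the block in list context for as long as $_ stays defined;
    # the block is responsible for advancing (or undefining) $_.
    push @results, $block->() while defined $_;
    return @results;
}

# Example: keep halving a number until it reaches 1.
my @halves = induce {
    my $n = $_;
    $_ = $n > 1 ? int($n / 2) : undef;
    $n;
} 20;

print "@halves\n";   # prints "20 10 5 2 1"
```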

Where java stopped

2008-03-04T23:45:00.007+02:00

Yesterday I explained my problems with garbage collection. I don't think garbage collection is bad or anything, I just think it isn't being used properly. GC was invented in the late 50's for LISP, one of the first high level programming languages. Lambda calculus required that memory be managed by the system. Having freed the programmer from memory management, Lisp and its brethren enabled the development of true high level features such as dynamic typing, higher-order functions, closures, macros and continuations. These are exactly the features that give those languages their incredible power.

That is what doesn't make sense in Java and derivatives: those really powerful features are missing in Java. Java is mostly lacking the features that absolutely require a GC. Guy Steele once said "We were after the C++ programmers. We managed to drag a lot of them about halfway to Lisp". Considering the conservativeness of the industry it's understandable they stopped halfway (the step from C to C++ was even smaller). That kind of makes Java a middle level language; a watery compromise that fails to offer the best of both worlds.

Both high and low level languages have their niches, but what about the middle level languages? My intuition tells me their proper niche should be way smaller. It's hard to say what will be Java's successor, but I'm pretty certain of two things. It will not be a Java derivative and it will take the second half step, if not more.

Garbage collection revisited

2008-03-03T22:36:00.002+02:00

Garbage collection must be one of the most misunderstood features of programming languages. GC has existed for 50 years, yet a lot of languages have not adopted it. One of the most commonly cited advantages of Java and similar languages over languages such as C++ is garbage collection. Having used a number of Java applications, I'd dare to say most have memory problems though. They feel extremely bloated. I've seen a few programs improve drastically when some expert started to optimize the memory usage. Most Java programs can be lean enough to be usable, but it will take a lot of effort. I imagine it is a major disillusionment for a lot of Java programmers to find out they haven't found a memory panacea after all.

One could see all variables and their resources as a directed graph. In the most elementary programs this graph will be a tree and in such cases manual memory management is trivial. Real programs aren't this simple. Having said that, most programs are not random or unpredictable. Usually substantial pieces of the graph are trees on their own. Acyclic graphs and even most cyclic ones can be solved using reference counting and similar techniques. If you understand the resource usage pattern of your program, you can usually solve it without resorting to garbage collection. However this often requires planning and thinking ahead of time. A GC on the other hand is able to manage any graph, mostly without help from the programmer, but does not guarantee to do so efficiently. So the question "when do you need a GC?" reduces to "when can't I know the pattern?" To date, I've only come across one concrete and common example that really cannot be solved semi-manually: programming languages themselves. The reason for this is obvious: it is inherent to them that you don't know ahead of time how they will be used.
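
Perl itself is a reference-counted language, so it makes a handy illustration of a cyclic graph handled semi-manually. The sketch below uses a weak reference for the child-to-parent edge; the Node class and the counter are hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use Scalar::Util ();

my $destroyed = 0;   # counts how many nodes have been freed

package Node;        # hypothetical tree node with parent links

sub new {
    my ($class, $name) = @_;
    return bless { name => $name, children => [] }, $class;
}

sub add_child {
    my ($self, $child) = @_;
    push @{ $self->{children} }, $child;   # strong edge: parent owns child
    $child->{parent} = $self;              # back edge would create a cycle...
    Scalar::Util::weaken($child->{parent}); # ...so it does not count as a reference
    return $child;
}

sub DESTROY { $destroyed++ }

package main;

{
    my $root = Node->new('root');
    $root->add_child(Node->new('leaf'));
}   # $root goes out of scope here; both nodes are freed immediately

print "nodes freed: $destroyed\n";
```

Knowing that the graph is "a tree plus back pointers" is exactly the kind of usage pattern that makes a tracing collector unnecessary here.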

It is undisputed that GCs make it easier to write programs, but this issue makes me wonder if it also is easier to write good programs. Java programmers are happy to think they don't have to think about memory, but it turns out they will have to think ahead (though less than in C or C++) if they want decent performance. On the server side that doesn't really matter that much, but in client-side programs it does.

In the end, it is a matter of trade-offs. Both approaches can solve most problems. I don't believe a GC by itself gives us better programs than RAII management. Having said that, garbage collection makes business sense in a LOT of situations. Companies don't earn money by building the best program they could, they earn money by selling a finished program, good or not.

Operators and types

2007-09-28T14:31:00.003+02:00

Ruby and Python overload the + operator for a large number of things, the most common ones being addition of numbers, concatenation of strings, and concatenation of tuples. Very different things are represented by the same syntax. In Perl these three roles are occupied by three different operators (+, . and ,). For that matter, almost all operators on numbers are separate from operators on strings in Perl. This is behind one of the most common misconceptions among non-natives about Perl's type system, namely that it is weakly typed. It is not. This piece of code:

my $foo = "1";
return $foo + 0;

does not cause an implicit type conversion, nor is the last statement in any way ambiguous. The addition operator causes an explicit conversion of its argument. It's not an implicit one for a simple reason: it's the Perl idiom for converting any variable to a number.
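
A short sketch of the same point with a few more operators: the operator, not the operand, decides whether a value is treated as a number or as a string. The variable names are arbitrary.

```perl
use strict;
use warnings;

my ($x, $y) = ("10", "9");

my $sum    = $x + $y;    # numeric addition: 19
my $joined = $x . $y;    # string concatenation: "109"
my @pair   = ($x, $y);   # list construction: two elements

my $numeric_less = $x <  $y ? 1 : 0;   # 10 < 9 is false: 0
my $string_less  = $x lt $y ? 1 : 0;   # "10" sorts before "9": 1

print "sum=$sum joined=$joined pair=", scalar(@pair),
      " numeric_less=$numeric_less string_less=$string_less\n";
```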

I think this is an excellent example of the waterbed theory of complexity. To reduce the number of operators in the language Python and Ruby use runtime polymorphism on data whose behavior is already known to the programmer (not to the compiler) at compile time. I cannot think of real-world code where you don't know if your variable is a number, a string or a tuple but want to do addition/concatenation nonetheless. It is trading semantic clarity for syntactic clarity. It's a valid choice, just as Perl's choice of separating them. I think a lot of rubyists and pythonistas fail to see that their choice has its disadvantages too.

Control structures in Perl

2007-09-17T19:40:00.002+02:00

As I said in my previous entry, there is a need for education in social coding skills in Perl. Therefore I'll put my money where my mouth is ;-)

Basically there are four patterns for branching in Perl.

Conditional statements: if(condition) { true_action } else { false_action }
Statement modifiers: action if condition;
Logical operators: condition and action
Ternary operator: condition ? true_action : false_action

How to decide which one to use? That depends on the situation of course. Only conditional statements and the ternary operator can provide an else clause. If you need one, your choices are already limited. All four put their emphasis differently. For example print $line unless $line =~ /^#/; may be better than $line =~ /^#/ or print $line; because of the principle of prominence. This is a linguistic notion that tells us that people tend to prefer important things to be in front and details to be at the end, so they can skip over the details when scanning the code. When skipping the second part, the code still makes sense (even though it may not be correct anymore). In general statement modifiers are best used when the action is much more important than the condition.

Logical operators are useful in two situations. The first one is exactly the opposite of the statement modifier: when the condition is vastly more important than the action. This usage typically uses the low precedence version. For example: open my $filehandle, $filename or die "Can't open $filename: $!"; is better than die "Can't open $filename: $!" unless open my $filehandle, $filename; because the first one communicates the intent of the programmer (opening a file) better. Error handling is not important for understanding the big picture of the code.

It has another trait that is important for deciding when to use this pattern. Unlike the previous two patterns, logical operators are expressions, not statements. As such they can be used in places where the former cannot. parse($filename || "default.filename") is significantly easier to read than if(!$filename) { $filename = "default.filename"; } parse($filename);

Similarly my $id; if($input >= 0) { $id = $input; } else { $id = 1; } can be simplified using the ternary operator to my $id = $input >= 0 ? $input : 1;

You may wonder now, when should I use old fashioned conditional statements then? First of all, if the action contains multiple statements and isn't suitable for putting in a function. It puts an equal emphasis on the condition and the action. It makes sense to use this pattern if you don't have a reason to do otherwise.

Summary
	Conditional statements	Statement modifiers	Logical operators	Ternary operator
Emphasis	None	Action	Condition	Condition
Expression	No	No	Yes	Yes
Else clause	Yes	No	No	Yes
Nestable/Chainable	Yes, very well	No	Yes	Yes
Multiple statements	Yes	No	Yes	No
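
The four patterns side by side, as a small runnable sketch. The function names are invented for this example; each one uses the pattern that the table above suggests for its situation.

```perl
use strict;
use warnings;

# Statement modifier: the action (keeping the line) is what matters.
sub code_lines {
    my @kept;
    for my $line (@_) {
        push @kept, $line unless $line =~ /^#/;
    }
    return @kept;
}

# Logical operator used as an expression: supply a default in place.
sub config_name {
    my ($filename) = @_;
    return $filename || 'default.filename';
}

# Ternary operator: an expression that needs an else branch.
sub normalize_id {
    my ($input) = @_;
    return $input >= 0 ? $input : 1;
}

# Conditional statement: several statements per branch.
sub describe {
    my ($input) = @_;
    my @notes;
    if ($input >= 0) {
        push @notes, 'valid';
        push @notes, "id $input";
    }
    else {
        push @notes, 'invalid';
        push @notes, 'falling back to id 1';
    }
    return join ', ', @notes;
}

print join(' | ', code_lines('# comment', 'print 1;')), "\n";
print config_name(undef), "\n";
print normalize_id(-3), "\n";
print describe(5), "\n";
```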

When Perl is beautiful

2007-05-21T01:03:00.000+02:00

Programming is hard. Writing maintainable programs is even harder. This is because code tends to be easier to write than to read. Writing easily readable code is almost as hard as reading it.

When writing code, one has to take two kinds of readers into consideration: computers and humans. Writing a correct program is only a matter of making your intentions clear to the computer. Writing a maintainable program however requires making it readable for your fellow humans (including yourself). For any program you're not going to throw away really soon, the latter is just as important as the former. Programs must be written for people to read, and only incidentally for machines to execute.¹

The syntaxes of Perl and Python are quite different from each other (even though under the hood they are much more similar than many zealots would like to admit). This difference stems mostly from a difference in how each tries to make code more accessible for humans.

Python has a philosophy of readability that is minimalist. There should be one-- and preferably only one --obvious way to do it², don't ask for another one. Python tries to make the programmer read exactly what the compiler reads.

Perl on the other hand tries something very different. Perl's creator (Larry Wall) was not only educated as a computer scientist, but also as a linguist. As such, Perl behaves more like a natural language than pretty much any other programming language out there. Where some languages such as COBOL have tried to do this by abusing half of the English dictionary, Perl does so by having a 'natural' structure. Thus it tries to fit in with the way humans naturally think. This structure, with its plenitude of operators and other syntax, gives rise to the Perl motto: There Is More Than One Way To Do It (or TIMTOWTDI).

Perl has a lot more syntax than Python, but they have more or less the same functionality. This provides Perl with a lot more bandwidth than Python has to talk to the human reader. This bandwidth comes at a price: programmers who don't know how to make use of this bandwidth will emit line noise without realising it. This phenomenon has given Perl a bad name in much of the programming world.

When people learn to program they learn to talk to the computer, but sadly most books, courses and websites forget to teach the novice programmer how to talk to other humans (Damian Conway's Perl Best Practices is the welcome exception). Perl is more affected by this lack of what I call social coding skills than other programming languages because of its design.

Good Perl code is a thing of beauty and beautiful Perl code is almost always good code. For Perl to lose its bad reputation, novice programmers need to learn how to communicate on the human channel.

1. Structure and Interpretation of Computer Programs - Abelson & Sussman
2. PEP 20: The Zen of Python - Tim Peters