Sunday morning I was asked by an IRC regular: "Where does the engine parse quoted strings?". Being a Sunday morning, I began to launch into a sermon on the distinction between CONSTANT_ENCAPSED_STRING and the problems which befall a single-pass compiler when you start to introduce interpolation. Not what he asked precisely, but an important component in answering his question. Unfortunately, at the time I was busy watching the Brasil-Australia game so I didn't go into the kind of detail I would have. Now, some 12 hours later, since Angela is off buying toe-socks in Santa Cruz, I'll bore anyone with little enough life to read my blog by explaining the pitfalls of using PHP's string interpolation without using an optimizer.
To start things off, let's take a page from my earlier discourse on Compiled Variables and look at the opcodes generated by a few simple PHP scripts:
<?php
echo "This is a constant string";
?>
Yields the nice, simple opcode:
ECHO 'This is a constant string'
No problem... Exactly what you'd expect... Now let's complicate the expressions a little:
<?php
echo "This is an interpolated $string";
?>
Yields the surprisingly messy instruction set:
INIT_STRING ~0
ADD_STRING ~0 ~0 'This'
ADD_STRING ~0 ~0 ' '
ADD_STRING ~0 ~0 'is'
ADD_STRING ~0 ~0 ' '
ADD_STRING ~0 ~0 'an'
ADD_STRING ~0 ~0 ' '
ADD_STRING ~0 ~0 'interpolated'
ADD_STRING ~0 ~0 ' '
ADD_VAR ~0 ~0 !0
ECHO ~0
Where !0 represents the compiled variable named $string. Looking at these opcodes: INIT_STRING allocates an IS_STRING variable of one byte (to hold the terminating NULL). Then it's realloc'd to five bytes by the first ADD_STRING ('This' plus the terminating NULL). Next it's realloc'd to six bytes in order to add a space, then again to eight bytes for 'is', then nine to add a space, and so on until the temporary string has the contents of the interpolated variable copied into its contents before being used by the echo statement and finally discarded. Now let's rewrite that line to avoid interpolation and use concatenation instead:
<?php
echo "This is a concatenated " . $string;
?>
Which yields the significantly shorter and simpler set of ops:
CONCAT ~0 'This is a concatenated ' !0
ECHO ~0
A vast improvement already, but this version still creates a temporary IS_STRING variable to hold the combined string contents, meaning that the data is duplicated even though it's only being used in a const context anyway. Now let's try out this oft-overlooked use of the echo statement:
<?php
echo "This is a stacked echo " , $string;
?>
Look closely, there is a meaningful difference from the last one. This time we're using a comma rather than a dot between the operands. If you don't know what the comma is doing there, ask the manual then check back here. Here are the resulting opcodes:
ECHO 'This is a stacked echo '
ECHO !0
Same number of opcodes, but this time no temporary variables are being created so there's no duplication and no pointless copying (unless of course $string wasn't of type IS_STRING, in which case it does have to be converted for output, but don't get picky now). Think this is bad? Consider the average heredoc string which spans several lines of prepared output embedding perhaps a handful of variables along the way. Here's one of several such blocks found in run-tests.php within the PHP distribution source tree:
<?php
echo <<<NO_PCRE_ERROR
+-----------------------------------------------------------+
| ! ERROR ! |
| The test-suite requires that you have pcre extension |
| enabled. To enable this extension either compile your PHP |
| with --with-pcre-regex or if you've compiled pcre as a |
| shared module load it via php.ini. |
+-----------------------------------------------------------+
NO_PCRE_ERROR;
?>
Notice that we're not even embedding variables to be interpolated here, yet does this come out to a simple, single opcode? Nope, because the rules necessary to catch a heredoc's end token demand the same careful examination as double-quoted variable substitution and you wind up (in this case) with SEVENTY-EIGHT opcodes! One INIT_STRING, 76 ADD_STRINGs, and a final ECHO. That means a malloc, 76 reallocs, and a free which will be executed every time that code snippet comes along. Even the original contents take up more memory because they're stored in 76 distinct zval/IS_STRING structures.
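The malloc/realloc pattern is easier to see on a small case, so here is a rough sketch that goes back to the earlier "This is an interpolated $string" example and traces the buffer size after INIT_STRING and after each ADD_STRING. This is only a simulation written in PHP, not engine code: the helper name trace_buffer_sizes() is made up for illustration, and it assumes (as described earlier) that every step resizes the buffer to exactly the new length plus one terminating byte.

```php
<?php
// Hypothetical helper, not part of the engine: simulates the buffer size
// after INIT_STRING and after each ADD_STRING, assuming one extra byte
// is always reserved for the terminator.
function trace_buffer_sizes(array $fragments)
{
    $length = 0;
    $sizes = array($length + 1); // INIT_STRING: just the terminator
    foreach ($fragments as $fragment) {
        $length += strlen($fragment);
        $sizes[] = $length + 1;  // each ADD_STRING grows to the new size
    }
    return $sizes;
}

// The literal fragments from the ADD_STRING listing shown earlier.
$fragments = array('This', ' ', 'is', ' ', 'an', ' ', 'interpolated', ' ');
$sizes = trace_buffer_sizes($fragments);

echo implode(', ', $sizes), "\n"; // prints: 1, 5, 6, 8, 9, 11, 12, 24, 25
```

Eight resizes for a 24-character literal, before the variable is even appended. Scale that up to the 76 fragments of the heredoc and the cost of the piecemeal approach becomes obvious.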
Why does this happen? Because there are about a dozen ways that a variable can be hidden inside an interpolated string. Similarly, when looking for a heredoc end-token, the token can be an arbitrary length, containing any of the label characters, and may or may not sit on a line by itself. Put simply, it's too difficult to encompass in one regular expression.
The engine could perform a second pass during compilation; however, the time spent reassembling these strings would typically be about the same as the time saved by not processing them piecemeal at runtime (if one assumes that each instance will execute exactly once). Rather than complicate the build process (potentially slowing down overall run-times in the process), the compiler leaves this optimization step to opcode caches, which can gain far more by cleaning up this mess once, then caching the results and reusing the faster, leaner versions on all subsequent runs.
If you're using APC, you'll find just such an optimizer built in, but not enabled by default. To turn it on, you'll need to set apc.optimization=on in your php.ini. In addition to stitching these run-on opcodes back together, it'll also add run-time speed-ups like pre-resolving persistent constants to their actual values, folding static scalar expressions (like 1 + 1) to their fixed results (e.g. 2), and simpler stuff like avoiding the use of JMP when the target is the next opcode, or boolean casts when the original expression is known to be a boolean value. (It should be noted that these speed-ups also break some of the runtime-manipulation features of runkit, but that was stuff you....probably shouldn't have been doing anyway.)
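If you're not sure whether the setting actually took effect, a quick check from a script can help. This is a minimal sketch that assumes the directive is named apc.optimization as above; whether your particular APC build still registers that directive is something to verify against its own documentation. The $status variable is just an illustrative name.

```php
<?php
// ini_get() returns false when a directive is not registered at all,
// for example when the APC extension is not loaded.
$setting = ini_get('apc.optimization');

if (!extension_loaded('apc')) {
    $status = 'APC is not loaded';
} elseif ($setting === false) {
    // Assumption: this build of APC does not expose the directive.
    $status = 'this APC build does not register apc.optimization';
} else {
    $status = 'apc.optimization = ' . $setting;
}

echo $status, "\n";
```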
Can't use an optimizer because your webhost doesn't know how to set php.ini options? You can still avoid 90% of the INIT_STRING/ADD_STRING dilemma by simply using single quotes and concatenation (or commas when dealing with echo statements). It's a simple trick and one which shouldn't harm maintainability too much, but on a large, complicated script, you just might see an extra request or two per second.
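For anyone nervous about making that change by hand, here's a small sketch showing that the three styles emit exactly the same bytes, so swapping one for another is purely a performance decision. It can't show the opcodes themselves (those depend on the engine version you run), and the helper and function names below are hypothetical, invented just for this comparison. Output buffering is used only to capture what each version echoes.

```php
<?php
// Hypothetical helper: runs a named function and returns whatever it echoed.
function capture_output($callback)
{
    ob_start();
    call_user_func($callback);
    return ob_get_clean();
}

function echo_interpolated()
{
    $string = 'value';
    echo "Output: $string";     // double quotes with interpolation
}

function echo_concatenated()
{
    $string = 'value';
    echo 'Output: ' . $string;  // single quotes plus concatenation
}

function echo_stacked()
{
    $string = 'value';
    echo 'Output: ', $string;   // single quotes plus a comma
}

$interpolated = capture_output('echo_interpolated');
$concatenated = capture_output('echo_concatenated');
$stacked      = capture_output('echo_stacked');

// All three should be byte-for-byte identical.
var_dump($interpolated === $concatenated && $concatenated === $stacked);
```

One thing to watch when converting by hand: a single-quoted string needs its apostrophes escaped (you've becomes you\'ve), and escapes like \n are not expanded inside single quotes.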