Jan 19, 2008

Understanding Opcodes

A blog reader (I have readers???) recently shared his wishlist, "I'm trying to figure out how to show the opcodes like you have in your post...". I promised that I'd throw something together, so here it is:


Slow down, wtf is an "Opcode"?

Short answer: It's the compiled form of a PHP script, similar in principle to Java bytecode or .NET's MSIL. For example, say you've got the following bit of PHP script:

<?php
echo "Hello World";
$a = 1 + 1;
echo $a;

PHP (and its actual compiler/executor component, the Zend Engine) is going to go through a multi-stage process:

  1. Scanning (a.k.a. Lexing) - The human readable source code is turned into tokens.
  2. Parsing - Groups of tokens are collected into simple, meaningful expressions.
  3. Compilation - Expressions are translated into instructions (opcodes).
  4. Execution - Opcode stacks are processed (one opcode at a time) to perform the scripted tasks.
Side note: Opcode caches (like APC) let the engine perform the first three of these steps, then store that compiled form so that the next time a given script is used, the engine can use the stored version without having to redo those steps only to come to the same result.

Er... okay... can you elaborate a little? What's lexing? I thought Superman put him in jail...

That's Lex Luthor you nit-wit! The most expedient way to explain lexing is by example. Take a look at the manual page for token_get_all(); this gem is actually a wrapper around the Zend Engine's own language scanner. Play around with it a bit, and you'll notice that plugging the short script above into it will produce:

Array
(
[0] => Array
(
[0] => 367
[1] => <?php
)
[1] => Array
(
[0] => 316
[1] => echo
)
[2] => Array
(
[0] => 370
[1] =>
)
[3] => Array
(
[0] => 315
[1] => "Hello World"
)
[4] => ;
[5] => Array
(
[0] => 370
[1] =>
)
[6] => =
[7] => Array
(
[0] => 370
[1] =>
)
[8] => Array
(
[0] => 305
[1] => 1
)
[9] => Array
(
[0] => 370
[1] =>
)
[10] => +
[11] => Array
(
[0] => 370
[1] =>
)
[12] => Array
(
[0] => 305
[1] => 1
)
[13] => ;
[14] => Array
(
[0] => 370
[1] =>
)
[15] => Array
(
[0] => 316
[1] => echo
)
[16] => Array
(
[0] => 370
[1] =>
)
[17] => ;
)

In the array returned by token_get_all(), you have two types of tokens. Single non-label characters are returned as just that: the character that was found in the source file at that point. Everything else, from labels to language constructs to multi-character operators (like >>, +=, etc.), is returned as an array containing two elements: the token ID (which corresponds to one of the T_* constants, e.g. T_ECHO, T_STRING, T_VARIABLE, etc.) and the actual text which that token came from. What the engine actually gets is slightly more detailed than what you see in the output from token_get_all(), but not by much...
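If you'd rather see names than raw numbers, here's a minimal sketch that runs the example script through token_get_all() and labels each token with token_name(). The helper name describe_tokens() is made up for this illustration. It's worth doing it this way because the numeric token IDs are not stable between PHP versions, so the numbers you get may not match the dump above.

```php
<?php
// describe_tokens() is a hypothetical helper for this post, not part of PHP.
// It turns the token_get_all() result into readable "NAME 'text'" lines.
function describe_tokens($source)
{
    $lines = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array token: [0] is the token ID, [1] is the source text.
            $lines[] = token_name($token[0]) . ' ' . var_export($token[1], true);
        } else {
            // Single-character token: returned as a plain string.
            $lines[] = 'CHAR ' . var_export($token, true);
        }
    }
    return $lines;
}

// The example script from the top of this post.
$source = "<?php\necho \"Hello World\";\n\$a = 1 + 1;\necho \$a;\n";
echo implode("\n", describe_tokens($source)), "\n";
```

Each output line is either a T_* name followed by the matched text, or CHAR followed by the lone character.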

Okay, tokenization just breaks the script into bite-size pieces, how does parsing work then?

The first thing the parser does is throw away all whitespace (unlike some other P* language...). From the reduced set of tokens, the engine looks for irreducible expressions. How many expressions do you see in the example above? Did you say three? WRONG. There are three statements, but one of those statements is made of two distinct expressions. In the case of $a = 1 + 1; the first expression is the addition, followed by the assignment to the variable as a second, distinct expression. All together our expression list is:

  1. echo a constant string
  2. add two numbers together
  3. store the result of the prior expression to a variable
  4. echo a variable
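The whitespace-dropping step that comes before this expression list can be approximated in userland. This is only a sketch of the idea, not how the engine does it internally, and significant_tokens() is a name invented for the example.

```php
<?php
// significant_tokens() is a hypothetical helper for this post, not part of PHP.
// It returns the text of every token except T_WHITESPACE.
function significant_tokens($source)
{
    $kept = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && $token[0] === T_WHITESPACE) {
            continue; // the parser never cares about these
        }
        // Array tokens carry their text in [1]; plain strings are the text.
        $kept[] = is_array($token) ? $token[1] : $token;
    }
    return $kept;
}

print_r(significant_tokens("<?php\n\$a = 1 + 1;\n"));
```

What's left for that one statement is the open tag, $a, =, 1, +, 1 and the semicolon, which is the raw material for the "add" and "assign" expressions in the list.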

Hey! That's starting to sound familiar! Did I see that kind of description before?

Oh, you must mean my post about strings (plug). That's correct, because these expressions are exactly the pieces which go into making up oplines! Given the expression list we've just reached, the resulting opcodes look something like:

  • ZEND_ECHO 'Hello World'
  • ZEND_ADD ~0 1 1
  • ZEND_ASSIGN !0 ~0
  • ZEND_ECHO !0

What happened to $a? What's the difference between ~0 and !0?

Short answer: !0 is $a


So here's the deal.... oplines have five principal parts:

  • Opcode - Numeric identifier which distinguishes what the opline will do. This is what corresponds to ZEND_ECHO, ZEND_ADD, etc...
  • Result Node - Most opcodes perform "non-terminal" actions. That is; after executing there's some result which can be consumed as an input to a later opline. The result node identifies what temporary location to place the result of the operation in.
  • Op1 Node - One of two inputs to the given opcode. An input may be a constant zval, a reference to a previous result node, a simple variable (CV), or in some cases a "special" data element, such as a class definition. Note that an opcode may use both, one, or neither input node. (Some even use more, see ZEND_OP_DATA)
  • Op2 Node - Ditto
  • Extended Value - Simple integer value used to differentiate specific behaviors of an overloaded opcode.

So obviously the nodes are the most complicated parts of an opline. Here are the important parts of what they look like:

  • op_type - One of IS_CONST, IS_TMP_VAR, IS_VAR, IS_UNUSED, or IS_CV
  • u - A union of the following elements (the one which is used depends on the value of op_type):
    • constant (IS_CONST) - zval value. This node results when you include a literal value in your script, such as the 'Hello World' or 1 values in the example above.
    • var (IS_VAR or IS_TMP_VAR or IS_CV) - Integer value corresponding to a temporary slot in a lookup table used by the engine.

Now let's look at the difference between those optypes, particularly with respect to u.var:

  • IS_TMP_VAR - These ephemeral values are strictly for use by non-assignment non-terminal expressions. They don't support any refcounting because they're guaranteed not to be shared by any other variable. These are denoted in the examples I use on this site (and in VLD output) by a tilde (~).
  • IS_VAR - Usually the result of a ZEND_FETCH(_DIM|_OBJ)?_(R|W|RW), or one of the assignment opcodes (which are technically non-terminal expressions since they can be used as inputs to other expressions). Since these are tied to real variables, they have to respect reference counting and are passed about at an extra degree of indirection. They're stored in the same table though. These are denoted by a dollar sign ($).
  • IS_CV - "CV" stands for "Compiled Variables". These are basically cached hash lookups for fetching simple variables from the local symbol table. Once a variable is actually looked up at runtime, it's stored at an extra level of indirection in an even faster lookup table using an index into a vector. That's what the number in this node denotes. These types of nodes are distinguished by a bang (!).
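To make the node types a little more concrete, here's a toy model of the four oplines from the Hello World example, written in PHP itself. Everything in it (the TOY_* constants, toy_node(), toy_read(), toy_execute()) is invented for illustration. The real engine does this in C with structs and unions, handles IS_VAR and refcounting, and looks nothing like this internally. Treat it as a sketch of the idea only.

```php
<?php
// Toy model only: these names are hypothetical and do not exist in the engine.
define('TOY_CONST', 'CONST');     // literal value stored in the node itself
define('TOY_TMP_VAR', 'TMP_VAR'); // ~N: slot in the temporaries table
define('TOY_CV', 'CV');           // !N: slot in the compiled-variable table

// A node is an op_type plus the "u" payload (a value or a slot number).
function toy_node($type, $u)
{
    return array('op_type' => $type, 'u' => $u);
}

// Resolve an input node to a value, depending on its op_type.
function toy_read($node, $temps, $cvs)
{
    switch ($node['op_type']) {
        case TOY_CONST:   return $node['u'];
        case TOY_TMP_VAR: return $temps[$node['u']];
        case TOY_CV:      return $cvs[$node['u']];
    }
    throw new Exception('unknown op_type');
}

// Walk the op_array one opline at a time and collect the echoed output.
function toy_execute($op_array)
{
    $temps = array();
    $cvs = array();
    $output = '';
    foreach ($op_array as $opline) {
        switch ($opline['opcode']) {
            case 'ZEND_ECHO':
                $output .= toy_read($opline['op1'], $temps, $cvs);
                break;
            case 'ZEND_ADD':
                $temps[$opline['result']['u']] =
                    toy_read($opline['op1'], $temps, $cvs)
                    + toy_read($opline['op2'], $temps, $cvs);
                break;
            case 'ZEND_ASSIGN':
                $cvs[$opline['op1']['u']] = toy_read($opline['op2'], $temps, $cvs);
                break;
            default:
                throw new Exception('unknown opcode');
        }
    }
    return $output;
}

$op_array = array(
    array('opcode' => 'ZEND_ECHO', 'op1' => toy_node(TOY_CONST, 'Hello World')),
    array('opcode' => 'ZEND_ADD', 'result' => toy_node(TOY_TMP_VAR, 0),
          'op1' => toy_node(TOY_CONST, 1), 'op2' => toy_node(TOY_CONST, 1)),
    array('opcode' => 'ZEND_ASSIGN', 'op1' => toy_node(TOY_CV, 0),
          'op2' => toy_node(TOY_TMP_VAR, 0)),
    array('opcode' => 'ZEND_ECHO', 'op1' => toy_node(TOY_CV, 0)),
);
echo toy_execute($op_array), "\n";
```

Running it prints Hello World2, same as the original script: the ADD parks its result in ~0, the ASSIGN copies ~0 into !0 (our stand-in for $a), and the last ECHO reads !0 back out.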

Boggle... You...so lost me there...

Yeah, that explanation sort of got away from me didn't it? What can I clear up?


All I really want to know is how to translate some source code into an opcode..list...thingy...

Heh, okay... first off, that "opcode list thingy" is called an op_array, and you can generate those really easily using one of two PECL packages. You can use my parsekit package, which is useful for programmatic analysis of script compilation, but frankly... it's not what you're looking for and there's not much call for scripts analyzing other scripts anyway. I recommend Derick's VLD (Vulcan Logic Disassembler) which is what'll actually generate the kinds of opcode lists you'll see me use in blog posts.


Once you've got it installed (it installs like any other PECL extension), you can run it with a command like the following:

php -d vld.active=1 -d vld.execute=0 -f yourscript.php

Then sit back and watch the opcodes fly! Important note: Using -r with command line code may not work due to a quirk of the way the engine parses files in older versions of PHP (and with older versions of VLD). Be sure to put your script on disk and reference it using -f if -r doesn't work for you.
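If you find yourself doing the "put it on disk and use -f" dance a lot, something like the following sketch might save some typing. vld_command_for() is a hypothetical helper, not part of VLD or PHP. It only writes the snippet to a temp file and builds the same command line shown above; it assumes VLD is installed and that php is on your path, and it leaves running the command (and cleaning up the temp file) to you.

```php
<?php
// vld_command_for() is a hypothetical helper for this post.
// It saves a snippet to a temporary file and returns array(path, command).
function vld_command_for($code)
{
    $path = tempnam(sys_get_temp_dir(), 'vld');
    file_put_contents($path, $code);
    // Same flags as the command above; -f avoids the -r quirk.
    $command = 'php -d vld.active=1 -d vld.execute=0 -f ' . escapeshellarg($path);
    return array($path, $command);
}

list($path, $command) = vld_command_for("<?php\necho \"Hello World\";\n");
echo $command, "\n";
```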


Holy schnikies! That's a lot of opcodes! How can I tell what they all do?

Take a look at Zend/zend_vm_def.h in your PHP source tree. In here you'll find a meta-definition of every single opcode used by the engine. Side note: It's used as a source for zend_vm_gen.php which generates the actual code file zend_vm_execute.h. How's that for chicken and egg? Every version of PHP since 5.1.0 has required PHP be already built in order to build it!

Jan 17, 2008

I'm syndicated, and it has nothing to do with PHP!

I was running some test queries using my usual spread of values and came across a result of "Lieutenant Fluffy?" which was far too whimsical a topic to ignore. Turns out someone thought my holiday photos made for a good creative commons pick. Kudos to the news site for respecting licensing terms properly!

Jan 7, 2008

Houston, we have a bolus

It took three weeks to filter through Kaiser, but my new insulin pump finally arrived (the purple one on top in the photo below is the new one). My old pump was about seven years old when it broke down for the second time (the first was under warranty, about four years ago). By the way, the old one *IS* turned on, and you *SHOULD* see something on the display.... Hence my problem... Anyway, I wasn't expecting it till tomorrow, but Christmas came early (well, I guess technically late) 'cause the FedEx guy rang the bell just as I was headed out the door for work.

SHINY!!!!!! I haven't had a decent basal profile in nearly a month, so I ripped open the packaging, read the important parts of the instructions and plugged in.... Ah, that's the stuff... I'm now reading through the rest of the manual, here are my thoughts so far:

  • Batteries: A+ The MiniMed508 takes three Energizer 357s which... while not terribly hard to find, can be pretty annoying when you forget to replace them and the unit shuts down at 1AM... The new 722 fixes that by taking the larger, but more readily available AAA size.
  • Menus: B There's more complexity to the menus (driven by a wider feature set), but the most common task (bolusing) is a quick single-button action. I'd have marked this as an A, but for the fact the bolus amounts no longer wrap around. Pressing down from 0.0 stays at 0.0, and pressing up from 10.0 stays at 10.0. Update: Turns out the wrap-around does work. You just have to enable it.... Not sure how I enabled it though... Anyway, I'm leaving this at B since the nag-messages are a bit too insistent... I know what I'm doing, dammit... you don't have to be so annoying!
  • Backlight: C Same backlight, still not quite enough contrast...
  • Delivery mechanics: A Blatantly stolen from Disetronic (a competitor), but a good steal. Smoother, quicker delivery and a nice reservoir window to boot.
  • Loading mechanics: A- The reservoir-vial interface is a tiny bit temperamental, but once I get the knack of it, I expect cartridge loading will be twice as fast and accurate as it used to be. The priming mechanics also manage to overcome one of my long standing gripes about volume detection.
  • Interface/API: F Same gripe as I had with the MM508. The system is capable of interfacing with external equipment, but the company won't share anything useful about their specs (I knew this from research beforehand, but I can still complain about it).
  • Durability: TBD We'll see how this one puts up with my.... abuse is too harsh a word.... "active lifestyle".... Let's go with that... At least it's (supposedly) more water-proof than the old one...