Still trying to get it all out: Understanding Opcodes

A blog reader (I have readers???) recently shared his wishlist, "I'm trying to figure out how to show the opcodes like you have in your post...". I promised that I'd throw something together, so here it is:

Slow down, wtf is an "Opcode"?

Short answer: It's the compiled form of a PHP script, similar in principle to Java bytecode or .NET's MSIL. For example, say you've got the following bit of PHP script:

<?php
 echo "Hello World";
 $a = 1 + 1;
 echo $a;

PHP (and its actual compiler/executor component, the Zend Engine) is going to go through a multi-stage process:

Scanning (a.k.a. Lexing) - The human readable source code is turned into tokens.
Parsing - Groups of tokens are collected into simple, meaningful expressions.
Compilation - Expressions are translated into instructions (opcodes).
Execution - Opcode stacks are processed (one opcode at a time) to perform the scripted tasks.

Side note: Opcode caches (like APC) let the engine perform the first three of these steps, then store the compiled form so that the next time a given script is used, the engine can run the stored version without redoing those steps only to come to the same result.

Er... okay... can you elaborate a little? What's lexing? I thought superman put him in jail...

That's Lex Luthor, you nit-wit! The most expedient way to explain lexing is by example. Take a look at the manual page for token_get_all(); this gem is actually a wrapper around the Zend Engine's own language scanner. Play around with it a bit, and you'll notice that plugging the short script above into it will produce something like this (the numeric token IDs vary between PHP versions):

Array
(
   [0] => Array
       (
           [0] => 367
           [1] => <?php
       )
   [1] => Array
       (
           [0] => 316
           [1] => echo
       )
   [2] => Array
       (
           [0] => 370
           [1] =>
       )
   [3] => Array
       (
           [0] => 315
           [1] => "Hello World"
       )
   [4] => ;
   [5] => Array
       (
           [0] => 370
           [1] =>
       )
   [6] => =
   [7] => Array
       (
           [0] => 370
           [1] =>
       )
   [8] => Array
       (
           [0] => 305
           [1] => 1
       )
   [9] => Array
       (
           [0] => 370
           [1] =>
       )
   [10] => +
   [11] => Array
       (
           [0] => 370
           [1] =>
       )
   [12] => Array
       (
           [0] => 305
           [1] => 1
       )
   [13] => ;
   [14] => Array
       (
           [0] => 370
           [1] =>
       )
   [15] => Array
       (
           [0] => 316
           [1] => echo
       )
   [16] => Array
       (
           [0] => 370
           [1] =>
       )
   [17] => ;
)

In the array returned by token_get_all(), you have two types of tokens. Single-character, non-label tokens are returned as just that: the character that was found in the source file at that point. Everything else, from labels, to language constructs, to multi-character operators (like >>, +=, etc...), is returned as an array containing two elements: the token ID (which corresponds to the T_* constants -- e.g. T_ECHO, T_STRING, T_VARIABLE, etc...), and the actual text which that token came from. What the engine actually gets is slightly more detailed than what you see in the output from token_get_all(), but not by much...
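If raw token IDs make your eyes glaze over, token_name() will turn them back into their T_* names. Here's a minimal sketch of that idea; describe_tokens() is a helper name I made up for illustration, and the type hints assume PHP 7 or later:

```php
<?php
// Sketch: list each token with its symbolic name instead of the raw ID.
// describe_tokens() is a hypothetical helper, not part of PHP itself.
function describe_tokens(string $source): array
{
    $lines = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Multi-character token: element 0 is the ID, element 1 the text.
            // (Newer PHP versions also add the line number as element 2.)
            $lines[] = token_name($token[0]) . ' ' . json_encode($token[1]);
        } else {
            // Single-character token such as ';', '=' or '+'
            $lines[] = 'CHAR ' . json_encode($token);
        }
    }
    return $lines;
}

$source = "<?php\n echo \"Hello World\";\n \$a = 1 + 1;\n echo \$a;\n";
echo implode("\n", describe_tokens($source)), "\n";
```

Using names rather than numbers also sidesteps the fact that the numeric IDs shift around between PHP versions.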

Okay, tokenization just breaks the script into bite-size pieces, how does parsing work then?

The first thing the parser does is throw away all whitespace (unlike some other P* language...). From the reduced set of tokens, the engine looks for irreducible expressions. How many expressions do you see in the example above? Did you say three? WRONG! There are three statements, but one of those statements is made of two distinct expressions. In the case of $a = 1 + 1; the first expression is the addition, followed by the assignment to the variable as a second, distinct expression. All together our expression list is:

echo a constant string
add two numbers together
store the result of the prior expression to a variable
echo a variable
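The whitespace-dropping step mentioned above can be roughly imitated in userland. This is only a sketch of the idea, not how the engine does it internally; significant_tokens() is a made-up helper name:

```php
<?php
// Sketch: drop whitespace tokens, roughly mirroring what happens
// before the parser starts grouping tokens into expressions.
// significant_tokens() is a hypothetical helper, not a PHP built-in.
function significant_tokens(string $source): array
{
    $kept = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && $token[0] === T_WHITESPACE) {
            continue; // whitespace carries no meaning for the parser
        }
        // Keep only the token text so the result is easy to read
        $kept[] = is_array($token) ? $token[1] : $token;
    }
    return $kept;
}

print_r(significant_tokens("<?php\n \$a = 1 + 1;\n"));
```

What's left after the open tag is just $a, =, 1, +, 1 and ; which is exactly the raw material for the "add" and "assign" expressions in the list above.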

Hey! That's starting to sound familiar! Did I see that kind of description before?

Oh, you must mean my post about strings (plug). That's correct, because these expressions are exactly the pieces which go into making up oplines! Given the expression list we've just reached, the resulting opcodes look something like:

ZEND_ECHO 'Hello World'
ZEND_ADD ~0 1 1
ZEND_ASSIGN !0 ~0
ZEND_ECHO !0

What happened to $a? What's the difference between ~0 and !0?

Short answer: !0 is $a

So here's the deal.... oplines have five principal parts:

Opcode - Numeric identifier which distinguishes what the opline will do. This is what corresponds to ZEND_ECHO, ZEND_ADD, etc...
Result Node - Most opcodes perform "non-terminal" actions. That is, after executing there's some result which can be consumed as an input to a later opline. The result node identifies what temporary location to place the result of the operation in.
Op1 Node - One of two inputs to the given opcode. An input may be a constant zval, a reference to a previous result node, a simple variable (CV), or in some cases a "special" data element, such as a class definition. Note that an opcode may use both, one, or neither input node. (Some even use more, see ZEND_OPDATA)
Op2 Node - Ditto
Extended Value - Simple integer value used to differentiate specific behaviors of an overloaded opcode.

So obviously the nodes are the most complicated parts of an opline; here are the important parts of what they look like:

op_type - One of IS_CONST, IS_TMP_VAR, IS_VAR, IS_UNUSED, or IS_CV
u - A union of the following elements (the one which is used depends on the value of op_type):
- constant (IS_CONST) - zval value. This node results when you include a literal value in your script, such as the 'Hello World' or 1 values in the example above.
- var (IS_VAR or IS_TMP_VAR or IS_CV) - Integer value corresponding to a temporary slot in a lookup table used by the engine.

Now let's look at the difference between those optypes, particularly with respect to u.var:

IS_TMP_VAR - These ephemeral values are strictly for use by non-assignment non-terminal expressions. They don't support any refcounting because they're guaranteed not to be shared by any other variable. These are denoted in the examples I use on this site (and in VLD output) as tilde characters (~)
IS_VAR - Usually the result of a ZEND_FETCH(_DIM|_OBJ)?_(R|W|RW), or one of the assignment opcodes (which are technically non-terminal expressions since they can be used as inputs to other expressions). Since these are tied to real variables, they have to respect reference counting and are passed about at an extra degree of indirection. They're stored in the same table though. These are denoted by the dollar sign ($)
IS_CV - "CV" stands for "Compiled Variables". These are basically cached hash lookups for fetching simple variables from the local symbol table. Once a variable is actually looked up at runtime, it's stored at an extra level of indirection in an even faster lookup table using an index into a vector. That's what the number in this node denotes. These types of nodes are distinguished by a bang (!)
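To make the ~0 versus !0 distinction a bit more concrete, here's a toy model of the four oplines from earlier, written in PHP. Everything in it (the TOY_* constants, toy_run(), the array layout) is invented for illustration; the real engine is written in C and its structures look quite different. Refcounting, IS_VAR and extended values are left out entirely:

```php
<?php
// Toy model only: names and layout are hypothetical, not the engine's.
const TOY_CONST = 'const'; // literal value, like IS_CONST
const TOY_TMP   = 'tmp';   // ~N temporary slot, like IS_TMP_VAR
const TOY_CV    = 'cv';    // !N compiled variable slot, like IS_CV

function toy_run(array $oplines): string
{
    $tmp = []; // temporaries, indexed by the N in ~N
    $cv  = []; // compiled variables, indexed by the N in !N
    $out = '';

    // Resolve a node ([type, value]) to the value it refers to
    $read = function (array $node) use (&$tmp, &$cv) {
        [$type, $value] = $node;
        switch ($type) {
            case TOY_CONST: return $value;
            case TOY_TMP:   return $tmp[$value];
            case TOY_CV:    return $cv[$value];
        }
        throw new InvalidArgumentException("unknown node type: $type");
    };

    // Process one opline at a time, in order
    foreach ($oplines as $op) {
        switch ($op['opcode']) {
            case 'ZEND_ECHO':
                $out .= $read($op['op1']);
                break;
            case 'ZEND_ADD':
                // The result node says which ~N slot receives the sum
                $tmp[$op['result'][1]] = $read($op['op1']) + $read($op['op2']);
                break;
            case 'ZEND_ASSIGN':
                // op1 names the !N slot, op2 supplies the value
                $cv[$op['op1'][1]] = $read($op['op2']);
                break;
            default:
                throw new InvalidArgumentException("unknown opcode: {$op['opcode']}");
        }
    }
    return $out;
}

$oplines = [
    ['opcode' => 'ZEND_ECHO',   'op1' => [TOY_CONST, 'Hello World']],
    ['opcode' => 'ZEND_ADD',    'result' => [TOY_TMP, 0], 'op1' => [TOY_CONST, 1], 'op2' => [TOY_CONST, 1]],
    ['opcode' => 'ZEND_ASSIGN', 'op1' => [TOY_CV, 0], 'op2' => [TOY_TMP, 0]],
    ['opcode' => 'ZEND_ECHO',   'op1' => [TOY_CV, 0]],
];

echo toy_run($oplines), "\n"; // prints "Hello World2"
```

The point to take away: ~0 lives only long enough to carry the sum from ZEND_ADD to ZEND_ASSIGN, while !0 is the slot standing in for $a.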

Boggle... You...so lost me there...

Yeah, that explanation sort of got away from me didn't it? What can I clear up?

All I really want to know is how to translate some source code into an opcode..list...thingy...

Heh, okay... first off, that "opcode list thingy" is called an op_array, and you can generate those really easily using one of two PECL packages. You can use my parsekit package, which is useful for programmatic analysis of script compilation, but frankly... it's not what you're looking for and there's not much call for scripts analyzing other scripts anyway. I recommend Derick's VLD (Vulcan Logic Disassembler), which is what'll actually generate the kinds of opcode lists you'll see me use in blog posts.

Once you've got it installed (it installs like any other PECL extension), you can run it with a command like the following:

php -d vld.active=1 -d vld.execute=0 -f yourscript.php

Then sit back and watch the opcodes fly! Important note: Using -r with command line code may not work due to a quirk of the way the engine parses files in older versions of PHP (and with older versions of VLD). Be sure to put your script on disk and reference it using -f if -r doesn't work for you.
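If you find yourself doing the "put it on disk, then use -f" dance a lot, a small helper can do it for you. This is just a sketch: vld_command() is a name I made up, it assumes the php binary on your PATH has VLD loaded, and it leaves the temp file behind for you to clean up:

```php
<?php
// Sketch: write a snippet to disk and build the VLD command line for it.
// vld_command() is a hypothetical helper; the flags are the ones shown above.
function vld_command(string $code): string
{
    $path = tempnam(sys_get_temp_dir(), 'vld'); // temp file, not removed here
    file_put_contents($path, $code);
    return 'php -d vld.active=1 -d vld.execute=0 -f ' . escapeshellarg($path);
}

$cmd = vld_command("<?php\n echo \"Hello World\";\n");
echo $cmd, "\n";
// With VLD installed you could then run it, e.g. passthru($cmd);
```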

Holy schnikies! That's a lot of opcodes! How can I tell what they all do?

Take a look at Zend/zend_vm_def.h in your PHP source tree. In here you'll find a meta-definition of every single opcode used by the engine. Side note: It's used as a source for zend_vm_gen.php which generates the actual code file zend_vm_execute.h. How's that for chicken and egg? Every version of PHP since 5.1.0 has required PHP be already built in order to build it!

3 comments:

David Harkness, April 30, 2013 at 1:09:00 PM PDT
Brilliant! Thanks a ton for this post.

Derick Rethans, November 2, 2013 at 9:20:00 AM PDT
There is actually a full list of what the opcodes do now at http://www.php.net/manual/en/internals2.opcodes.list.php

drcreazy, November 19, 2013 at 2:16:00 AM PST
Thanks for sharing. Please add info about step where semantic analyzer works


Jan 19, 2008
