Dynamic Recompilation - An Introduction
M.I.K.e7
June 10th, 2002, 12:22
In another thread sniff_381 asked me to explain what dynamic recompilation (dynarec for short) is.
Since most NG emus have a dynarec now, I thought it might be a good idea to cover the topic in a separate thread.
I have to admit that I haven't programmed a dynarec myself yet, but I have decent knowledge of the basics and some details. I'm not totally sure how much I should go into depth anyway. I guess I'll see if there is enough interest to talk about details.
First of all, the term "dynamic recompilation" is a bit odd, because "to recompile" often means to compile the source code of a program again, but the older term "binary translation" is more precise, since the binary code of a game or application is translated - not the source code - and "dynamic" only means that it is done during runtime and on demand.
So what's the difference to "traditional" or "interpretive" emulation?
An interpretive emulator always picks the instruction the program counter (PC) points to, decodes it, and executes it, just like a real processor would. So every time the emulator comes across the same instruction it has to go through all of these steps again.
In his article How To Write a Computer Emulator (http://fms.komkon.org/EMUL8/HOWTO.html) Marat Fayzullin uses this pseudo C-code sequence to describe the process:
Counter=InterruptPeriod;
PC=InitialPC;
for( ;; )
{
  OpCode=Memory[PC++];
  Counter-=Cycles[OpCode];
  switch(OpCode)
  {
    case OpCode1:
    case OpCode2:
    ...
  }
  if(Counter<=0)
  {
    /* Check for interrupts and do other */
    /* cyclic tasks here */
    ...
    Counter+=InterruptPeriod;
    if(ExitRequired) break;
  }
}
Dynamic recompilation deviates from this procedure in two ways: it works with whole blocks of code instead of single instructions, and those blocks are translated into the machine language of the processor the emulator is running on. There wouldn't be a speed advantage if the translated blocks weren't cached and simply recalled as soon as the program counter enters that block again.
Here is some sample code from my (unfortunately still unfinished) DRFAQ (http://www.dynarec.com/~mike/drfaq.html):
/* the following line defines a 'function pointer', */
/* which can be used to call the code generated by the translator */
/* CTX is the context of the processor, ie. the register values */
int (*dyncode)(Context *CTX);
/* the following simplified loop is often called the "dispatcher" */
for( ;; ) {
  /* try to find the current address of the PC in the translation cache */
  address = block_translated(CTX->PC);
  /* nothing found, ie. first translate the code block starting at the PC address */
  if (address == NULL)
    /* do the translation and add it to the translation cache */
    address = translate_block(CTX->PC);
  /* point the function pointer to the address of the translated code */
  dyncode = (int(*)(Context*)) address;
  /* call the translated code with the current context */
  status = (*dyncode)(CTX);
  /* handle interrupts and other events here */
}
That's basically how a dynarec works, only that I still haven't explained how the translation cache and of course the translation are handled.
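To make the dispatcher idea concrete, here is a minimal sketch that compiles with any C99 compiler. Everything in it is an assumption made for illustration: the Context layout, the array-based cache indexed by the guest PC, and especially the two "blocks", which are ordinary C functions standing in for generated machine code. The function translate_block() takes the context here only so it can count translations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical guest context: a program counter and one register. */
typedef struct { unsigned pc; int r0; int translations; } Context;
typedef int (*BlockFn)(Context *ctx);

#define GUEST_SIZE 16
static BlockFn cache[GUEST_SIZE];   /* toy translation cache, indexed by guest PC */

/* Stand-ins for generated code: each block updates the PC itself. */
static int block_add(Context *ctx)
{
    ctx->r0 += 1;
    ctx->pc = (ctx->r0 < 5) ? 0 : 8;   /* loop back to 0 until r0 reaches 5 */
    return 0;                          /* 0 = keep running */
}
static int block_halt(Context *ctx) { (void)ctx; return 1; }  /* nonzero = stop */

static BlockFn block_translated(unsigned pc) { return cache[pc]; }

static BlockFn translate_block(Context *ctx)
{
    /* A real translator would emit host code here; we just pick a function. */
    BlockFn fn = (ctx->pc == 0) ? block_add : block_halt;
    cache[ctx->pc] = fn;
    ctx->translations++;
    return fn;
}

/* The dispatcher: look up, translate on a miss, call, repeat. */
int run(Context *ctx)
{
    for (;;) {
        assert(ctx->pc < GUEST_SIZE);
        BlockFn fn = block_translated(ctx->pc);
        if (fn == NULL)
            fn = translate_block(ctx);
        if (fn(ctx) != 0)
            return ctx->r0;
        /* interrupts and other events would be handled here */
    }
}
```

The block at PC 0 is executed five times but translated only once, which is the whole point of the cache.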
I spoke of code blocks several times, and it might be a good idea to define the term, since not all will be into compiler theory...
In compilers the smallest block of cohesive instructions is called a basic block. Such a block has a starting point and ends with the next conditional jump or branch, ie. as soon as there is a possibility that the program counter changes apart from pointing to the next instruction the block ends. It's also important that no other code block can jump into the middle of the basic block, only at the starting address, because only that way the compiler can see it as a separate collection of code that can be optimized in every possible way.
Most dynarecs probably work with basic blocks, but some end the block only at the next unconditional jump or branch, which leads to larger blocks, often called translation units. This leads to faster code, because all conditional branches can jump within the translated code without having to go through the dispatcher loop first, but it can be problematic to handle interrupts since you don't have a guarantee that the code returns to the dispatcher. Of course that could be handled within the generated code, but that makes things more complicated.
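The difference between the two block definitions can be sketched in a few lines of C. The instruction classes below (OP_ALU, OP_LOAD, OP_BCOND, OP_JUMP) belong to a made-up guest, and block_length() is a hypothetical helper, not code from any real dynarec.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction classes of a toy guest processor. */
typedef enum { OP_ALU, OP_LOAD, OP_BCOND, OP_JUMP } Op;

/* Returns the number of instructions in the block starting at `start`.
 * basic_blocks != 0: stop at the first branch of any kind (basic block).
 * basic_blocks == 0: stop only at an unconditional jump (translation unit). */
size_t block_length(const Op *code, size_t n, size_t start, int basic_blocks)
{
    assert(start < n);
    size_t i = start;
    while (i < n) {
        Op op = code[i++];             /* the branch itself belongs to the block */
        if (op == OP_JUMP)
            break;
        if (op == OP_BCOND && basic_blocks)
            break;
    }
    return i - start;
}
```

For the sequence ALU, LOAD, BCOND, ALU, JUMP this gives a basic block of three instructions but a translation unit of five.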
I think that's enough for an introduction. If there are any questions feel free to ask, and if there is interest in extending the parts I haven't covered yet, I could write something about the translation cache, some translation problems, register allocation, the difference to threaded interpretation, etc.
Nezzar
June 10th, 2002, 12:45
Whoo, what an article.
I think my C needs some improvements :P
M.I.K.e7
June 10th, 2002, 12:54
I have to admit that I also had to look up function pointers in K&R because I simply didn't need them before I thought of how to call dynamically generated code.
But using function pointers is much cleaner than some assembly hacks I've seen in real dynarecs.
I forgot it today, but tomorrow I might post a nice little example that explains how you can call generated code with a function pointer.
sithspawn
June 11th, 2002, 02:03
A very good starter for emu author enthusiasts (but could be intimidating to some :p). Looking forward to seeing more of this.
M.I.K.e7
June 11th, 2002, 08:30
As announced yesterday, here is a simple example that shows how to call dynamically generated code. For some this might be even more intimidating, but for those who looked up how function pointers work in C and have a slight understanding of x86 assembly it should make clear how the calling process works.
/* In the beginning we'll have to define the function pointer. */
/* I called the function 'dyncode' and gave it an int argument */
/* as well as an int return value just to show what's possible. */
int (*dyncode)(int); /* prototype for call of dynamic code */
/* The following char array is initialized with some binary code */
/* which takes the first argument from the stack, increases it, */
/* and returns to the caller. */
/* Just very simple code for testing purposes... */
unsigned char code[] = {0x8B,0x44,0x24,0x04, /* mov eax, [esp+4] */
                        0x40,                /* inc eax */
                        0xC3                 /* ret */
                       };
/* Include the prototypes of the functions we are using... */
#include <stdio.h>  /* printf */
#include <stdlib.h> /* malloc */
#include <string.h> /* memcpy */
int main(void)
{
  /* To show you that the code can be dynamically generated */
  /* although I defined static data above, the code is copied */
  /* into an allocated memory area and the starting address is */
  /* assigned to the function pointer 'dyncode'. */
  /* The strange stuff in front of the malloc is just to cast */
  /* the address to the same format the function pointer is */
  /* defined with, otherwise you'd get a compiler warning. */
  dyncode = (int (*)(int)) malloc(sizeof(code));
  memcpy((void *)dyncode, code, sizeof(code));
  /* To show that the code works it is called with the argument 41 */
  /* and the retval should be 42, obviously. */
  printf("retval = %d\n", (*dyncode)(41) ); /* call the code and print the return value */
  return 0;
}
This code has been written with GCC in mind, but it should work with any C compiler on any x86 operating system that passes function arguments on the stack.
I originally wrote this example with some ARM machine code instead of x86, and all that I had to change was the definition of the code[] array.
That's the nice thing about working with a function pointer to call dynamic code: apart from the generated code, everything else is totally portable to any system with a C compiler.
A warning to those working with Harvard architecture processors (ie. those with split instruction and data caches):
After copying the code and before calling it you'll have to flush the caches, otherwise the code will be in the data cache but not in the instruction cache and the processor will get into trouble.
While x86 processors nowadays have split L1 caches as well, it's not a problem on these because they solve such issues in hardware to stay transparently compatible with x86 processors that still had a unified cache.
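One caveat when trying this on a current system: many operating systems no longer allow code to run from ordinary heap memory, so the buffer has to be requested as executable. The following is only a sketch under several assumptions: Linux on x86-64, POSIX mmap()/mprotect(), and the System V calling convention where the first int argument arrives in EDI. The name call_generated_inc() is made up, and on any other platform the function simply reports failure.

```c
#define _DEFAULT_SOURCE        /* for MAP_ANONYMOUS on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__) && defined(__linux__)
#include <sys/mman.h>
#endif

/* Runs a tiny generated function that returns arg + 1.
 * Returns 0 and stores the result in *out on success, -1 otherwise. */
int call_generated_inc(int arg, int *out)
{
    assert(out != NULL);
#if defined(__x86_64__) && defined(__linux__)
    static const unsigned char code[] = {
        0x8D, 0x47, 0x01,      /* lea eax, [rdi+1] */
        0xC3                   /* ret              */
    };
    /* Get a writable page first, fill it, then switch it to executable. */
    void *mem = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    memcpy(mem, code, sizeof code);
    if (mprotect(mem, sizeof code, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, sizeof code);
        return -1;
    }
    /* x86 keeps instruction and data caches coherent, so no flush here;
     * other hosts would need one before the call. */
    int (*fn)(int) = (int (*)(int))mem;
    *out = fn(arg);
    munmap(mem, sizeof code);
    return 0;
#else
    (void)arg;
    return -1;                 /* unsupported platform in this sketch */
#endif
}
```

Writing the code first and only then making the page executable avoids having memory that is writable and executable at the same time, which some systems refuse.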
Ok, so much for dynamically generated code...
Anyone still with me?
Shiori
June 11th, 2002, 08:33
Finally, a decent, intelligent thread.... the first for this day, i think. :)
fivefeet8
June 11th, 2002, 08:38
Can Dynarec be done with C++? I am currently learning C++ programming.. Very early learning mind you.. Looks interesting though..
M.I.K.e7
June 11th, 2002, 09:00
@Shiori:
Someone has to raise the level ;-)
@fivefeet8:
I'm not that much into C++ but I think it should be possible to do a dynarec in C++ as well.
IIRC David Sharp did his tARMac project (http://www.dynarec.com/~dave/tarmac/index.html) in C++.
But no matter what a C++ enthusiast will tell you, the language has a certain overhead compared to C, and when you need to use a dynarec to emulate a system at full speed you probably don't want that overhead as well. There is a reason why most emulators are written in plain ANSI C.
M.I.K.e7
June 11th, 2002, 09:23
Those who are still following this thread probably noticed that dynamic recompilation is far from being trivial, and I haven't even touched the more complex issues yet.
This leads to the ultimate question: When does it make sense to use dynamic recompilation anyway?
The advantages of dynamic recompilation are:
- more speed
- more speed (nothing else really)
The disadvantages of dynamic recompilation are:
- quite complicated
- hard to debug
- not as exact as interpretive emulation
- not portable to systems with other processors
- problems with self-modifying code
This means, as long as you can pull the whole emulation off at a decent speed by using traditional emulation (preferably even with a portable solution), just do it and don't give dynamic recompilation a second thought.
Although I've seen people toying with dynarecs for 6502 and similar 8-bit processors it's not worth the hassle, since a nice CPU core written in C would be portable to different systems and should run at full speed on any current and even most older computers.
Even most 16-bit processors should be tried in interpretive emulation before thinking of dynamic recompilation. One of the few reasonable 16-bit candidates would be the 68000, because it is widely used and quite complex, so a dynarec for it might speed up a lot of emulators if you stick to the same API.
Where dynamic recompilation really shines is 32-bit and 64-bit processors, because it makes sense to do operations on hardware registers when the original code does so. Especially the MIPS (used in PSX and N64, eg.) and SuperH (Saturn eg.) processors with their simple instruction set should be emulated via dynamic recompilation to get a decent speed.
One thing to keep in mind is that an emulator with a dynarec needs a lot of RAM, because it not only needs at least the same amount of memory as a traditional emulator but additionally also memory for the translation cache, ie. the code blocks that have already been translated.
All in all, it's a good idea to start with interpretive emulation to see if that's fast enough and switch to dynamic recompilation when it isn't. During the switching process it's a good idea to keep both CPU emulations so you can test the dynarec against the interpreter, which should make debugging a little easier.
M.I.K.e7
June 11th, 2002, 11:14
Some emulators that claim to use dynamic recompilation actually utilize a technique called threaded interpretation, eg. Generator does that (http://www.squish.net/generator/docs.html).
How does threaded interpretation differ from dynamic recompilation and what do they have in common?
Both techniques work on code blocks and "translate" these into some other representation. This means that both share the disadvantage of needing more memory than a traditional emulator and having problems when code that has already been translated is modified by the emulated program (keyword: self-modifying code).
But instead of translating to machine code, threaded interpretation fills the translation cache with addresses of the instruction emulation routines, ie. each instruction found in the source binary will be translated to an address and parameters that point to a piece of code in the emulator that emulates this instruction.
The only thing that you spare compared to a traditional emulator so far is the repetitive decoding of all instructions in a block. But threaded emulation can take one further step. Because you have to analyze a whole block of code, you can find out which condition flags need to be calculated for an instruction, ie. if a certain flag is overwritten by the side-effect of a following instruction before it can be tested or taken as input by another, it doesn't have to be calculated. Since calculating condition flags can often take more than half the time needed to emulate an instruction, this approach can lead to a noticeable speed improvement.
The emulator Generator mentioned above has two different emulation functions for each single instruction, one that calculates all flags and another that doesn't calculate any flags. The address of one of these functions will be added to the translation cache as appropriate.
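Deciding which variant to pick comes down to one backward pass over the block. The sketch below is not taken from Generator; the Insn structure and mark_needed_flags() are hypothetical, and all flags are treated as one unit to keep it short. It conservatively assumes that the flags may still be read after the block ends.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int sets_flags;   /* instruction overwrites the condition flags   */
    int reads_flags;  /* instruction tests or consumes the flags      */
    int need_flags;   /* output: must the flag result be calculated?  */
} Insn;

/* Walk the block backwards and mark the instructions whose flag
 * results can actually be observed by something later on. */
void mark_needed_flags(Insn *block, size_t n)
{
    assert(block != NULL || n == 0);
    int live = 1;     /* conservative: code after the block may read flags */
    for (size_t i = n; i-- > 0; ) {
        block[i].need_flags = block[i].sets_flags && live;
        if (block[i].sets_flags)
            live = 0; /* anything earlier is overwritten by this one */
        if (block[i].reads_flags)
            live = 1; /* ... unless this instruction reads it first  */
    }
}
```

With the result in need_flags, a translator in the style described above would pick the flag-calculating routine only for the marked instructions.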
The advantage of threaded emulation is that it can be portable when it is programmed in a high-level language (like C) and works with function pointers.
The disadvantage is that you cannot access hardware registers directly (unless the instruction routines are written in assembly and you are using static register allocation, but then it wouldn't be portable and you could also use dynamic recompilation), so you still need to access the register file in memory every time an instruction reads or alters a register.
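Here is a rough idea of what threaded interpretation can look like in portable C. The guest is invented for this sketch: 16-bit instructions with the opcode in bits 12-15, the destination register in bits 8-9 and an immediate in bits 0-7. The names Threaded, predecode() and run_block() are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int reg[4]; } Cpu;      /* register file kept in memory */

typedef struct Threaded Threaded;
typedef void (*Handler)(Cpu *cpu, const Threaded *t);
struct Threaded { Handler fn; int rd; int imm; };  /* one pre-decoded instruction */

/* Instruction emulation routines; note they still go through the register file. */
static void op_li(Cpu *c, const Threaded *t)   { c->reg[t->rd]  = t->imm; }
static void op_addi(Cpu *c, const Threaded *t) { c->reg[t->rd] += t->imm; }

/* "Translation": decode every instruction once and store handler + operands. */
size_t predecode(const unsigned short *code, size_t n, Threaded *out)
{
    for (size_t i = 0; i < n; i++) {
        unsigned op = code[i] >> 12;
        assert(op <= 1);                 /* only two opcodes in this toy guest */
        out[i].fn  = (op == 0) ? op_li : op_addi;
        out[i].rd  = (code[i] >> 8) & 0x3;
        out[i].imm = code[i] & 0xFF;
    }
    return n;
}

/* Execution: no decoding left, just one indirect call per instruction. */
void run_block(Cpu *cpu, const Threaded *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        t[i].fn(cpu, &t[i]);
}
```

The block can be run as often as needed after a single predecode() call, and the whole thing stays portable because nothing but function pointers is involved.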
I think that should be enough about threaded interpreting...
Maybe I should cover register allocation next, since I already mentioned it here.
ShADoWFLaRe85
June 11th, 2002, 14:00
Holy Cow! :eek:
This is probably one of the most informative threads I've ever seen!
Nezzar
June 11th, 2002, 15:39
It is the ONLY informative thread I've ever seen :p j/k
Skye
June 11th, 2002, 19:44
TOO much information! ;)
M.I.K.e7
June 12th, 2002, 07:47
Ok, maybe I should sum up some of the stuff here that no one can complain about too much information ;-)
Dynamic recompilation also known as dynamic binary translation is the process of translating binary code blocks during runtime into the binary code of the host machine.
The translated code is collected in a translation cache.
In a dispatcher loop the dynarec decides if certain code blocks still have to be translated and eventually calls the translated code.
The generated code is ideally called using a function pointer.
Here is an example how the function pointer code above has to be changed to get the same result on an ARM processor based machine:
unsigned long code[] = {0xE2800001, /* ADD R0, R0, #1 */
0xE1A0F00E /* MOV PC, LR */
};
As you can see only the code[] array (that would be the generated code in a real dynarec) has to be changed. The switch from "unsigned char" to "unsigned long" has only been made due to the fact that ARM has 32-bit fixed length instructions, but since we cast it to the function pointer later there is no difference.
Of course the code generator has to be different on each processor, but it makes sense to make everything else portable, thus you only have to write a new code generator but not a totally new emulator.
Mind you that an ARM processor with separate instruction and data caches (eg. the StrongARM) needs its caches to be flushed before the code can be called, but that's operating system specific and I won't go into that detail here.
fivefeet8
June 12th, 2002, 08:09
Question? Does epsxe use Dynarec? I think Bleem did? I wonder what other emus use Dynarec..
Nezzar
June 12th, 2002, 08:15
ePSXe surely does. DynaRec is heck fast and so epsxe is :p
UltraHLE/SupraHLE uses a DynaRec-Core as well, afaik.
M.I.K.e7
June 12th, 2002, 08:20
Every decent PSX and N64 emulator should be using a dynarec today, and ePSXe surely does. I'm not totally sure about Bleem, but I guess it had one too.
There is a list of open source emulators with dynarecs on Dynarec.com (http://www.dynarec.com/dynarecs.html).
FPSE used to be on the list too, but then it went closed source and the old source was rather outdated, so I removed it. I guess I'll add GPSE to the list someday.
As I said in some previous posting those dynarecs for 6502 are merely toys and there isn't much need for them. The really interesting things are the dynarecs for processors like MIPS that would need a monster machine to run at full speed using interpretive emulation.
M.I.K.e7
June 12th, 2002, 08:23
UltraHLE was the first N64 emulator with a dynarec IIRC. And it's also said to use some dirty tricks. I might get to one of these tricks when I talk about translation caching.
M.I.K.e7
June 12th, 2002, 09:33
One of the big advantages of dynamic recompilation is that you can actually use the registers of the host machine when an emulated instruction uses registers. This can not only reduce the amount of instructions needed to emulate another instruction but also minimize slow memory accesses for every referenced register.
The process of mapping the emulated registers to the registers of the host machine is called register allocation.
Unfortunately a lot of dynarecs in emulators still fetch all needed register values from the register file (just a memory structure where the contents of all registers of the emulated processor are stored) at the beginning of each emulated instruction and write the result back to the register file afterwards. Only a little better are those that don't store the value of the result register if it is used as input in the following instruction.
This really spoils a large part of what dynamic recompilation is about, but is still seen quite often in "real world" examples.
There are two different methods of allocating registers:
- Static register allocation: This means that in every translated block the same emulated registers are always allocated to the same host registers. When the host machine has enough registers to hold all emulated registers this is the optimal solution, but there are also some advantages even if it has fewer registers (mainly related to timing and translation block handling; I'll cover that later). In the latter case this means that only some of the emulated registers are held in host registers (ideally the most often used ones) and for the remaining ones the register file is accessed.
- Dynamic register allocation: In every translation block the registers are allocated differently. Ideally you load the values of the emulated registers into the host registers at the beginning of the block and store those that were modified back to the register file before the block returns control to the dispatcher loop. If there aren't enough registers to hold all registers used in the block you'll have to store the value held in a host register to free it for the next one.
Implementation wise static register allocation is rather easy, because you always use the same host registers to hold specific emulated registers and access the memory locations in the register file for all the others. This also means that you always load the same register values and store them to the same memory locations afterwards no matter what translated code block you are executing. So you only need one setup/clean-up code for all blocks which is occasionally called glue code because it's the thing that connects the emulator and the generated code.
The implementation of dynamic register allocation is a bit more complicated though. First of all there is more bookkeeping to do, since the emulator has to remember these facts:
- which host register holds which emulated register: you need to know if the register is already in use and where to store the value to if you need to free the register
- which emulated register is held in which host register: this tells you if the register is already allocated, and if it is you know which register to use in the generated code
- has the value held in the host register been modified? This isn't really necessary, but it's a good idea not to store a value to the register file that hasn't been changed since you spare another unnecessary memory access.
How does register replacement work when you run out of registers?
There are very many methods that could be applied, and probably only one would be ideal, which basically means that you replace that register which isn't needed in that block anymore. Since you do have the entire block you could actually go through it backwards to find out if there is a register the code doesn't use anymore, but that can be a bit tedious.
That ideal solution reminded me of the best but theoretical solution for a page replacement algorithm in operating systems, so I came up with the idea of using another page replacement algorithm called second chance, which is similar to LRU (least recently used) but simpler to implement.
For that algorithm you set a reference flag for the host register every time it is referenced during code generation. When you need a register you go through all host registers in a circle (using a modulo operation with the maximal number of host registers) and pick the next register where the reference flag isn't set, while unsetting the reference flag of all registers you have to skip. Probably sounds a bit weird, but it's easy to implement and should lead to good results.
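A minimal sketch of second-chance replacement, as it could run inside a code generator, might look like this. The structures, the number of host registers and the function names are assumptions for the example. A real dynarec would also track the modified flag and emit a store for an evicted register that has changed; here the evicted guest register is only reported back to the caller.

```c
#include <assert.h>

#define NUM_HOST_REGS 4      /* assumed number of allocatable host registers */
#define NO_GUEST (-1)

typedef struct {
    int guest;               /* emulated register held here, or NO_GUEST */
    int referenced;          /* second-chance reference flag             */
} HostReg;

typedef struct {
    HostReg host[NUM_HOST_REGS];
    int hand;                /* where the circular search resumes        */
} RegAlloc;

void regalloc_init(RegAlloc *ra)
{
    for (int i = 0; i < NUM_HOST_REGS; i++) {
        ra->host[i].guest = NO_GUEST;
        ra->host[i].referenced = 0;
    }
    ra->hand = 0;
}

/* Returns the host register for `guest`, allocating one if necessary.
 * *evicted receives the guest register that was thrown out, or NO_GUEST. */
int regalloc_get(RegAlloc *ra, int guest, int *evicted)
{
    assert(guest != NO_GUEST);
    *evicted = NO_GUEST;

    /* Already allocated? Then just note the reference. */
    for (int i = 0; i < NUM_HOST_REGS; i++) {
        if (ra->host[i].guest == guest) {
            ra->host[i].referenced = 1;
            return i;
        }
    }
    /* Go around in a circle: clear the flag of every referenced register
     * we skip and take the first one whose flag is not set. */
    for (;;) {
        int idx = ra->hand;
        HostReg *h = &ra->host[idx];
        ra->hand = (ra->hand + 1) % NUM_HOST_REGS;
        if (h->referenced) {
            h->referenced = 0;           /* give it a second chance */
            continue;
        }
        *evicted = h->guest;             /* NO_GUEST if the register was free */
        h->guest = guest;
        h->referenced = 1;
        return idx;
    }
}
```

The loop always terminates, because after one full circle every reference flag has been cleared.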
Of course if you come up with an easy implementation of the optimal solution that would be perfect.
I guess that's the most important things about register allocation...
If you want to know how to handle 64-bit registers on the IA-32 then take a look at the 1964 documentation (http://e64.wwemu.com/emus/1964/1964_recompiler_doc_101.pdf).
sithspawn
June 12th, 2002, 10:19
You should go write a book about this, and publish it, and donate some of the money you earned to NGEmu :p
M.I.K.e7
June 12th, 2002, 10:50
You might laugh, but I already thought of writing something book-like about that topic, but it's always a time issue and I doubt that anybody would want to buy or even publish it.
If I should be insane enough to write a larger text about it, I'd probably distribute it as a PDF.
M.I.K.e7
June 12th, 2002, 12:33
Our next topic is the translation cache. But I won't go into detail how the generated code blocks are actually stored and freed, since I haven't given it too much thought yet. It is most likely a good idea, though, to allocate a large block of memory via the operating system and then do your own arena management in that chunk for performance reasons.
The most interesting thing is how to remember which block is already translated and how to do that fast.
For 8-bit processors which normally have a 16-bit address range it would be easy to simply make an array of addresses and mark each single address when it is the start address of a recompiled block. But for those processors where dynamic recompilation is really interesting, this simple approach does not work, even if you take into account that most of the processors only permit aligned instructions. So other, less memory consuming methods have to be found.
When you come straight from computer science studies the most obvious solution would be a hash. That means a value is mapped through a special hash function (often containing a modulo operation to define the confines) onto a much smaller array. The advantage is that in the best case you only perform the simple function and get the key to the memory location where you find the address of the recompiled code with one memory access (or a zero if it hasn't been recompiled yet). The problem is that the hash function is likely to generate the same key for different values, which happens more often when the hash array is too small and/or the hash function isn't that good. When this happens you get a so-called collision, which has to be solved somehow. The typical solution is to keep a linked list of all values that collided, but that means that you need several lookups since you'll have to search the list in linear order until you find what you want.
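As a sketch of the hashing approach with linked lists for collisions, assuming 32-bit guest addresses and 4-byte aligned instructions (the table size, the hash function and all names are arbitrary choices for this example):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HASH_SIZE 1024u          /* assumed table size */

typedef struct Entry {
    uint32_t guest_pc;           /* start address of the original block */
    void *host_code;             /* address of the translated code      */
    struct Entry *next;          /* other blocks with the same hash key */
} Entry;

static Entry *table[HASH_SIZE];

/* Drop the two alignment bits, then fold into the table range. */
static unsigned hash_pc(uint32_t pc) { return (pc >> 2) % HASH_SIZE; }

/* Returns the translated code for pc, or NULL if not translated yet. */
void *cache_lookup(uint32_t pc)
{
    for (Entry *e = table[hash_pc(pc)]; e != NULL; e = e->next)
        if (e->guest_pc == pc)
            return e->host_code;
    return NULL;
}

/* Returns 0 on success, -1 if out of memory. */
int cache_insert(uint32_t pc, void *host_code)
{
    assert(host_code != NULL);   /* NULL is reserved for "not translated" */
    Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    unsigned h = hash_pc(pc);
    e->guest_pc = pc;
    e->host_code = host_code;
    e->next = table[h];          /* collisions simply extend the list */
    table[h] = e;
    return 0;
}
```

Two addresses that are HASH_SIZE * 4 bytes apart collide with this hash function, so a lookup for either of them has to walk the list.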
So hashing is a possible solution, but not necessarily the best one.
Another problem that might have to be solved during the translation cache lookup is to detect self-modifying code. If the code behaves nicely you can skip that, but if self-modifying code occurs from time to time you'll have to detect it.
The typical computer science solution would likely be to run some kind of checksum (eg. CRC) over the original code, and regenerate the checksum every time the block is about to be executed. If the checksum has changed the block has to be translated again.
The problem here is that checksums can be fooled when several values (in this case instruction encodings) change but the change is not visible in the checksum. Also, recalculating the checksum before every run of the code would hit performance hard.
Since traditional solutions aren't perfect let's see what alternatives there are...
A dirty trick that UltraHLE is said to be using works as follows:
Instead of using a data structure to memorize which blocks have been translated, the first instruction of the block is replaced by an illegal instruction that also contains the offset to the generated code - since MIPS has 32-bit instructions that's quite possible. So the emulator just takes a look at the start of the block to recognize if it has been translated already or if it has to be translated still. A side-effect is that code that modifies the first instruction of the block leads to the block being translated again automatically.
The disadvantage is that self-modifying code is only detected when the illegal instruction is replaced, and since you modify the original code you might run into problems when that block is actually just a sub-block of a larger block that might run into that illegal instruction. This could be handled of course, when the original instruction is stored somewhere, but it makes things a bit more complicated.
The elegant solution would be a paged translation map. When you don't know what to do in emulation, it often helps to take a look at how the hardware does things.
Most of the processors that are interesting candidates for dynamic recompilation organize their memory in pages, ie. the higher part of the address is the page number and the lower part is the page offset. The typical page size is 4K, which means that in a 32-bit address space you'd have 20 bits for the page number and 12 bits for the page offset. Even if you want to keep track of all possible pages (which isn't normally necessary) you'd need 1M entries (= 20-bit page number) x 4 bytes (size of an address on a 32-bit system) = 4MB, which might sound like much but actually keeps track of all pages in a 4GB address range. Now you still need 4K entries x 4 bytes = 16KB per page to have all the addresses, but you only need to keep track of pages that actually contain code, so that's far less than you might assume, and when you have a processor like MIPS where all instructions have to be aligned to 32-bit (ie. the start address of each instruction has the lower two bits cleared) it's only 1K entries x 4 bytes = 4KB.
When the emulation jumps to a certain address you first look at the page (by shifting the address right by 12 bits) to see if code in that page has already been translated. If there is no translated code yet you allocate a new memory location that is large enough to hold all addresses for that page, enter a pointer to that area in the page number entry, and finally enter the address of the translated code in the location of the page offset at which the original code block starts.
The lookup needs just two memory accesses, one to find the location of the page directory via the page number and another to look up the address of the translated code via the page offset in the page directory.
I hope this doesn't sound too complicated, because it really isn't...
The paged transmap also allows for a solution to identify self-modifying code. Every time a write access is performed, it is checked whether the page number for that access indicates that code on that page has been translated. If so, the cached code for that page is freed and the address in the page number entry is cleared to force a recompilation of the code. This might sound crude, but in paged environments data and code are normally on different pages, so it is really likely that the code has been modified.
Since I was talking about Harvard architectures (ie. split instruction and data caches) before, there is another way to detect self-modifying code in that case. Since those architectures have to flush their caches you can try to trace that (either some system call or the processor operation) and free the appropriate code that is no longer valid. According to an article (http://devworld.apple.com/technotes/pt/pt_39.html) the official 68K emulator for PowerMacs uses that solution.
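Put into C, the paged translation map and the write check could be sketched like this. It assumes a 32-bit guest address space, 4K pages and 4-byte aligned instructions as on MIPS; the function names are made up, and freeing the cached host code itself is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS)
#define NUM_PAGES (1u << (32 - PAGE_BITS))   /* 1M page entries            */
#define SLOTS     (PAGE_SIZE / 4)            /* 1K slots per page (MIPS)   */

/* One pointer per page; NULL means "no translated code on this page". */
static void **page_table[NUM_PAGES];

/* Two memory accesses: page entry first, then the slot within the page. */
void *transmap_lookup(uint32_t pc)
{
    void **page = page_table[pc >> PAGE_BITS];
    if (page == NULL)
        return NULL;
    return page[(pc & (PAGE_SIZE - 1)) >> 2];
}

/* Records translated code for the block starting at pc; 0 on success. */
int transmap_insert(uint32_t pc, void *host_code)
{
    assert((pc & 3) == 0);                   /* aligned instructions only */
    void ***entry = &page_table[pc >> PAGE_BITS];
    if (*entry == NULL) {
        /* calloc zero-fills, which reads as NULL on common platforms */
        *entry = calloc(SLOTS, sizeof(void *));
        if (*entry == NULL)
            return -1;
    }
    (*entry)[(pc & (PAGE_SIZE - 1)) >> 2] = host_code;
    return 0;
}

/* To be called for every guest write. Returns 1 if the page contained
 * translated code, which is then discarded to force a retranslation. */
int transmap_note_write(uint32_t addr)
{
    void ***entry = &page_table[addr >> PAGE_BITS];
    if (*entry == NULL)
        return 0;
    free(*entry);        /* a real dynarec would free the cached code too */
    *entry = NULL;
    return 1;
}
```

On a 64-bit host the top-level table takes 8MB instead of 4MB because pointers are twice as large, but the per-page tables are still only allocated for pages that contain code.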
Shiori
June 12th, 2002, 14:26
Originally posted by M.I.K.e7
I hope this doesn't sound too complicated, because it really isn't.
Shiori
June 12th, 2002, 14:28
Don't get me wrong, but it's really a good read. :) Makes me sound a LOT smarter when I read it aloud. :D
But are you sure you aren't involved in emu coding?
M.I.K.e7
June 12th, 2002, 14:46
Originally posted by Shiori
Don't get me wrong, but it's really a good read. :)
Well, cute little girls like the one in the picture shouldn't try to understand that anyway, maybe when she's a bit older ;-)
Makes me sound a LOT smarter when I read it aloud. :D
Haven't thought about that yet. Maybe I should try too? ;-)
But are you sure you aren't involved in emu coding?
So far I haven't worked on a single emulator, but I've done a lot of reading and had some discussions with Neil Bradley or Bart Trzynadlowski.
I started writing a dynarec some time ago, but stalled the project because I didn't like the way it turned out, and thought it might be a good idea to study some theories and rethink some of the design issues before I try it again.
So far I'm still in the theory phase, since I'm still not sure if I know enough about the topic, but sometimes I feel that I give it more thought than some programmers who actually wrote a complete dynarec.
M.I.K.e7
June 14th, 2002, 12:49
Who actually waited for the next article yesterday? ;)
Sorry about the delay, but I had to make up my mind what to cover next.
I think it's time to tackle the most difficult part of a dynarec, the translation.
You should already know that a block of the original code is translated to a block of host machine code.
To join the generated code with the emulator you also need some glue code to set up registers and write back the values after the block has been executed. For static register allocation you have something you could call a master block, since it does all the setup and cleaning for all generated blocks. With dynamic register allocation on the other hand you need a prologue and epilogue for each generated block, as register allocation can be different from block to block.
In following postings I'll discuss different forms of translation methods...
M.I.K.e7
June 14th, 2002, 13:42
Using direct translation the code block is processed linearly and each instruction is translated separately. Often the code that generates the translation is placed directly after the decoding of the instruction, ie. the instruction decoder looks very much like the one from an interpretive emulator, with the exception that machine code is generated instead of the instruction being simulated.
The advantage of that method is that you can simply transform an interpretive emulator into a dynamic recompiler.
But there are many disadvantages.
First of all, optimizing the code is very hard because you could only reasonably do a one or two instruction lookahead to test if the current instruction could be combined with the following ones for a better translation.
Retargetting the dynarec to a different host processor can be quite tedious, since you'll have to go through the whole decoder loop, which should actually be portable.
Unfortunately I see code like that much too often in real-world examples.
M.I.K.e7
June 14th, 2002, 14:25
Once I was past the direct translation stage - I was heading in that direction and I didn't like it, which was one of the reasons why I stalled my dynarec and started research again - I thought of making the dynamic recompiler more portable.
Wouldn't it be cool to have some virtual intermediate processor (VIP, ie. the original code is translated to VIP "instructions", which are then translated to target instructions) and enable dynamic recompilation between a whole lot of processors without too many translators (just two per processor, one that translates to VIP while the other translates from VIP)?
I thought so, and since I didn't know of the failed UNCOL project (they tried to define some kind of universal machine language and it didn't work out, but I guess I wouldn't have cared anyway) I started analyzing maybe dozens of processor architectures, and after several months I knew much more about several architectures, but I gave up on the VIP idea.
It is true that basically all that processors do is calculate, but there are surprisingly many differences...
Just take the number of logical instructions, where some architectures have just the standard ones while others have a whole lot of combinations. Or the fact that some architectures handle the carry flag differently during subtraction (ie. they borrow) while others don't have any flags at all.
The biggest difference is probably division, where some architectures also calculate the remainder, others have a separate instruction to calculate the remainder, and some require you to calculate the remainder via multiplication. Also some architectures produce results only in special registers, do division steps (ie. only calculate a certain number of bits of the result per instruction), or don't have a division instruction at all.
If you also add strange instructions like the "add and branch" from PA-RISC you get a whole lot of different instructions.
With the VIP there would be two extremes, either you end up having all possible instructions from all architectures you know (and there will still be many you don't know), or you make it very simple and compose translations for more complex instructions with a sequence of VIP instructions. Either way it's bad:
When you make the VIP too complex, porting will be very tedious since you'd have to provide translations for hundreds and hundreds of VIP instructions for each new host processor and no one will want to do that, which nullifies the whole idea behind using the VIP in the first place.
If you make it simple, then it will be a charm to port, but the quality of the target code will be very bad, because it is very hard to optimize code when even very simple original instructions end up as a long sequence in VIP. For example, many processors have different condition flags and some don't have any, so flags couldn't be handled implicitly in a simple VIP. You'd have to translate them into separate VIP instructions, which could transform a very simple ADD instruction into a long VIP sequence.
Finding a compromise between these extremes will be very hard and most likely result in narrowing the number of supported architectures, which again ruins the idea of using a VIP.
So that isn't really a good idea either...
M.I.K.e7
June 14th, 2002, 14:47
Ok, since the first two approaches to translation aren't that recommended, what else could be done?
In my opinion, the translator should make it easy to optimize translations and should not be too closely linked to the instruction decoder, to make porting easier.
The best solution is probably to generate a block decode structure, ie. as the instructions of a block are decoded the decode information is added to a structure (you might even add some additional information about register and flag use), which is then handed to the translator that is a totally different module.
This method is relatively fast, since you don't do several translations as with the VIP, and lookaheads are much easier in a pre-decoded block structure than in direct translation, which also makes peephole optimization much easier.
You still have to write a special translator for each host system, but since the translator is a separate module that communicates with the decoder via a data structure porting is a much cleaner process. Not to mention that you are able to do very specific optimization, which probably would not be possible if you were using a stronger abstraction.
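As a rough illustration of the block decode structure, here is a small C sketch. Everything in it is an assumption made for the example: the two-opcode guest instruction set, the field names of DecodedInsn, and the function names. The decoder knows nothing about the host; the ARM translator only sees the structure and can look ahead in it to fold several increments into one ADD.

```c
#include <stddef.h>
#include <stdint.h>

enum { GUEST_INC_R0 = 0x01, GUEST_RET = 0x02 };  /* made-up guest opcodes */
enum { MAX_INSNS = 64 };

typedef struct {
    uint8_t  op;            /* decoded guest operation */
    uint32_t imm;           /* immediate operand, if any */
    uint16_t regs_read;     /* bit n set: guest register n is read */
    uint16_t regs_written;  /* bit n set: guest register n is written */
} DecodedInsn;

typedef struct {
    DecodedInsn insn[MAX_INSNS];
    size_t count;
} DecodedBlock;

/* Decoder module: host independent. Stops after RET or at an unknown opcode. */
void decode_block(const uint8_t *guest, size_t len, DecodedBlock *blk)
{
    blk->count = 0;
    for (size_t pc = 0; pc < len && blk->count < MAX_INSNS; pc++) {
        DecodedInsn *d = &blk->insn[blk->count];
        d->op = guest[pc];
        d->imm = 0;
        d->regs_read = 0;
        d->regs_written = 0;
        if (guest[pc] == GUEST_INC_R0) {
            d->imm = 1;
            d->regs_read = 1u << 0;
            d->regs_written = 1u << 0;
        } else if (guest[pc] != GUEST_RET) {
            break;                      /* unknown opcode ends the block */
        }
        blk->count++;
        if (guest[pc] == GUEST_RET)
            break;
    }
}

/* Translator module for an ARM host. Returns the number of words emitted. */
size_t translate_block_arm(const DecodedBlock *blk, uint32_t *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < blk->count && n < cap; i++) {
        const DecodedInsn *d = &blk->insn[i];
        if (d->op == GUEST_INC_R0) {
            uint32_t sum = d->imm;
            /* Peephole: fold following increments into one ADD (8-bit immediate). */
            while (i + 1 < blk->count &&
                   blk->insn[i + 1].op == GUEST_INC_R0 &&
                   sum + blk->insn[i + 1].imm <= 255) {
                sum += blk->insn[i + 1].imm;
                i++;
            }
            out[n++] = 0xE2800000u | sum;   /* ADD R0, R0, #sum */
        } else {
            out[n++] = 0xE1A0F00Eu;         /* MOV PC, LR */
        }
    }
    return n;
}
```

Porting to another host would mean replacing only translate_block_arm, while decode_block and the structure stay as they are.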
PNaveS
June 14th, 2002, 18:45
The ultimate tutorial for conceptual dynarec programming!!! Good job Mike! :)
YES, it will be great if you put these articles together in PDF and contribute it to emulation programming sites like "how to emulation" in emuhq.com!
M.I.K.e7
June 18th, 2002, 12:38
I wouldn't call it the ultimate tutorial, because you'll find other sources with much more practical information, simply because of my lack of experience in that field.
But you'd probably have to search hard to find a single source that discusses more of the different possible techniques, since I had some time to play with these possibilities in my mind.
Some of those possibilities are easy to reject of course, but others might have their use for some special cases, even if they don't seem to be ideal from a general point of view. That's why I think it's important to know many of the techniques to have alternatives if the best general approach doesn't suit the problem well.
M.I.K.e7
June 19th, 2002, 12:45
After some days without new information I think I should talk a little about code generation, since I don't want to dive into the platform specific translations yet.
One possibility that is used by a few dynarecs is preassembled routines, ie. the covers (the code that represents a certain operation, in our case an instruction of the original code) are written in assembly (although I know at least one example where they are hacked in hex code) and assembled during the compilation of the emulator.
Those covers contain only placeholders where register references, addresses, or immediate values should be. After the whole cover has been copied to the target memory location the placeholders have to be patched with the appropriate values. Due to this fact the range of possible instruction formats is somewhat limited unless you want to make the whole process too complicated.
Obviously this method isn't that flexible and also won't lead to optimized code, as you have to translate each instruction all by itself instead of being able to combine two or three instructions to make a faster semantic translation of the code sequence.
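A minimal sketch of the copy-and-patch step in C, under some assumptions: the cover is written as hex bytes here instead of being assembled at build time, and the names cover_load_imm and place_cover_load_imm are made up. The cover is the x86 instruction MOV EAX, imm32 (opcode B8 followed by a 32-bit little-endian immediate), with zeros as the placeholder.

```c
#include <stdint.h>
#include <string.h>

/* Preassembled cover: MOV EAX, imm32 with a zeroed placeholder immediate. */
static const uint8_t cover_load_imm[] = { 0xB8, 0x00, 0x00, 0x00, 0x00 };
enum { COVER_LOAD_IMM_PATCH = 1 };      /* offset of the placeholder */

/* Copy the cover to dst and patch in the real value.
   Returns the number of bytes written. */
size_t place_cover_load_imm(uint8_t *dst, uint32_t value)
{
    memcpy(dst, cover_load_imm, sizeof cover_load_imm);
    /* Patch byte by byte so the result is little-endian on any build host. */
    for (int i = 0; i < 4; i++)
        dst[COVER_LOAD_IMM_PATCH + i] = (uint8_t)(value >> (8 * i));
    return sizeof cover_load_imm;
}
```

Each cover needs its own table of patch offsets, which is one reason why the supported instruction formats tend to stay limited with this approach.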
Other methods typically use code emitter functions for covers. Those functions directly write executable instructions to a memory location. The difference here is how readable the code is. Often it's like this:
emit_4byte(0xE2800000 | 1); /* ADD R0, R0, #1 */
emit_4byte(0xE1A0F00E); /* MOV PC, LR */
Some readers might recognize the ARM example code I was using for the demonstration of the dynamic function call. I only changed it to show how values like the immediate #1 in the ADD instruction tend to be added to such code.
Of course this is already more flexible than preassembled code, but it isn't very readable and even harder to edit. Mind you that you won't even find the comments that tell you which instruction is generated in some example dynarecs I found.
To improve the last method you should use either functions or macros to make the code emitters more readable and also easier to use, which might look like this in the end:
emit_ADDI(REG0, REG0, 1); /* immediate addition */
emit_MOV(REG_PC, REG_LR); /* return */
With a few dozen code emitters like these you should be able to program and edit covers quite easily after some practice, and it is clear what they do even without the comments.
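For illustration, the two emitters used above could be implemented roughly like this for an ARM host. This is only a sketch: the global output pointer emit_ptr and the register constants are assumptions, and the encodings cover just the simplest forms (condition "always", no shifts or rotations, flags not updated).

```c
#include <stdint.h>

enum { REG0 = 0, REG_LR = 14, REG_PC = 15 };

static uint32_t *emit_ptr;              /* where the next instruction goes */

static void emit_4byte(uint32_t word)
{
    *emit_ptr++ = word;
}

/* ADD rd, rn, #imm8 */
void emit_ADDI(unsigned rd, unsigned rn, unsigned imm8)
{
    emit_4byte(0xE2800000u | ((rn & 0xFu) << 16) | ((rd & 0xFu) << 12)
               | (imm8 & 0xFFu));
}

/* MOV rd, rm */
void emit_MOV(unsigned rd, unsigned rm)
{
    emit_4byte(0xE1A00000u | ((rd & 0xFu) << 12) | (rm & 0xFu));
}
```

With emit_ptr pointing at a buffer, emit_ADDI(REG0, REG0, 1) followed by emit_MOV(REG_PC, REG_LR) writes the same two words as the raw emit_4byte example, 0xE2800001 and 0xE1A0F00E.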
This is probably one of the last basics I can explain without using too much assembly. So are there any things you don't understand fully yet and that should be better explained before I continue?
What I have left now are topics about timing issues, specific translation problems, and code optimization, but those are very specific in most cases and maybe not that interesting for most. After that I'll probably have to pass or just answer questions, because I don't really have too much knowledge of more in-depth stuff to continue. I didn't plan to fully explain a dynamic recompiler anyway ;)
N-Rage
June 22nd, 2002, 13:30
Compared to my Knowledge this is In-Depth already ;) . Very good read, this should definitely be a good Emu-Coder resource.
Not really a Question on the Dynarec Stuff, but do You know a table of the X86-instructions( & maybe SSE/ 3dnow/ MMX ) and their Hex-Codes?
n64warrior
June 22nd, 2002, 22:07
ok. then how would i simulate a chip like the RCP which is 2 or more cpu's or gpu's or a cpu and a gpu in 1 chip with fixed instructions and formats?? that's what i really need to know..
N-Rage
June 23rd, 2002, 19:54
Originally posted by n64warrior
ok. then how would i simulate a chip like the RCP which is 2 or more cpu's or gpu's or a cpu and a gpu in 1 chip with fixed instructions and formats?? that's what i really need to know..
You should have an Idea how to emulate a System or how it works or be able (or at least TRY) to reverse it, if not JUST GIVE UP DAMMIT :rolleyes:
What do You want? someone who writes an Emulator for You if you ask the right Questions?
Hope this thread ain't going down now :mad: ...
PS. Found the Hexcodes in a PDF Document at Intel's
n64warrior
June 23rd, 2002, 22:58
i am not asking anyone to write an emulator for me . i just want to learn as much info about emulation as i can, so i can get tooie working and get some ideas on how to emulate it. but still my main goal is to get really good at programming software for cross systems, so that linux and apple software can run on windows on a Pentium2 at good speed. and i also want to make an xwindowsxp that would be an open source project. but that's all later. for now i just want to get good at n64 emulation and at least make one good video plugin.
Raziel
June 24th, 2002, 00:28
Wow, so many projects :lol: .
M.I.K.e7
June 24th, 2002, 07:46
Originally posted by N-Rage
Compared to my Knowledge this is In-Depth already ;) . Very good read, this should definitely be a good Emu-Coder resource.
Everything is relative of course. I still feel that I'm in need of a lot of research.
Not really a Question on the Dynarec Stuff, but do You know a table of the X86-instructions( & maybe SSE/ 3dnow/ MMX ) and their Hex-Codes?
Since the encoding of the x86 instructions is rather complicated, I doubt that there is a simple table for these.
Just a small example: "MOV EAX, EBX" and "MOV AX, BX" have the same encoding; the correct instruction can only be identified by the context, ie. which mode (32-bit or 16-bit) the processor is currently in. If you mean the other instruction you have to put the prefix byte hex 66 before the encoded instruction.
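The MOV example can be written down as a tiny encoder in C. This is a sketch that only handles register-to-register moves using opcode hex 89 (there is a second valid encoding with opcode hex 8B); the function name and its parameters are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { EAX = 0, ECX = 1, EDX = 2, EBX = 3 };  /* also the numbers of AX, CX, DX, BX */

/* Encode MOV dst, src between two general registers.
   default_bits: operand size of the current mode (16 or 32).
   operand_bits: operand size that is wanted (16 or 32).
   Returns the number of bytes written to out (2 or 3). */
size_t encode_mov_reg(uint8_t *out, unsigned dst, unsigned src,
                      int default_bits, int operand_bits)
{
    size_t n = 0;
    if (operand_bits != default_bits)
        out[n++] = 0x66;                /* operand-size override prefix */
    out[n++] = 0x89;                    /* MOV r/m, r */
    out[n++] = (uint8_t)(0xC0 | ((src & 7u) << 3) | (dst & 7u)); /* ModRM, register form */
    return n;
}
```

In 32-bit mode, MOV EAX, EBX comes out as 89 D8 and MOV AX, BX as 66 89 D8; in 16-bit mode it is the other way round, so a flat table of instructions and their hex codes cannot capture this.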
I would have directed you to the official Intel Instruction Set Reference (http://developer.intel.com/design/pentium4/manuals/245471.htm), but it seems that you found that already.
I guess AMD will have something similar to find out more about 3DNow!, but I'm not totally sure.
M.I.K.e7
June 24th, 2002, 08:18
Originally posted by n64warrior
ok. then how would i simulate a chip like the RCP which is 2 or more cpu's or gpu's or a cpu and a gpu in 1 chip with fixed instructions and formats?? that's what i really need to know..
First of all, I meant this as a general discussion about dynamic recompilation and I want to keep it clean from references to specific systems, which is why I won't turn this into an N64 emulation discussion. If you want to know more about that you'd better open a new discussion thread and/or look for answers here:
http://www.classicgaming.com/epr/n64.htm
http://e64.wwemu.com/emus/1964/1964_recompiler_doc_101.pdf
http://1964emu.emulation64.com/
http://www.pj64.net/code/mainiframe_downloads.htm
Secondly, according to my information the RCP isn't a normal CPU but rather a video/audio chip combination, where dynamic recompilation doesn't make sense because there isn't any program code you could translate and cache.
You might try to cache display lists and their effects, but unlike program code blocks, display lists aren't necessarily linked to specific addresses, so recognizing an already encountered display list will be slow. Not only due to this fact, I assume that you won't get any speed increase with such a technique.
If you simply mean the synchronization of the CPU and the rest of the system, then I'll cover that topic in "Timing Issues" a bit later...
i am not asking anyone to write an emulator for me . i just want to learn as much info about emulation as i can, so i can get tooie working and get some ideas on how to emulate it.
I guess you're just one of those guys who are pissed off that the only N64 emulator that is fast enough on your system to be playable is UltraHLE, but that uses lots and lots of dirty tricks to be able to run the games that fast at the expense of compatibility, which means that only few titles work.
Now you probably think that all the other N64 emulator authors just suck, and you'd be able to do a lightning fast emulator with top notch compatibility all by yourself, while you don't have enough knowledge to do that and the main problem simply is that your PC is too slow.
but still my main goal is to get really good at programming software for cross systems, so that linux and apple software can run on windows on a Pentium2 at good speed. and i also want to make an xwindowsxp that would be an open source project. but that's all later.
Good luck, but I guess you won't finish any of these projects in the next 5 or 10 years...
I had the idea of writing a VDM (Virtual DOS Machine) for BeOS to be able to run simple DOS tools under my favourite operating system. Apart from the use of the V86 mode of the processor this should be almost trivial compared to what you have on your mind, and I am still at the research stage after I thought of the VDM several months ago, and I'll probably drop the whole idea because of more important things I could do.
for now i just want to get good at n64 emulation and at least make one good video plugin.
It's certainly not the right thread to discuss that here...
M.I.K.e7
June 24th, 2002, 13:22
When it comes to handling multiple chip functions in emulation one might think that with all the multi-tasking operating systems the different functions could run in parallel as separate threads. But it will be very hard to synchronize the timing between the parts, which will either lead to weird effects or the emulation won't work well at all. The only thing that can probably run in a separate thread is the user interface IMO, where the system can buffer user input for later use.
So how does synchronization work in an emulator?
A traditional emulator "executes" one instruction after another, then checks the clock cycles it would have taken on the real machine. This is done for two reasons: first of all, when you know how much time it takes to perform the instruction on the real processor you can adjust the timing of the emulator so that it runs just as fast as the real system, and secondly, after a certain number of clock cycles have been processed it is time to refresh the display, output some sound, etc. The number of clock cycles that have to pass before the emulator has to process the other tasks totally depends on the emulated system, so I won't go into detail here.
In some rare cases, when the emulation has to be very exact, a so-called single-cycle emulation is used. In that case not the whole instruction is emulated, but only a small part of it during each clock cycle, to get the timing of the emulator as close to the original system as possible. This is very slow and complicated of course, and the total opposite of dynamic recompilation, which is meant to be fast and not totally exact. Single-cycle emulation is used by some C64 emulators, for example.
How does synchronization work in a dynamic recompiler?
Basically it's the same way, only that you execute blocks instead of single instructions or even single cycles, but you still count the clock cycles the same block would have needed on the real machine to synchronize the timing of the CPU emulation with the remaining emulation.
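A rough C sketch of that block-level cycle counting, modelled on the interpreter loop from Marat's HOWTO. The Block structure, the indexing of blocks by guest PC, and the function name are simplifying assumptions; instead of calling generated code, the sketch just follows a stored "next PC", and it assumes every block takes at least one cycle.

```c
#include <stdint.h>

typedef struct {
    int      cycles;    /* guest clock cycles the whole block would take */
    uint32_t next_pc;   /* stand-in for running the block: where it ends up */
} Block;

/* Run blocks from pc until total_cycles guest cycles have elapsed.
   Every period cycles the periodic work (display, sound, interrupts) is due;
   here it is only counted. Returns how often it became due. */
int run_blocks(const Block *blocks, uint32_t pc, int period, int total_cycles)
{
    int counter = period;
    int elapsed = 0;
    int events = 0;
    while (elapsed < total_cycles) {
        const Block *b = &blocks[pc];   /* a real lookup would translate on a miss */
        pc = b->next_pc;                /* a real dynarec would call the generated code */
        counter -= b->cycles;
        elapsed += b->cycles;
        if (counter <= 0) {             /* can only be checked between blocks */
            events++;                   /* refresh display, output sound, ... */
            counter += period;
        }
    }
    return events;
}
```

The check happens only at block boundaries, so the periodic work can be late by up to one block length, which is the inexactness discussed next.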
Depending on how exact the emulation has to be, some blocks might take too many clock cycles. If that happens you'll need to find out how many clock cycles can pass without running into a problem, and basically split the block into smaller parts during translation to stay below that cycle limit. This means that you can perform fewer optimizations because the blocks are smaller, and when you work with dynamic register allocation you have to generate prologue/epilogue code for each smaller block.
The most extreme solution that Neil Bradley came up with, if you should need very exact timing when using a dynarec, is that you basically have one original instruction per block. To keep the overhead low you should only do it with static register allocation, or you'll end up with more prologue/epilogue code than actual translations.
But instead of returning to the dispatcher loop as would be normal with translated blocks, you decrease a cycle counter and only jump to the exit code when the counter becomes negative. It looks a bit like this in x86 code:
; translated instruction here
sub ecx, cycles
js exitcode_1234
; next translated instruction
You need separate exit code for each instruction, but when you use static register allocation this won't be more than writing the current program counter address of the emulated system to an appropriate variable, so that the emulator knows where the dynarec left the block.
When you also use the paged translation map you can address each translated instruction individually, so you don't have blocks anymore but linear code that can be left at any moment when it is time to do so.
One of the advantages is that you can actually make branches inside the generated code, because even when it would be an endless loop the execution thread would leave the block after a certain number of clock cycles.
Since you'd most likely still translate block for block instead of the whole executable code (which is even impossible to find when you have indirect jumps), there are cases when you translate a block in the dispatcher that another block just branched to. In that case you could patch the previous block not to leave to the dispatcher but jump to the newly translated block directly.
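The patching of the previous block can be simulated in plain C. This is only a model of the idea: a pointer field stands in for the jump that would be patched into the generated code, the "guest program" is a made-up loop of three blocks, each block has a single fixed branch target, and the cycle check that would still force an exit is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 8

typedef struct Block {
    uint32_t guest_pc;      /* guest address the block was translated from */
    uint32_t target_pc;     /* guest address the block branches to at its end */
    struct Block *link;     /* NULL: leave to the dispatcher; else jump directly */
} Block;

static Block cache[MAX_BLOCKS];
static size_t cache_used;

static Block *find_block(uint32_t pc)
{
    for (size_t i = 0; i < cache_used; i++)
        if (cache[i].guest_pc == pc)
            return &cache[i];
    return NULL;
}

/* Stand-in for the translator: in this made-up guest program the block at
   address pc always branches to (pc + 1) % 3. */
static Block *translate(uint32_t pc)
{
    if (cache_used == MAX_BLOCKS)
        return NULL;
    Block *b = &cache[cache_used++];
    b->guest_pc = pc;
    b->target_pc = (pc + 1) % 3;
    b->link = NULL;
    return b;
}

/* Execute steps blocks starting at pc; returns how often the dispatcher ran. */
int run(uint32_t pc, int steps)
{
    int dispatches = 0;
    Block *prev = NULL;
    while (steps > 0) {
        dispatches++;
        Block *b = find_block(pc);
        if (b == NULL)
            b = translate(pc);
        if (b == NULL)
            break;                      /* cache full; a real dynarec would flush */
        if (prev != NULL && prev->link == NULL)
            prev->link = b;             /* patch the block that branched here */
        while (steps > 0) {
            steps--;                    /* "execute" block b */
            prev = b;
            pc = b->target_pc;
            if (b->link == NULL)
                break;                  /* unpatched exit: back to the dispatcher */
            b = b->link;                /* patched exit: stay in generated code */
        }
    }
    return dispatches;
}
```

Starting with an empty cache, run(0, 10) enters the dispatcher four times: three times to translate the blocks and once more to close the loop. After that all exits are patched, and a further call needs the dispatcher only once to get started.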
This whole idea - I hope I made it clear - should provide the best timing when using a dynamic recompiler, and it even has the advantage that it only returns to the dispatcher loop of the emulator when it has to due to clock cycles being "used up", but otherwise it runs totally on generated code.
No light without shadow, of course. The disadvantages are that you have to work with static register allocation (although some might not see that as a flaw) because the memory overhead for dynamic register allocation would be too big, and since you translate the instructions out of context you cannot use any optimization at all.
Use this technique if you really need instruction-exact timing, otherwise it might be a bit extreme.
Nezzar
June 24th, 2002, 17:09
Do I see a new emulator coming? :p
n64warrior
June 24th, 2002, 22:34
my projects are about learning how to do it now. my xwindowsxp will start in 1 year or so. maybe less or maybe not. i will be using the xwindows source code to start the xwindowsxp project. as i said right now i am learning how to make my own video plugin in glide.
M.I.K.e7
June 25th, 2002, 11:19
Originally posted by Nezzar
Do I see a new emulator coming? :p
Not from me in the near future, because I still have way more interest in different processor architectures than graphics and sound.
But I think of writing a real document about the basics of dynamic recompilation, and I also have some changes and additions for http://www.dynarec.com/ in mind.
The problem is that I also should rework my German BeOS FAQ, and probably even spend my precious free time with more entertaining activities... So much to do and so little time...
M.I.K.e7
June 25th, 2002, 11:22
Originally posted by n64warrior
my projects are about learning how to do it now. my xwindowsxp will start in 1 year or so. maybe less or maybe not. i will be using the xwindows source code to start the xwindowsxp project. as i said right now i am learning how to make my own video plugin in glide.
Good luck with that, but it still has nothing to do with dynamic recompilation at all.
doggieiscool
January 7th, 2003, 00:03
this seems to be down all the time... dynamic.com that is
ShADoWFLaRe85
January 7th, 2003, 05:04
I find it amazing how one even finds the time to dig up such ancient threads. :p
doggieiscool
January 8th, 2003, 03:08
maybe cause i haven't been here long and i am interested
master.
May 13th, 2009, 03:56
Thanks a lot dude, i really appreciate that. by the way the site: http://www.dynarec.com/~mike/drfaq.html won't open! did you make it? did you write your own dynarec document? if so i would be grateful if you shared it with me. i really wanna know more on the subject, preferably using a simple example demonstrating the stuff!
Proto
May 13th, 2009, 04:15
You might want to check the date this thread was posted though. However, given that there were some other thread talking about this topic I guess this revival will help some people.
It was my topic, and I agree, it might be handy to keep a link to this thread in my thread. No need to let it get lost. :p
master.
May 13th, 2009, 12:09
by the way guys, can anyone give me the threads you are talking about ?
and to let you know, google brought me here! i didnt know we have such discussions happening here!
M.I.K.e7
October 31st, 2009, 23:10
Since there still seems to be some interest in this topic and this forum thread is the third hit when you search for "dynamic recompilation" on google, I guess I should add some more recent information.
First of all, dynarec.com is long dead. Neil Bradley didn't renew the domain registration, so instead it is used by one of those typical useless sites. Unfortunately all the material on there is gone too. I think I should have a backup of the FAQ somewhere, but it's not that impressive anymore.
I probably should combine the old FAQ and what I wrote here, make the text better and add some diagrams, but time is always an issue.
Meanwhile there are some interesting PDFs you might want to check out instead:
Study of the Techniques for Emulation Programming (http://people.ac.upc.edu/vmoya/docs/emuprog.pdf); like the name says, this document by Victor Moya covers a lot of topics around emulation, including binary translation. You might want to visit his page about emulation (http://people.ac.upc.edu/vmoya/emulation.html) as well.
David Sharp has written a report about Tarmac, a dynamically recompiling ARM emulator (http://davidsharp.com/tarmac/index.html), which was his university project. Parts of it were later used in the Archimedes/RiscPC emulator Red Squirrel. It's a bit older, but it should be interesting, especially for beginners.
A newer university report is Michael Steil's paper on Dynamic Re-compilation of Binary RISC Code for CISC Architectures (http://softpear.sourceforge.net/down/steil-recompilation.pdf).
I hope these resources provide a more in-depth view of this topic.
Exophase
October 31st, 2009, 23:58
This new paper once again makes the same logical fallacy of applying the "90/10" rule to so-called "hotspot" recompilers. Even code that is executed 10% of the time will still quickly execute enough times to be recompiled. Comparing compile time to run time is really usually comparing an O(1) problem to an O(n) problem, where the latter never wins asymptotically. Translating code that only executes once or a few times doesn't hurt you very much for the precise reason that this code only executes once or a few times, and that these tend to be clustered in between real in-flight sequences where the program is transitioning anyway. This is especially true for games.
Hotspot recompilers are useful for decreasing memory used, although there are other techniques for doing this and on a modern OS/platform unused code blocks can eventually get swapped out anyway. They bring in the extra overhead of needing an interpreter in the first place, not to mention the storage for tracking block execution before the blocks are recompiled. I have yet to see solid empirical evidence that this approach gives an asymptotic performance advantage, or even a better experience for the user... the only comparison I've seen is of Java's first JIT vs Hotspot, which is pretty flawed given the number of unrelated improvements that have been made since then. It's also more of a big deal in JVM because Javac does almost no optimization and relies on the JVM to do a lot of expensive optimizations that you mostly wouldn't want to do in a normal recompiler targeting normal binaries that have been optimized by normal compilers.
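The O(1) versus O(n) argument can be put into a small cost model. This is a sketch under the simplifying assumption that translating a block costs a fixed amount once, while interpreting it and running its translation each cost a fixed amount per execution; the function name and the cost units are invented for the example.

```c
/* Returns the smallest number of executions for which translating once and
   then running native code is cheaper than interpreting every time, or 0 if
   translation never pays off. All costs are in the same arbitrary unit. */
unsigned long break_even_runs(unsigned long translate_cost,
                              unsigned long interp_cost_per_run,
                              unsigned long native_cost_per_run)
{
    if (interp_cost_per_run <= native_cost_per_run)
        return 0;
    return translate_cost / (interp_cost_per_run - native_cost_per_run) + 1;
}
```

With made-up numbers, a translation cost of 1000 units against 50 units per interpreted run and 5 units per native run, break_even_runs(1000, 50, 5) gives 23. Any block inside a loop passes such a count almost immediately, and a block that runs only once costs the extra translation time only once.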
M.I.K.e7
November 1st, 2009, 01:13
You're right, HotSpot techniques only save memory and maybe a little time, because they don't translate blocks that are only executed once. But since such blocks are most likely found in the initialization part of an application, it doesn't really matter that much, if the pure interpretation is faster or not.
As for the use of two emulation techniques in one emulator, most dynarec cores do have an interpretive brother, pretty much due to the fact that most programmers don't start out with the dynarec right away and like to test the result of the translated code against the interpretation. In some cases it makes sense to leave the interpreter in the release version as an option for software that doesn't run with the dynarec too well (eg. due to lots of self-modifying code), but I wouldn't use them at the same time either.
The JVM is a pet peeve of mine. Sun sell HotSpot as a great advantage, I see it as a correction of a big mistake. If they want to optimize the code produced by the JIT, they need to know the boundaries of the basic blocks. Unfortunately they don't mark the beginning of basic blocks in the byte code (despite the fact that the compiler is aware of them), so they have to search for those beginnings first (the end of a basic block is any jump or branch, of course).
The other problem with the JVM is that it's a stack machine. Virtually all CPUs nowadays are register machines. If they were using a register based VM, the JIT would just have to map the virtual registers to real ones, with occasional remapping, if there aren't enough registers. Instead they are using a VM technology from 1963 and either don't do any register allocation at all, or have to perform an expensive graph-coloring algorithm during runtime, which is normally done during the compiler run.
BTW, FX!32, a tool that statically recompiled x86 code on the Alpha version of Windows NT, first interprets the code, gathers profiling information, and launches the translator after the first run has been finished.
At first I thought that the profiler might search for a lot of information, but when I dug deeper to find more specific documentation, I found out that it's rather mundane. It only records three types of data:
1. Jump target addresses, ie. basic blocks.
2. Unaligned memory accesses, because unlike on x86 they have to be handled specifically on Alpha.
3. Register indirect jumps, because they cannot be statically translated and you need to fall back to the interpreter for them.
So, in the end everything is about basic blocks, when you try to perform an optimized translation.
Exophase
November 1st, 2009, 04:54
You're right, HotSpot techniques only save memory and maybe a little time, because they don't translate blocks that are only executed once. But since such blocks are most likely found in the initialization part of an application, it doesn't really matter that much, if the pure interpretation is faster or not.
Yeah, good to see someone who agrees. It feels like a lot of dynarec papers are recycling the exact same material from the previous one, then adding new things. Not that that's so awful, but a few bad points keep being reiterated.
As for the use of two emulation techniques in one emulator, most dynarec cores do have an interpretive brother, pretty much due to the fact that most programmers don't start out with the dynarec right away and like to test the result of the translated code against the interpretation. In some cases it makes sense to leave the interpreter in the release version as an option for software that doesn't run with the dynarec too well (eg. due to lots of self-modifying code), but I wouldn't use them at the same time either.
Yeah, starting with an interpreter is definitely the way to go and I have for at least one of the emulators I've written with a recompiler. Ideally the interpreter should never be superior in release code - even very routine self-modifying code can be dealt with better than interpretation straight out most of the time, if you have a suitable strategy for dealing with it. On the other hand, if only very small regions are being modified (not entire blocks) you can black-list the opcodes and pass them through an interpreter, in which case it would make sense to have both active at the same time. Most games don't do things like this anymore, but I know I saw games on GBA that did.
Still, if you can manage to not include it in the release code it wouldn't hurt. Interpreters can be several thousand lines of code and I suppose several dozen KB.. which really isn't that much, but still. Probably nothing compared to any notable amount of recompiled code.
The JVM is a pet peeve of mine. Sun sell HotSpot as a great advantage, I see it as a correction of a big mistake. If they want to optimize the code produced by the JIT, they need to know the boundaries of the basic blocks. Unfortunately they don't mark the beginning of basic blocks in the byte code (despite the fact that the compiler is aware of them), so they have to search for those beginnings first (the end of a basic block is any jump or branch, of course).
I'm not that familiar with what the various JVM JIT strategies are like, at least not in detail. But it seems to me that JVM bytecode is going to be so structured and well behaved that you should be able to compile whole methods at a time. Nothing's going to jump into the middle of the method, if's are only going to branch forward, loops only going to branch backwards, their starts won't overlap, and breaks go to the same point where loops exit. This is especially true if the JVM marks these control structures clearly. It shouldn't have that much of a problem finding the exit points of a method, I would expect, since there has to be an instruction for that.
Then again, since you said beginnings of basic blocks aren't marked I'm going to guess that includes the beginnings of if statements and loops too. If so you're right, that is pretty silly.
The other problem with the JVM is that it's a stack machine. Virtually all CPUs nowadays are register machines. If they were using a register based VM, the JIT would just have to map the virtual registers to real ones, with occasional remapping, if there aren't enough registers. Instead they are using a VM technology from 1963 and either don't do any register allocation at all, or have to perform an expensive graph-coloring algorithm during runtime, which is normally done during the compiler run.
Being a stack machine makes it more compact, arguably. Of course, having to recompile in the first place throws that right out the window. Rather absurdly, a lot of embedded platforms don't even use JVM JIT and are relying on interpreters because of memory issues. So just where it needs the speed the most it doesn't get any. I don't think I have to tell you how little Java chip technologies caught on, despite being available in a lot of ARMs, which embedded platforms usually have.
And you're right, this means that JVM JIT is even more expensive than it has to be. But nevermind the lack of register allocation (to be fair, they'd have to redo that at runtime regardless because it's 100% dependent on the number of registers available). I'm more disgusted with the outright lack of optimizations entirely. Javac is barely more than a tokenizer - okay, yeah, it does flatten down expressions into stack format but it really does surprisingly little. I hear that you can get back perfectly genuine Java code from bytecode most of the time, hence the availability of Java obfuscating programs. So not only is Java forcing a lot of expensive optimizations on the JVM but it's giving it less information to work with in the bytecode format. The whole thing is a mess, and it's a good reason not to compare JVM JIT to recompilers - aside from being stack based like nothing is, like you mentioned, JIT really is more like a compiler than a recompiler. And it has given people the misconception that recompilers need a lot of time to do a good job.
BTW, FX!32, a tool that statically recompiled x86 code on the Alpha version of Windows NT, first interprets the code, gathers profiling information, and launches the translator after the first run has been finished.
At first I thought that the profiler might search for a lot of information, but when I dug deeper to find more specific documentation, I found out that it's rather mundane. It only records three types of data:
1. Jump target addresses, ie. basic blocks.
2. Unaligned memory accesses, because unlike on x86 they have to be handled specifically on Alpha.
3. Register indirect jumps, because they cannot be statically translated and you need to fall back to the interpreter for them.
So, in the end everything is about basic blocks, when you try to perform an optimized translation.
So what does that tell us? FX!32 may have been better off being fully dynamic. People argue a lot about static recompilers because they think that'll produce better code and provide advantages over dynarecs.
Unaligned accesses can cause interrupts on most platforms where they're not supported outright, Alpha is no exception. Could have easily relied on this and patched the code at runtime as they occurred, or only after many of them occur from the same location.
One argument I see a lot is that static is preferable because you just have to load recompiled code from disk. Actually, it's entirely possible that loading code from disk will be slower than recompiling it (if you're in a scenario where you have to load the original no matter what, which is usually the case).
One other thing - people often assume that dynarec means basic block at a time. In reality, a dynarec can go as recursively deep as you want (my GBA dynarec will follow all unresolved branch targets, and I've never seen this taking too much time). Once you do this you can do some pretty good global-ish optimizations without actually tracking a lot of additional data nor adding very much additional time. For instance, if you track the liveness state of registers/flags at the beginning of blocks then you:
- scan a block, tracking all dependencies and modifications and branch targets as you go
- resolve all branch targets and recompile those blocks if they don't exist
- merge in liveness information from freshly recompiled blocks and perform liveness analysis
This is especially useful for dynamic register allocation, where you won't have to write back registers if they're dead at the start of the linked in block. This will for instance occur when calling functions that are modifying caller-save registers. It's also useful for dead-flag elimination, for instance if you have the following sequence
reg = reg - 1; //sets flags
if(reg != 0) goto label;
...
If it happens that all other flags are dead after the fallthrough it means only the zero flag would be live here.
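To make the dead-flag idea a bit more concrete, here is a minimal sketch of the backwards scan over one block. The flag bits, the Insn record and the function name are all made up for illustration; a real recompiler would fill flags_read and flags_written from its decoder and take live_out from the liveness info of the successor blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits and instruction record, for illustration only. */
enum { FLAG_Z = 1, FLAG_N = 2, FLAG_C = 4, FLAG_V = 8, FLAG_ALL = 15 };

typedef struct {
    uint8_t flags_read;    /* flags this instruction consumes            */
    uint8_t flags_written; /* flags this instruction would set           */
    uint8_t flags_needed;  /* output: flags that really must be computed */
} Insn;

/* Walk the block backwards. live_out is the set of flags still live
 * when control leaves the block; the return value is the set that is
 * live on entry to the block. */
uint8_t eliminate_dead_flags(Insn *block, size_t count, uint8_t live_out)
{
    uint8_t live = live_out;
    assert(block != NULL || count == 0);
    for (size_t i = count; i-- > 0; ) {
        Insn *in = &block[i];
        /* Only flags that somebody reads later need to be generated. */
        in->flags_needed = in->flags_written & live;
        /* Flags written here are dead above; flags read here are live above. */
        live = (uint8_t)((live & ~in->flags_written) | in->flags_read);
    }
    return live;
}
```

For the subtract-and-branch pair above, with no flags live after the fallthrough, the subtract ends up with flags_needed equal to FLAG_Z only.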
The only problem is dealing with recursive links. IE, if one of the blocks you're going to recompile links to the block you just recompiled. You have to avoid the infinite recursion, obviously, but if you haven't finished recompiling the dependent blocks you'll have no way of knowing where to link to. I think a simple solution is to just keep a list of these unresolved links and go in and patch them in a second pass once your recompilation has exited the top level of the recursion, so you kind of end up with a compilation and linking stage, only the compilation does as much linking as it can directly. Oh, and if the recursive case happens you have to use really naive/pessimistic liveness info, but that's not a big deal (I mean, just call all the registers/flags live, or maybe do an early guess to determine what definitely isn't)
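A rough sketch of that compile-then-link scheme, under a lot of simplifying assumptions: blocks are just array entries, a "host address" is a fake counter that is only known once a block has finished, and all names here are hypothetical. Links to blocks that are still in progress are deferred onto a list and patched once the top-level call returns.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 16
#define NO_TARGET  (-1)

typedef enum { NOT_COMPILED, IN_PROGRESS, DONE } State;

typedef struct {
    int   target[2];  /* guest branch targets (block ids) or NO_TARGET */
    int   link[2];    /* host "address" each branch got linked to      */
    int   host_addr;  /* only valid once state == DONE                 */
    State state;
} Block;

typedef struct { int from, slot; } Patch;

Block  blocks[MAX_BLOCKS];
Patch  pending[MAX_BLOCKS * 2];  /* every branch slot is deferred at most once */
size_t pending_count;
int    next_host_addr = 1000;    /* stand-in for the code buffer position */

void reset_blocks(void)
{
    for (int i = 0; i < MAX_BLOCKS; i++) {
        blocks[i].target[0] = blocks[i].target[1] = NO_TARGET;
        blocks[i].link[0] = blocks[i].link[1] = 0;
        blocks[i].host_addr = 0;
        blocks[i].state = NOT_COMPILED;
    }
    pending_count = 0;
}

void compile(int id)
{
    Block *b = &blocks[id];
    b->state = IN_PROGRESS;              /* guards against infinite recursion */
    for (int s = 0; s < 2; s++) {
        int t = b->target[s];
        if (t == NO_TARGET) continue;
        if (blocks[t].state == NOT_COMPILED) compile(t);
        if (blocks[t].state == DONE) {
            b->link[s] = blocks[t].host_addr;      /* link directly */
        } else {
            /* Target is an ancestor that is still being compiled:
             * remember the link and patch it in the second pass. */
            pending[pending_count].from = id;
            pending[pending_count].slot = s;
            pending_count++;
        }
    }
    b->host_addr = next_host_addr;       /* pretend code was emitted here */
    next_host_addr += 100;
    b->state = DONE;
}

void recompile_from(int entry)
{
    assert(entry >= 0 && entry < MAX_BLOCKS);
    if (blocks[entry].state == NOT_COMPILED) compile(entry);
    /* Linking stage: every block reachable from entry is DONE now. */
    for (size_t i = 0; i < pending_count; i++) {
        Block *b = &blocks[pending[i].from];
        int s = pending[i].slot;
        b->link[s] = blocks[b->target[s]].host_addr;
    }
    pending_count = 0;
}
```

The deferred case is also the spot where the pessimistic liveness assumption would go, since nothing is known yet about the block being linked to.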
M.I.K.e7
November 1st, 2009, 13:37
Yeah, good to see someone who agrees. It feels like a lot of dynarec papers are recycling the exact same material in the previous one, then adding new things. Not that that's so awful, but a few bad points keep being reiterated.
It's like always, everyone is quoting everyone else. Remember when it was revealed that the amount of iron in spinach is much lower than originally thought? Someone made an error during the calculation and everyone else was quoting the result without rechecking it. People are just lazy, and that's also true for me. How often did I think that I should write a better introduction on dynamic recompilation, but it was just too tempting to spend my spare time with less intellectual topics.
At least I try to keep an open mind. There is no pure right or wrong with this topic. That's why I tried to cover so many different options, when I initially started this thread. For example, I like dynamic register allocation, but if someone proves to me that in his case static allocation is faster, then why should I argue against it? You make a dynarec because you want the speed, so speed almost always wins.
HotSpot probably is one of the few cases where a slight speed increase doesn't really justify including an additional emulation engine.
BTW, I have to admit that I only browsed through Michael Steil's work, but wanted to include it because it at least looks interesting and it's good to provide different views.
I do know David and Victor, though. I've known David for several years before his project, and Neil Bradley started his dynarec mailing-list (not the Yahoo one) because he wanted to have a discussion with Victor and me at the same time. Victor would be the first to admit that his English isn't perfect, but I think his knowledge and ideas are good enough to look past that.
On the other hand, if only very small regions are being modified (not entire blocks) you can black-list the opcodes and pass them through an interpreter, in which case it would make sense to have both active at the same time. Most games don't do things like this anymore, but I know I saw games on GBA that did.
Self-modifying code is mostly something from the "old ages". Today with split caches it's rather bad for the performance on most processors. A lot of 8-bit software seems to do it, but I don't really see the point of doing a dynarec for an 8-bit system. Even ten years ago practically every half-decent computer was capable of running an 8-bit system full-speed in an interpreter. All those 8-bit dynarecs (I know of at least two or three) might be fun, but they aren't really worth the hassle. Not to mention that those systems often needed cycle-exact emulation, which is much easier to achieve with interpretation anyway.
I guess nowadays self-modifying code should be the biggest problem on embedded/handheld systems that hold the code on a media type where "execute in place" is possible (namely ROM or NOR Flash), but probably slower than RAM.
The GBA comes to mind, because it has 16-bit ROM access, but 32-bit work RAM. I've never programmed the GBA, but I would execute Thumb code from the ROM directly and copy the few ARM routines to the WRAM for speed. Otherwise you'd need two memory accesses for each ARM instruction, which practically nullifies the speed advantage.
I think I read a blog entry or version information about a GBA emulator, where self-modifying code was a problem in at least one case. Given your posting here, I wonder if that emulator could have been yours.
Of course, unless the system has lots of RAM and/or runs only one application, almost every modern system replaces code in RAM every now and then. But due to memory protection on those systems, code and data are typically in different memory pages. Thus it's quite easy to decide when to remove some code from the cache: If a page that originally held code is written to, it's most likely that the code in that page was modified, so better recompile it.
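A minimal sketch of that page-based check, assuming a flat 1 MB guest RAM with 4 KB pages and a write handler that every guest store goes through. All names are hypothetical, and the actual discarding of translated blocks is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT     12                        /* 4 KB pages (assumed) */
#define GUEST_RAM_SIZE (1u << 20)                /* 1 MB guest RAM (assumed) */
#define NUM_PAGES      (GUEST_RAM_SIZE >> PAGE_SHIFT)

uint8_t  guest_ram[GUEST_RAM_SIZE];
bool     page_has_code[NUM_PAGES];  /* set when a block was translated from the page */
unsigned invalidations;             /* stand-in for "flush blocks of this page" */

/* Called by the recompiler whenever it translates code at guest_addr. */
void note_block_translated(uint32_t guest_addr)
{
    assert(guest_addr < GUEST_RAM_SIZE);
    page_has_code[guest_addr >> PAGE_SHIFT] = true;
}

/* Every emulated store goes through here. */
void guest_write8(uint32_t addr, uint8_t value)
{
    uint32_t page;
    assert(addr < GUEST_RAM_SIZE);
    guest_ram[addr] = value;
    page = addr >> PAGE_SHIFT;
    if (page_has_code[page]) {
        /* The page held translated code, so assume it was modified.
         * A real dynarec would throw away the cached blocks here. */
        page_has_code[page] = false;
        invalidations++;
    }
}
```

One thing the sketch ignores is a block that straddles two pages; such a block would have to be registered for both.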
Still, if you can manage to not include it in the release code it wouldn't hurt. Interpreters can be several thousand lines of code and I suppose several dozen KB.. which really isn't that much, but still. Probably nothing compared to any notable amount of recompiled code.
Most likely true. Very little software should have an initialization section of the complexity of an interpretive CPU emulator.
I'm not that familiar with what the various JVM JIT strategies are like, at least not in detail.
I have to admit that my information is pretty outdated by now, too. But I guess that due to the all-important compatibility those parts of the JVM most likely haven't really changed.
Then again, since you said beginnings of basic blocks aren't marked I'm going to guess that includes the beginnings of if statements and loops too. If so you're right, that is pretty silly.
I believe that's the case. But I have nothing against being proven wrong.
Being a stack machine makes it more compact, arguably.
Yes, the code density for stack architectures is better, simply because no operation needs to include arguments, they're simply on the stack.
But almost all newer VMs have switched to register architectures for speed, like the Lua VM or Parrot.
Rather absurdly, a lot of embedded platforms don't even use JVM JIT and are relying on interpreters because of memory issues. So just where it needs the speed the most it doesn't get any. I don't think I have to tell you how little Java chip technologies caught on, despite being available in a lot of ARMs, which embedded platforms usually have.
I do know of Jazelle, of course. But I'm not sure how fast it is compared to a JIT. From what I've read, some of the more complex opcodes have to be interpreted in software anyway.
Although I do actually work with embedded systems, I didn't have the need for Java yet. We still program all our controllers in C. Well, AVRs or PICs aren't really the best target for Java, and all the 32-bit systems ran with Linux.
But nevermind the lack of register allocation (to be fair, they'd have to redo that at runtime regardless because it's 100% dependent on the number of registers available).
I saw register allocation from a register VM to a real processor like page replacement. Allocate as many registers, as you can, by simply mapping the next virtual register to the next free hardware register. This will probably work for a high amount of the translation units. If you should run out of hardware registers, use a LRU or cheaper second-chance algorithm to replace one of the already mapped registers. What works for page replacement in virtual memory, could also work for dynamic register allocation.
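Just to illustrate the page replacement analogy, a small sketch of such an allocator using the second-chance (clock) scheme. The number of host registers, the RegMap structure and the function names are assumptions for the example; where a real dynarec would emit a store to write the evicted guest register back, the sketch only counts a spill.

```c
#include <assert.h>
#include <stdbool.h>

#define HOST_REGS 4      /* assumed number of allocatable host registers */
#define UNMAPPED  (-1)

typedef struct {
    int  guest[HOST_REGS];       /* guest register held by each host register */
    bool referenced[HOST_REGS];  /* the "second chance" bit */
    int  hand;                   /* clock hand */
    int  spills;                 /* how often a mapping had to be evicted */
} RegMap;

void regmap_init(RegMap *m)
{
    for (int i = 0; i < HOST_REGS; i++) {
        m->guest[i] = UNMAPPED;
        m->referenced[i] = false;
    }
    m->hand = 0;
    m->spills = 0;
}

/* Returns the host register index that holds guest_reg, mapping it if needed. */
int regmap_get(RegMap *m, int guest_reg)
{
    int victim;
    assert(guest_reg >= 0);
    for (int i = 0; i < HOST_REGS; i++)          /* already mapped? */
        if (m->guest[i] == guest_reg) { m->referenced[i] = true; return i; }
    for (int i = 0; i < HOST_REGS; i++)          /* next free host register */
        if (m->guest[i] == UNMAPPED) {
            m->guest[i] = guest_reg;
            m->referenced[i] = true;
            return i;
        }
    while (m->referenced[m->hand]) {             /* clear bits until a victim is found */
        m->referenced[m->hand] = false;
        m->hand = (m->hand + 1) % HOST_REGS;
    }
    victim = m->hand;
    m->spills++;   /* a real dynarec would emit the write-back of the old value here */
    m->guest[victim] = guest_reg;
    m->referenced[victim] = true;
    m->hand = (victim + 1) % HOST_REGS;
    return victim;
}
```

As long as a translation unit touches no more than HOST_REGS guest registers, only the first two loops ever run, which is the cheap common case described above.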
I'm more disgusted with the outright lack of optimizations entirely. Javac is barely more than a tokenizer - okay, yeah, it does flatten down expressions into stack format but it really does surprisingly little. I hear that you can get back perfectly genuine Java code from bytecode most of the time, hence the availability of Java obfuscating programs.
Indeed, when you follow the stack progression, you can rebuild the original syntax tree, which is more or less the original code. All that is lacking, is names and comments.
This is a flaw of all stack-based machines, so Microsoft's .NET is no different here. While they made a few things different, they also stuck to the stack VM. Sun probably didn't know that they would be using a JIT compiler on the bytecode, but Microsoft knew from the start and should have learnt from Sun's errors, so shame on Microsoft.
So not only is Java forcing a lot of expensive optimizations on the JVM but it's giving it less information to work with in the bytecode format. The whole thing is a mess, and it's a good reason not to compare JVM JIT to recompilers - aside from being stack based like nothing is, like you mentioned, JIT really is more like a compiler than a recompiler. And it has given people the misconception that recompilers need a lot of time to do a good job.
The JVM is a bit of a mess anyway. From low-level operations to high-level object related stuff. You won't see this in a normal processor. On the other hand, a JVM won't have those timing issues or problems with trying to optimize condition code handling.
In a dynarec sometimes less is more. I'm a bit of a perfectionist (probably also one of the reasons why I don't release much stuff into the public), so I told David Sharp all sorts of optimizations he could use. I think in the end he used very few, because he found out that trying too hard to optimize everything took more time than the speed-up was worth.
So what does that tell us? FX!32 may have been better off being fully dynamic. People argue a lot about static recompilers because they think that'll produce better code and provide advantages over dynarecs.
That's what I thought, too, until I read what the profiler actually does (ie. not much).
The other argument is, that they have most of the code pre-compiled in a database. But media access is much slower than RAM. So the question is, wouldn't it be faster to simply dynamically translate the code, than search a database for the right code segment and then load it? But that's more or less what you wrote further down as well.
I don't really see the point, also because you have to include the interpreter, because there are problems that cannot be solved during static recompilation.
I think there was a statically recompiling N64 emulator once (Corn?), but it just compiled all the code it found when the game was loaded. Yes, it was fast, but it never ran anything but Mario 64. And I bet there would be lots of code discovery problems with other ROMs.
Unaligned accesses can cause interrupts on most platforms where they're not supported outright, Alpha is no exception. Could have easily relied on this and patched the code at runtime as they occurred, or only after many of them occur from the same location.
But interrupts are expensive, and you have to attach your emulator quite deep in the system. Maybe they didn't want to do it.
One other thing - people often assume that dynarec means basic block at a time.
True, there is no theoretical limit on how big your translation units are.
You could actually see the N64 emulator above as a dynarec instead, only that the translation unit was the whole code in the ROM.
I guess I'd normally prefer basic blocks, since you probably don't translate the same code twice, and you often return to the dispatcher loop, which makes timing easier. Of course the latter part makes the dynarec slower than it could be. There are always advantages and disadvantages.
Of course you could also resort to chaining basic blocks, like Shade (http://www.cs.washington.edu/research/compiler/papers.d/shade.html). Although in that case your register allocation advantage might not work.
The only problem is dealing with recursive links.
Yes, that could render an emulator rather unresponsive ;-)
For those who don't know the problem: To understand recursion, you have to understand recursion first...
Overall an interesting discussion, BTW.
I forgot to mention a book:
Virtual Machines (http://www.elsevierdirect.com/product.jsp?isbn=9781558609105) by Jim Smith and Ravi Nair. While the bulk is probably about VMs like VMware, a whole chapter is on emulation techniques (interpretation, threaded code, binary translation), and other chapters also cover some problems related to dynamic binary translation.
serge2k
November 6th, 2009, 05:49
I've been trying to get this tutorial to work
Tutorials - Introduction To Dynamic Recompilation (http://web.archive.org/web/20051018182930/www.zenogais.net/Projects/Tutorials/Dynamic+Recompiler.html)
but there seem to be some bugs. I've fixed them except for this
ModRM(0, DISP32, to);
which occurs in a few places, it doesn't match the written function. Documentation isn't all that clear on what the function does either.
Anyone know the answer?
Exophase
November 6th, 2009, 06:35
Hey serge. I was helping zenogais with dynarecs back before he wrote the tutorial, and I thought that he used code at some point that was based on some I wrote for him. But it looks like he used stuff from GoldFinger so I'm not really sure what he did with what I gave him ;p
Anyway, what immediately strikes me as strange is that "register" is used as an identifier. Is that really okay in C++? I know C would have a fit since that's a keyword.
What do you mean when you say that the ModRM doesn't match the written function? I think to best understand this you need a thorough understanding of how x86 encoding works, which this document doesn't explain for you. Basically, the mod/rm byte in an instruction describes the registers. This page will help you:
sandpile.org -- IA-32 architecture -- 32bit mod R/M byte (http://www.sandpile.org/ia32/opc_rm32.htm)
The byte contains three fields: mod, r/m, and reg. Mod is mode, and describes the address mode. r/m refers to either a register or a memory location, depending on mode - it's also possible that it can refer to an extended mode. What reg refers to depends on the instruction: the list at the top of the link gives the possibilities, which include 8 different sets of registers. Another thing it can be, probably not as obvious, is where it refers to a sub operation - this is the case with ALU operations against immediates. Here the immediate has to be encoded additionally, so those bits are open and are used to determine what kind of ALU operation it is.
There are two examples of these extensions: if mod is 0 then the encoding for EBP (5) means to use [displacement32] instead of [EBP]. For modes 0 through 2 an encoding for ESP (4) means to use [SIB], where what that means is determined by another byte called the SIB (scale-index-base) byte. Mode 3 is plain register-to-register, so there the value simply means ESP. This is what the enums near the top of zenogais's code are for; they explain what those values of rm mean.
I do think that, like you said, there are errors. It appears that the code is confusing the r/m and reg fields. However, this is really only a semantic error, since he's storing what he calls register in the r/m field (lower 3bits) and r/m in what he calls rm (middle 3bits). So the errors basically cancel out, leaving him with code that probably worked. I didn't see any other logical errors in what he's doing.
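For reference, a small sketch of how the byte is put together, with mod in bits 7-6, reg in bits 5-3 and r/m in bits 2-0. The function and enum names are my own, not the ones from the tutorial. The SIB byte uses the same 2-3-3 layout for scale, index and base.

```c
#include <assert.h>
#include <stdint.h>

/* Names are for this sketch only; the numeric values are the x86 encodings. */
enum { MOD_INDIRECT = 0, MOD_DISP8 = 1, MOD_DISP32 = 2, MOD_REGISTER = 3 };
enum { EAX = 0, ECX, EDX, EBX, ESP, EBP, ESI, EDI };
enum {
    RM_SIB         = 4,  /* with mod 0..2: a SIB byte follows             */
    RM_DISP32_ONLY = 5,  /* with mod 0 only: [disp32] instead of [EBP]    */
    SIB_NO_INDEX   = 4   /* index field value meaning "no index register" */
};

/* mod: addressing mode, reg: register or sub-opcode, rm: register/memory */
uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}

/* scale is the shift amount (0..3, i.e. index*1, *2, *4 or *8) */
uint8_t sib(unsigned scale, unsigned index, unsigned base)
{
    assert(scale < 4 && index < 8 && base < 8);
    return (uint8_t)((scale << 6) | (index << 3) | base);
}
```

As a check against a known encoding: mov eax, [esp+4] is 8B 44 24 04, where 0x44 is modrm(MOD_DISP8, EAX, RM_SIB) and 0x24 is sib(0, SIB_NO_INDEX, ESP).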
Could you please elaborate a bit more on what's giving you difficulty?
serge2k
November 6th, 2009, 07:01
C++ has a fit too, I had to change that.
the function call doesn't match the header.
in a few places it makes calls like this
ModRM(0, to, DISP32)
where DISP32=5 and to is an X86RegisterType
but the function is written
void ModRM(unsigned char mod, unsigned char rm, X86RegisterType reg);
so I'm a bit confused. Thought maybe it was just a typo, but it doesn't work even if I call ModRM(0, DISP32,to).
so I'm not really sure where the problem is.
Guess I start reading that page now.
Exophase
November 6th, 2009, 15:44
When you say it doesn't work what do you mean exactly? That it doesn't produce the correct results or that it doesn't even compile?
X86RegisterType is an enum. In C you would be able to implicitly convert an integer to it, but in C++ you cannot. Maybe zenogais was adapting C code and tripped over it. He should have actually tested it, to say the least. As I explained to you, the code got the rm and reg fields swapped with each other, but it was consistent in how it output things. If you want to fix it you need to swap the names AND you need to swap how they're output by ModRM.
You could change X86RegisterType to unsigned char like the others. Casting from enums to integral types is no problem in C++.
But it sounds like you'd be better off just chucking this tutorial and figuring it out yourself. You're trying to fix something you don't really understand fundamentally and I don't see an awful lot of value in that.
serge2k
November 16th, 2009, 01:05
well I'm going to take a look at the intel manuals but for now I have a question again, if I just call the Ret function shouldn't that work when I attempt to call the BlockFunction?
I run the program through the debugger and it just crashes as soon as it tries to call the code.
Exophase
November 16th, 2009, 20:45
well I'm going to take a look at the intel manuals but for now I have a question again, if I just call the Ret function shouldn't that work when I attempt to call the BlockFunction?
I run the program through the debugger and it just crashes as soon as it tries to call the code.
You'll have to be more specific.
The first thing I recommend you do is dump your recompiled code buffer to a file and run it through a disassembler like objdump. This way you'll be able to verify what you have.
If it looks correct then you most likely either have a problem with how you're calling it or the code buffer is execute protected. If it's the latter then you'll have to make an OS call to turn off protection.
Paste more of what you're doing.
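In case it helps, dumping the buffer can be as simple as the following sketch (the function name and file name are made up):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes len bytes of generated code to path. Returns 0 on success, -1 on error. */
int dump_code_buffer(const char *path, const unsigned char *buf, size_t len)
{
    FILE *f;
    size_t written;
    assert(path != NULL && buf != NULL);
    f = fopen(path, "wb");               /* binary mode matters on Windows */
    if (f == NULL) return -1;
    written = fwrite(buf, 1, len, f);
    if (fclose(f) != 0 || written != len) return -1;
    return 0;
}
```

Since the file is raw machine code without any headers, objdump has to be told the format and architecture, for example objdump -D -b binary -m i386 dynarec_dump.bin for 32-bit x86 code.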
serge2k
November 16th, 2009, 22:41
I've attached all my code.
_Aphex_
February 17th, 2010, 01:06
This thread has been a great read, truly an invaluable resource. But anyway, I just wanted to give my thanks.
Cheers
_Aphex_
February 18th, 2010, 01:49
By the way, if anyone was getting an access violation from the OP's first Dynarec code, then here is the modified code that works. Credit given to Martyn.Rae on dreamincode.net (http://www.dreamincode.net/forums/index.php?showuser=266210)
#include <windows.h>
#include <stdio.h>
#include <string.h> /* memcpy */
/* In the beginning we'll have to define the function pointer. */
/* I called the function 'dyncode' and gave it an int argument */
/* as well as an int return value just to show what's possible. */
int (*dyncode)(int); /* prototype for call of dynamic code */
/* The following char array is initialized with some binary code */
/* which takes the first argument from the stack, increases it, */
/* and returns to the caller. */
/* Just very simple code for testing purposes... */
/* Note: this is 32-bit x86 code, so build the program as 32-bit. */
unsigned char code[] = {0x8B,0x44,0x24,0x04, /* mov eax, [esp+4] */
                        0x40,                /* inc eax */
                        0xC3                 /* ret */
};
int main(void)
{
    /* To show you that the code can be dynamically generated */
    /* although I defined static data above, the code is copied */
    /* into an allocated memory area and the starting address is */
    /* assigned to the function pointer 'dyncode'. */
    /* The strange stuff in front of VirtualAlloc is just to cast */
    /* the address to the same format the function pointer is */
    /* defined with, otherwise you'd get a compiler warning. */
    dyncode = (int (*)(int)) VirtualAlloc(NULL, sizeof(code),
                  MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    if (dyncode == NULL)
    {
        fprintf(stderr, "VirtualAlloc failed (error %lu)\n", GetLastError());
        return 1;
    }
    /* We now have a page of memory that is readable, writeable */
    /* and executable, so the memcpy will work without any */
    /* problems! */
    memcpy((void *)dyncode, code, sizeof(code));
    /* To show that the code works it is called with the argument 41 */
    /* and the retval should be 42, obviously. */
    printf("retval = %d\n", (*dyncode)(41) ); /* call the code and print the return value */
    /* Freeing the page allocated. The size must be 0 with MEM_RELEASE. */
    VirtualFree((void *)dyncode, 0, MEM_RELEASE);
    return 0;
}
:thumb:
serge2k
February 18th, 2010, 09:54
By the way, if anyone was getting an access violation from the OP's first Dynarec code (On windows 7), then here is the modified code that works. Credit given to Martyn.Rae on dreamincode.net
THANK YOU!
I had given up on getting this working. This works!
_Aphex_
February 19th, 2010, 03:21
You're welcome serge2k.
For anyone who is interested, on page 2, N-Rage asked for any x86 instruction set references. I found a couple of great references that give some very nice info on the opcodes and ISA. I hope it will help somebody.
8086 Opcode Map (http://mlsite.net/8086/)
x86 instruction set reference (http://siyobik.info/index.php?module=x86)
cottonvibes
February 19th, 2010, 04:46
hmm the second link is nice, added it to favorites; thanks _Aphex_ :p
been using this one for a while myself:
80386 Programmer's Reference Manual (http://www.itis.mn.it/linux/quarta/x86/index.htm)
Exophase
February 19th, 2010, 04:54
sandpile.org -- The world's leading source for pure technical x86 processor information. (http://www.sandpile.org)
_Aphex_
February 20th, 2010, 19:51
hmm the second link is nice, added it to favorites; thanks _Aphex_ :p
been using this one for a while myself:
80386 Programmer's Reference Manual (http://www.itis.mn.it/linux/quarta/x86/index.htm)
That's alright, thanks for your link as well. :thumb:
And thanks Exophase, your link is useful...very useful in fact :D
cottonvibes
March 16th, 2010, 00:30
I wrote a blog post here about how dynarecs work:
[blog] Introduction to Dynamic Recompilation (http://forums.pcsx2.net/post-101560.html)
It also explains how a basic interpreter emulator works, and points out why dynarecs are advantageous.
Also there was way too much I had to leave out due to complexity and to keep it short. But hopefully some people find it useful...
Hatorijr
March 19th, 2010, 01:24
I know by the time I am done with this little test program I may hate myself and possibly my PC, but I will be trying to convert the working code shown earlier for executing byte code in C++ into a C# program. Sadly C# cannot directly execute the code, BUT I already found a way around the issue, so once I am finished I will be posting the code and everything. For all those who are interested, keep an eye out :D.
cottonvibes
March 19th, 2010, 07:26
interesting, i'd love to see the C# code when you're done.
Hatorijr
March 21st, 2010, 21:46
Well, I "would" be posting a nice example of my current theory program, but I just found out I accidentally removed the C++ platform SDK that I thought I did not need, and it took out my windows.h file, so I am having to repair Visual Studio so I can compile my code :/.
Hatorijr
March 29th, 2010, 23:08
Sorry it has taken so long, what with having to repair my Visual Studio and other things that took priority. I do have a working example of my code, but I am going to be fixing it up in the coming days so it does more than spit out static information lol.
blueshogun96
March 30th, 2010, 00:22
Just in case anyone would like to know, VMachine is a pretty good open source PC emulator that is very straightforward and very easy to learn from if you'd like to see dynarec in action.
It would be nice to see updates of this emulator, including CD-ROM support, improved SoundBlaster emulation (isn't working for me), RAM size greater than 16MB, improved VGA emulation and maybe some 3D accelerators. I can't remember if the PCI bus is emulated or not, but if it isn't, that would also be a nice touch. Aside from the features it lacks, most of it is pretty good. Emulates Windows 95 rather well. With some more updates and new features, this would be a rather awesome PC emulator.
VMachine: x86 PC Emulator for Windows - Paul's Projects (http://www.paulsprojects.net/vmachine/vmachine.html)
Hatorijr
April 3rd, 2010, 02:07
I just found this recently and I think it'd be an amazing help to anyone looking for info on x86 and x64 opcodes. It shows the opcodes for the instructions at both a very easy to understand level and an advanced level, and even shows what the byte codes for the individual registers are.
X86 Opcode and Instruction Reference (http://ref.x86asm.net/)
serge2k
June 7th, 2010, 05:35
I'm trying to make the code work in linux now and I'm having issues.
I used mprotect to let me call the function without a segmentation fault but now I just keep getting zero back from the function.
Not really sure how I can debug this effectively
/* In the beginning we'll have to define the function pointer. */
/* I called the function 'dyncode' and gave it an int argument */
/* as well as an int return value just to show what's possible. */
int (*dyncode)(int); /* prototype for call of dynamic code */
/* The following char array is initialized with some binary code */
/* which takes the first argument from the stack, increases it, */
/* and returns to the caller. */
/* Just very simple code for testing purposes... */
unsigned char code[] = {0x8B,0x44,0x24,0x04, /* mov eax, [esp+4] */
0x40, /* inc eax */
0xC3 /* ret */
};
/* Include the prototypes of the functions we are using... */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
int foo(int);
int main(void)
{
/* To show you that the code can be dynamically generated */
/* although I defined static data above, the code is copied */
/* into an allocated memory area and the starting address is */
/* assigned to the function pointer 'dyncode'. */
/* The strange stuff in front of the malloc is just to cast */
/* the address to the same format the function pointer is */
/* defined with, otherwise you'd get a compiler warning. */
dyncode = (int (*)(int)) malloc(sizeof(code) + 4095);
dyncode = (int (*)(int))(((long long)dyncode + 4095 & ~(4095)));
int t = mprotect(dyncode, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC);
if(t == -1)
perror("test");
memcpy(dyncode, code, sizeof(code));
unsigned char *t2 = (unsigned char *)dyncode;
int i = 0;
for(i = 0; i < 10; i++)
{
printf("%x\n", *(t2+i));
}
/* To show that the code works it is called with the argument 41 */
/* and the retval should be 42, obviously. */
int test = dyncode(41);
//always returns 0
printf("retval = %d %d %d\n", test, dyncode(42), (*dyncode)(43));/* call the code and print the return value */
return 0;
}
Exophase
June 8th, 2010, 02:18
Are you on x86-64?
serge2k
June 8th, 2010, 03:02
yes I had that thought last night. rather obvious in hindsight.
i was getting sick of ubuntu anyway so I just installed a 32 bit version of arch.
the above code works fine now.
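For anyone else who trips over this: as far as I can tell the 32-bit bytes fail on x86-64 because the argument arrives in edi instead of on the stack, so the load from [rsp+4] picks up half of the return address, and 0x40 is no longer inc eax but a REX prefix. Here is a sketch that picks the byte sequence per architecture and uses mmap instead of malloc plus manual alignment. The function name and the 64-bit byte sequence are my own additions, and the System V calling convention (Linux, BSD) is assumed for the 64-bit case.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS with strict -std= flags */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#if defined(__x86_64__)
/* System V AMD64 ABI: the first int argument is passed in edi. */
static const unsigned char inc_code[] = { 0x8D, 0x47, 0x01,  /* lea eax, [rdi+1] */
                                          0xC3 };            /* ret */
#elif defined(__i386__)
/* 32-bit cdecl: the first argument sits above the return address. */
static const unsigned char inc_code[] = { 0x8B, 0x44, 0x24, 0x04, /* mov eax, [esp+4] */
                                          0x40,                   /* inc eax */
                                          0xC3 };                 /* ret */
#endif

/* Runs the generated code on arg. Returns 0 and stores the value in *result,
 * or -1 if the memory could not be set up or the CPU is not x86. */
int run_increment(int arg, int *result)
{
    assert(result != NULL);
#if defined(__x86_64__) || defined(__i386__)
    {
        int (*fn)(int);
        /* mmap returns page-aligned memory, so no manual rounding is needed. */
        void *mem = mmap(NULL, sizeof inc_code, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) return -1;
        memcpy(mem, inc_code, sizeof inc_code);
        /* Switch from writable to executable once the code is in place. */
        if (mprotect(mem, sizeof inc_code, PROT_READ | PROT_EXEC) != 0) {
            munmap(mem, sizeof inc_code);
            return -1;
        }
        fn = (int (*)(int))mem;
        *result = fn(arg);
        munmap(mem, sizeof inc_code);
        return 0;
    }
#else
    (void)arg;
    return -1;
#endif
}
```

Calling run_increment(41, &r) should leave 42 in r on both 32-bit and 64-bit x86 builds.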
serge2k
August 25th, 2010, 10:59
I just got my first opcode to translate.
addiu rt, rs, immediate
it's a mips r4400 CPU. I bet someone here can figure out what I'm trying to emulate ;)
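For readers who wonder what translating such an opcode involves, here is a rough sketch. It is not serge2k's code, and it makes several assumptions: a 32-bit x86 host, guest registers kept as an array of 32-bit values (the real R4000 family has 64-bit registers), EBX pointing at that array while translated code runs, and register 0 always holding zero. ADDIU has primary opcode 0x09, rs in bits 25-21, rt in bits 20-16 and a sign-extended 16-bit immediate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIPS_OP_ADDIU 0x09u
#define ADDIU_X86_LEN 11     /* bytes emitted for one ADDIU in this sketch */

/* Emits x86 code for one MIPS ADDIU into out.
 * Returns the number of bytes written (0 if the target is $zero),
 * or -1 if insn is not ADDIU or the buffer is too small. */
int translate_addiu(uint32_t insn, uint8_t *out, size_t cap)
{
    unsigned op = insn >> 26;
    unsigned rs = (insn >> 21) & 31;
    unsigned rt = (insn >> 16) & 31;
    /* sign-extend the 16-bit immediate */
    int32_t  imm  = (int32_t)((insn & 0xFFFFu) ^ 0x8000u) - 0x8000;
    uint32_t uimm = (uint32_t)imm;
    int n = 0;

    assert(out != NULL);
    if (op != MIPS_OP_ADDIU || cap < ADDIU_X86_LEN) return -1;
    if (rt == 0) return 0;             /* writes to $zero are discarded */

    out[n++] = 0x8B;                   /* mov eax, [ebx + rs*4] */
    out[n++] = 0x43;                   /* ModRM: mod=01, reg=EAX, r/m=EBX */
    out[n++] = (uint8_t)(rs * 4);      /* disp8, at most 124 */
    out[n++] = 0x05;                   /* add eax, imm32 */
    out[n++] = (uint8_t)(uimm & 0xFF);
    out[n++] = (uint8_t)((uimm >> 8) & 0xFF);
    out[n++] = (uint8_t)((uimm >> 16) & 0xFF);
    out[n++] = (uint8_t)((uimm >> 24) & 0xFF);
    out[n++] = 0x89;                   /* mov [ebx + rt*4], eax */
    out[n++] = 0x43;
    out[n++] = (uint8_t)(rt * 4);
    return n;
}
```

With register allocation the two memory accesses would of course go away whenever rs and rt already live in host registers; the load/store form is just the simplest thing that works.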
Dax
August 25th, 2010, 12:10
I just got my first opcode to translate.
addiu rt, rs, immediate
it's a mips r4400 CPU. I bet someone here can figure out what I'm trying to emulate ;)
PSP? :p
motke
August 25th, 2010, 13:33
Nintendo 64
serge2k
August 25th, 2010, 13:49
PSP? :p
indeed.
I have the first block translated.
Dax
August 25th, 2010, 15:24
Nice one. HLE I assume?