#1
Never give up your dreams
Join Date: Dec 2008
Location: Brazil
Posts: 181
I have been reading some forum threads, and most posts say that CUDA wouldn't help with emulation because the video card is already stressed by graphics emulation, so also using it for CUDA would bring no benefit.

Well... basically, my idea is to use 2 video cards. One video card would emulate the Xbox 360 CPU with CUDA (or OpenCL). That would get around the hardware speed limitation in emulating the Xbox 360 CPU, since a GPU can be up to 14x faster than a CPU! (Intel admitted that -> http://blogs.nvidia.com/2010/06/gpus...us-says-intel/) The other video card would handle the graphics emulation.

So the concept, as you can see, is simple (I'm not saying the implementation is simple too): one video card to emulate the Xbox 360 CPU and the other to emulate the Xbox 360 GPU. I searched some forums and found that we can choose which graphics card runs CUDA and leave the other one to do something else (in my case, graphics emulation). It's just a configuration task.

I have also read that a GPU couldn't emulate the CPU at all, but I think this method could still gain some speed by sharing the work: some of the emulation on CUDA and some on the user's CPU.

I do want to build my own emulator. Of course I won't start with the Xbox 360, but that's my main target. So... before I go forward and invest money in this project... I want your opinion about this idea... especially from the experienced ones... Thank you in advance!

PS: As far as I know, OpenCL is not as mature as CUDA, but comparing the code... they are really similar... so I can start developing the emulator in CUDA and then, when OpenCL gets more mature, translate the code. OpenCL has the advantage that it can run on both ATI and nVidia video cards.

#2
John
Join Date: Nov 2007
Location: Scotland
Posts: 5,498
Quote:

#3
Never give up your dreams
Join Date: Dec 2008
Location: Brazil
Posts: 181
Quote:
My real doubt is whether this configuration would bring benefits in general, especially in speed. Well... thank you anyway.

#4
クロッスエクス
Join Date: Mar 2006
Location: Argentina
Posts: 3,638
With my very limited understanding of things, I think that for GPGPU to be useful it needs to be used on extremely parallelizable (is that a word?) workloads, like rendering, which is what GPUs do so well. I cannot imagine a part of an emulator (the core?) that could benefit from that at all, especially with the latency of getting data back and forth from the GPU.

#5
Never give up your dreams
Join Date: Dec 2008
Location: Brazil
Posts: 181
Quote:
Well, I don't know the techniques for emulating a CPU yet, that's why I came here, but... are you sure there isn't anything in CPU emulation that could be done in parallel? Any kind of translation, calculation, or anything else that needs speed? My idea is not to emulate the CPU with ONLY CUDA, just to hand the heavy part to CUDA. You also pointed out another possible problem: the latency. But can the latency between GPU and CPU really slow things down so much that we wouldn't get much benefit from this technology? Do you have any real experience to share that shows it's not a good idea to use CUDA? Thank you!

#6
クロッスエクス
Join Date: Mar 2006
Location: Argentina
Posts: 3,638

#7
Never give up your dreams
Join Date: Dec 2008
Location: Brazil
Posts: 181
hmm... for a moment I thought you were an emu author... anyway, thanks! Yeah... I was thinking of beginning with something really small like that. Thank you for helping... Does anybody else have something to add? Any experience, statistics, etc...?

#8
Registered User
Join Date: Aug 2008
Location: Northeast Ohio
Posts: 368
Just an FYI: for the end user, OpenCL is much preferred because of its vendor neutrality.
__________________
emuforums' resident straight male kuutsundere. (that's a kuudere-tsundere hybrid; from left to right is the outermost layer to the innermost) CPU: Athlon 64 x2 4800+ (65nm G2) @ 1GHz/0.775v-3.1GHz/1.375v IGP: Radeon 4200 @ 500MHz GPU: EVGA GeForce 8800GS 384MB PCIe RAM: Corsair 4x1GB 667MHz OS1: Windows XP Pro SP3 32bit OS2: Windows 7 Ultimate SP1 64bit |

#9
Never give up your dreams
Join Date: Dec 2008
Location: Brazil
Posts: 181
Quote:
Quote:

#10
Registered User
Join Date: Oct 2011
Posts: 189
There are some major issues with your idea of using a GPU to emulate a CPU...

A GPU is good at dealing with parallel tasks in a SIMD fashion (doing the same thing to each element of a vector), and generally speaking it is mostly focused on math computations... If you disassemble .xex files, you'll see that games don't launch a bunch of threads or do things in parallel; they're like PC games (I guess I could say they *are* PC games): they have a main thread that does a lot of things, and some smaller threads for limited, asynchronous tasks (network, sound, Kinect...).

Another problem is the kind of instructions a CPU executes, as opposed to a GPU. Take a simple instruction like RLWINM, which is basically a left rotation followed by applying a mask: this would be really painful to do with a GPU...

Now obviously there are some things that can be done in parallel. For example, something like static recompilation would benefit heavily from being done in parallel, once you've isolated the code. Another place where it could be useful is reconstructing the operations that have to be done by the GPU (Xbox 360 games use Direct3D 9, but as a static library; you actually have a single function that receives a buffer containing all the instructions generated by the D3D9 code... including shaders, vertex/texture objects, etc). One last one I can think of (after thinking about it for at least half a second) is decoding the XMA streams, since those are normally decoded on dedicated hardware.

Anyway, at this point I guess it'd be more interesting for you to focus on getting something to work rather than thinking about optimizations this early in the process. If it works with a CPU, porting it to use a GPU won't be that big a deal, not to mention that debugging your emulation is clearly easier with a CPU...
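For readers unfamiliar with RLWINM (PowerPC "rotate left word immediate then AND with mask"), here is a minimal Python sketch of what an interpreter core would compute for it. The function names `rotl32`, `ppc_mask` and `rlwinm` are made up for illustration, and the sketch models only the 32-bit result; it assumes the usual PowerPC bit numbering, where bit 0 is the most significant bit, and says nothing about how a real Xbox 360 emulator structures its CPU core.

```python
MASK32 = 0xFFFFFFFF

def rotl32(value, shift):
    """Rotate a 32-bit value left by shift bits (shift taken modulo 32)."""
    value &= MASK32
    shift &= 31
    return ((value << shift) | (value >> (32 - shift))) & MASK32

def ppc_mask(mb, me):
    """Build the MB..ME mask using PowerPC bit numbering (bit 0 is the MSB)."""
    from_mb = MASK32 >> mb                     # ones from bit mb through bit 31
    up_to_me = (MASK32 << (31 - me)) & MASK32  # ones from bit 0 through bit me
    if mb <= me:
        return from_mb & up_to_me              # contiguous run of ones
    return from_mb | up_to_me                  # wrapped mask when mb > me

def rlwinm(rs, sh, mb, me):
    """Hypothetical helper: rotate rs left by sh, then AND with the MB..ME mask."""
    return rotl32(rs, sh) & ppc_mask(mb, me)
```

As a usage example, `rlwinm(0x12345678, 8, 24, 31)` rotates the top byte into the low byte and masks everything else off, giving `0x12`. Compilers lean on this instruction for shifts and bit-field extraction, so an emulated CPU sees it constantly.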

#11
代言人
Join Date: Dec 2006
Location: 應許之地
Posts: 7,056
A GPU emulator will not work, simply because the algorithms involved are not written for massively multithreaded architectures. GPU pathfinding looks a lot different from single-threaded A* pathfinding, even though it is derived from A*. The main difference: GPUs offer little or no support for recursion. In other words, the entire algorithm has to be translated, not just the instructions.

Try running single-threaded instructions on a GPU and you will get something like 1/100 of the performance. A GPU is only fast when you write an algorithm that is inherently parallelizable and executable on a GPU. For example, you have to rewrite every single recursion into a loop, if that is even possible. You also need spin locks or atomic operations to ensure memory consistency. It's just gonna get ugly to force a single-threaded algorithm into a massively multithreaded one. If you port CPU code straight to the GPU and enable multithreaded kernels, you are gonna end up with either memory hazards or 1/100 of the performance.

Bucket sort is a great example for performance. If you just sort within each block and use atomic operations to collect the results, the performance will be terrible. You need to combine parallel reduction and bubble sort to get an optimal result on a GPU.

An easier example, sum, is great for memory hazards. Suppose you let each block take one address and add its value onto a single destination address that starts at zero. The value at the destination will be the value of the last block that accessed it. Again, you need to use parallel reduction, which takes log2(n) iterations for an array of size n.

Not to mention that it is hard to keep the entire grid saturated. You have an order of execution to respect. Even with asynchronous kernel calls, you get dependencies. The longest path determines the total execution time. Suppose you can parallelize 90% of the instructions, with only one serial chain of instructions left: if that 10% takes more time to execute on the slower cores (GPU), the CPU is gonna beat the crap out of the GPU.
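The sum example can be sketched on the CPU in plain Python. This is only a sequential simulation under simplifying assumptions, not GPU code: `racy_sum` models blocks that all read the destination before any write lands, and `tree_reduce_sum` models a pairwise reduction where each pass stands in for one parallel step. Both function names are invented for this illustration.

```python
def racy_sum(values):
    """Model unsynchronized blocks: every block reads the stale 0, then writes."""
    dest = 0
    snapshots = [dest for _ in values]      # each block's read happens first
    for seen, value in zip(snapshots, values):
        dest = seen + value                 # each write overwrites the previous one
    return dest                             # only the last block's value survives

def tree_reduce_sum(values):
    """Pairwise (tree) reduction; returns the sum and the number of passes."""
    data = list(values)
    steps = 0
    while len(data) > 1:
        if len(data) % 2:
            data.append(0)                  # pad odd-length passes with the identity
        # All pairs in one pass are independent, so a GPU could add them at once.
        data = [data[i] + data[i + 1] for i in range(0, len(data), 2)]
        steps += 1
    return (data[0] if data else 0), steps
```

For the eight values 1 through 8, `racy_sum` returns 8 (the last value written), while `tree_reduce_sum` returns the correct 36 after 3 passes, which matches log2(8).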
Lastly, and most importantly, the most time-consuming part of any GPU operation is memory transfer. You cannot just let the GPU do a certain operation and pop the result back to the CPU. The CPU and GPU use different memories. For most operations that are not rendering related, by the time you have passed the data down to the GPU, the CPU could already have finished the task on its own. You either do everything on the GPU or everything on the CPU. All in all, the GPU is awesome at tasks designed specifically for the GPU. The CPU should handle all the misc tasks, unless you have all the tasks well pipelined.
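The transfer argument can be put into a rough break-even model. The sketch below is a back-of-the-envelope estimate, not a measurement: the function names are invented, and the latency (10 microseconds per transfer) and bandwidth (8 GB/s) are placeholder assumptions chosen only to show the shape of the trade-off.

```python
def offload_time(bytes_moved, gpu_compute_s, latency_s, bandwidth_bytes_per_s):
    """Estimated wall time to ship data to the GPU, compute, and ship results back."""
    # Two transfers (input down, result back), each paying a fixed latency.
    transfer_s = 2 * latency_s + bytes_moved / bandwidth_bytes_per_s
    return transfer_s + gpu_compute_s

def worth_offloading(cpu_compute_s, bytes_moved, gpu_compute_s,
                     latency_s=10e-6, bandwidth_bytes_per_s=8e9):
    """True if the modeled GPU round trip beats doing the work on the CPU."""
    gpu_total_s = offload_time(bytes_moved, gpu_compute_s,
                               latency_s, bandwidth_bytes_per_s)
    return gpu_total_s < cpu_compute_s

# Assumed workloads: a tiny per-instruction task versus one large batch job.
tiny_task = worth_offloading(cpu_compute_s=1e-6, bytes_moved=64, gpu_compute_s=1e-7)
big_batch = worth_offloading(cpu_compute_s=50e-3, bytes_moved=64_000_000,
                             gpu_compute_s=2e-3)
```

Under these assumed numbers the tiny task loses badly, because the fixed round-trip latency alone is twenty times the CPU time, while the large batch wins. Emulating one guest instruction at a time looks like the first case.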
Last edited by Fadingz; June 30th, 2012 at 22:00.

#12
Never give up your dreams
Join Date: Dec 2008
Location: Brazil
Posts: 181
Well... OK then... really frustrating, but... those are the facts and I have to deal with them... -Ashe- and Fadingz... thank you for giving me this advice and information. I hope to find another way of developing an Xbox 360 emulator. I'm going to study, get a better understanding of emulation and of how the Xbox 360 works, etc... anyway, I won't give up.

#13
Registered User
Join Date: Dec 2007
Location: Australia
Posts: 1,317
Quote:

#14
代言人
Join Date: Dec 2006
Location: 應許之地
Posts: 7,056
The Xbox 360 uses a tri-core architecture. Here are my notes on the Xbox 360 spec from my graduate class (taken from Dr. Milo Martin's lecture slides, also available in the public domain):

• ISA extended with VMX-128 operations
• 128 registers, 128 bits each
• Packed “vector” operations
• Example: four 32-bit floating point numbers
• One instruction: VR1 * VR2 → VR3
• Four single-precision operations
• Also supports conversion to Microsoft DirectX data formats
• Similar to Altivec (and Intel’s MMX, SSE, SSE2, etc.)
• Works great for 3D graphics kernels and compression

• Peak performance: ~75 gigaflops
• Gigaflop = 1 billion floating-point operations per second
• Pipelined superscalar processor
• 3.2 GHz operation
• Superscalar: two-way issue
• VMX-128 instructions (four single-precision operations at a time)
• Hardware multithreading: two threads per processor
• Three processor cores per chip
• Result: 3.2 * 2 * 4 * 3 = ~77 gigaflops

• ISA: 64-bit PowerPC chip
• RISC ISA
• Like MIPS, but with condition codes
• Fixed-length 32-bit instructions
• 32 64-bit general purpose registers (GPRs)

Each Xenon chip:
• 165 million transistors
• IBM’s 90nm process
• Three cores
• 3.2 GHz
• Two-way superscalar
• Two-way multithreaded
• Shared 1MB cache

Pipeline:
• Four-instruction fetch
• Two-instruction “dispatch”
• Five functional units
• “VMX128” execution “decoupled” from other units
• 14-cycle VMX dot-product
• Branch predictor:
  • “4K” G-share predictor
  • Unclear if 4KB or 4K 2-bit counters
  • Per thread

Memory:
• 128B cache blocks throughout
• 32KB 2-way set-associative instruction cache (per core)
• 32KB 4-way set-associative data cache (per core)
• Write-through, lots of store buffering
• Parity
• 1MB 8-way set-associative second-level cache (per chip)
• Special “skip L2” prefetch instruction
• MESI cache coherence
• Error Correcting Codes (ECC)
• 512MB GDDR3 DRAM, dual memory controllers
• Total of 22.4 GB/s of memory bandwidth
• Direct path to GPU

GPU parent die:
• 232 million transistors
• 500 MHz
• 48 unified shader ALUs
• Mini-cores for graphics

GPU daughter die:
• 100 million transistors
• 10MB eDRAM
• “Embedded”
• NEC Electronics
• Anti-aliasing
• Render at 4x resolution, then sample
• Z-buffering
• Track the “depth” of pixels
• 256GB/s internal bandwidth
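The peak-performance line in these notes is a plain product of four factors, and the "one instruction, four operations" point is what supplies the factor of 4. A small Python sketch of both, using the figures quoted in the notes; the names `peak_gigaflops` and `vmul4` are made up for illustration, and `vmul4` only models the packed multiply on tuples, not real VMX-128 semantics.

```python
CLOCK_GHZ = 3.2      # clock rate from the notes
ISSUE_WIDTH = 2      # two-way superscalar issue
VECTOR_LANES = 4     # four single-precision operations per VMX-128 instruction
CORES = 3            # three processor cores per chip

def peak_gigaflops(clock_ghz, issue_width, lanes, cores):
    """Theoretical peak: every issue slot on every core does a full vector op."""
    return clock_ghz * issue_width * lanes * cores

def vmul4(vr1, vr2):
    """Model of a packed multiply: four independent products in one 'instruction'."""
    return tuple(a * b for a, b in zip(vr1, vr2))

peak = peak_gigaflops(CLOCK_GHZ, ISSUE_WIDTH, VECTOR_LANES, CORES)
packed = vmul4((1.0, 2.0, 3.0, 4.0), (2.0, 2.0, 2.0, 2.0))
```

The product comes to about 76.8, which the slides round to ~77 gigaflops. It is an upper bound that assumes every issue slot is a vector operation on every cycle, so real code lands well below it.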
Last edited by Fadingz; July 1st, 2012 at 10:05.

#15
Moving into the beat
Join Date: May 2004
Location: Perpetual Hawaii
Posts: 11,271
Accelerating instruction execution is only viable once the instructions are already properly emulated. CUDA could be useful for accelerating development and research, not execution. System compatibility would also suffer. What about users of discrete GPUs, ATI? Performance aside, the output couldn't even be guaranteed to match across hardware.

#16
Never give up your dreams
Join Date: Dec 2008
Location: Brazil
Posts: 181
@Fadingz, again, thank you. I downloaded the PDF you got this information from. I will study it.
Quote:

#17
Last Xbox Emu Author
Join Date: Jun 2004
Location: Seattle, WA, USA
Posts: 5,843
Every time I see a post about Xbox 360 emu theory, the most important perspectives are missed... \/_\/

You guys ONLY focus on hardware and think almost nothing of the software aspect. Without that, you have nothing to go by. You can come up with a million and one ideas to emulate the hardware, or even manage to emulate the hardware perfectly, but if you insist on being clueless about the software side (i.e. BIOS, HDD format, Dashboard, .xex files, etc.) then you're wasting your time altogether, because these things don't magically come documented or available either. Has the .xex file format been properly documented yet? How does the BIOS boot sequence work? What dashboards have been dumped? These are some of the things you have to think about before you can even think of touching anything hardware related.

Another thing: while I did enjoy scanning Fadingz's post on how the hardware interacts and whatnot (which is also important to know), what good is it if you don't have register-level information on the chipsets? Where are the devices located in the 64GB address space (MMIO), or what ports are they mapped to? Do you have any information on the exclusive instructions the 360's CPU has? Have you thought about how the GPU can directly access video memory (seriously, that's f@#%ing scary!!!) and how you'd possibly emulate that? Do you even know what you need to emulate the vector units, or what instruction set they're based on?? What else is there to emulating sound besides the SiS audio chip (what DSPs are we dealing with)? How does the gamepad interact on a hardware level? Those are just a few obstacles to think about in the beginning stages. You can know what the hardware specs are and still get nowhere fast if you don't know how it works on a byte level; knowing the hardware specs is only 20% of the battle at most.
Of course, it's impossible to lay out all of the requirements in the beginning, but eventually you need to ask yourself how you are going to document all this. Some things you'll need to have right from the start; other things you can reverse engineer later. As for finding register-level documentation on these chipsets, some are easier to find than others. The GPU is based off of the ATI R600, and the specs are freely available, as are the USB 2.0 specs needed to emulate the gamepads and other input devices. There's even someone who's managed to write homebrew code to interface with the GPU on a low level, which can give you something to test on.

The CPU is another story. It's easy enough to find some docs on PPC64, but that's not actually enough to emulate it. Besides the fact that you have to emulate 3 of those suckers, what about the VUs? They use a special instruction set called AltiVec (IIRC). I searched the net for documentation on this, but haven't found s@#%. I did get my hands on a datasheet when I was working for Microsoft 2 years ago, but anything else I had access to (which wasn't very much) was closely guarded, and a breach of such information would not only get me fired but even blacklisted. Good and accurate information on the CPU is hard enough to come by. And I don't even want to think about threading... argh!
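To make the MMIO question concrete, here is a minimal Python sketch of how an emulator's memory bus might route an address either to RAM or to a device model. Everything in it is hypothetical: the class names, the base address `0x80000000`, and the window size are invented for illustration and are not real Xbox 360 addresses, which is exactly the register-level information this post says has to be documented first.

```python
class LoggingDevice:
    """Stand-in device that records every access, handy while reverse engineering."""
    def __init__(self):
        self.log = []

    def read8(self, offset):
        self.log.append(("read", offset))
        return 0                          # unknown register: return a placeholder

    def write8(self, offset, value):
        self.log.append(("write", offset, value))

class Bus:
    """Routes byte accesses to a mapped device window, or to plain RAM otherwise."""
    def __init__(self, ram_size):
        self.ram = bytearray(ram_size)
        self.windows = []                 # list of (base, size, device)

    def map_device(self, base, size, device):
        self.windows.append((base, size, device))

    def _lookup(self, addr):
        for base, size, device in self.windows:
            if base <= addr < base + size:
                return device, addr - base
        return None, addr

    def read8(self, addr):
        device, offset = self._lookup(addr)
        return device.read8(offset) if device else self.ram[addr]

    def write8(self, addr, value):
        device, offset = self._lookup(addr)
        if device:
            device.write8(offset, value)
        else:
            self.ram[addr] = value & 0xFF
```

Mapping a `LoggingDevice` over an unknown region and running guest code against it is one cheap way to see which registers software touches and in what order, before any real device behavior is implemented.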
__________________
Official Website of Shogun3D's RyuAwai! Shogun3D Game Development Blog Zengjük a Dalt: Manliest Song Ever!
Last edited by blueshogun96; July 4th, 2012 at 09:58.

#18
Registered User
Join Date: Jan 2012
Location: Australia
Posts: 433
Holy shiz, I've got a long way to go before I make an emu.
__________________
You say I'm crazy? I say you're talking to a gumnut. |

#19
クロッスエクス
Join Date: Mar 2006
Location: Argentina
Posts: 3,638

#20
Never give up your dreams
Join Date: Dec 2008
Location: Brazil
Posts: 181
hmmm... I see your point, BlueShogun. Thank you for the information and for the kinds of questions I have to ask myself before starting work on an emu... I know these things are really important... but let's not put the cart before the horse... I don't intend to build an Xbox 360 emulator right now; it was just an idea I had a year ago and wanted to share now... I don't know when I'm going to step into Xbox 360 emulation; perhaps I can help you with Xbox 1 emulation first... I don't know yet. Anyway, your reply was really useful. Thanks!