Windows Phone Thoughts - Daily News, Views, Rants and Raves

  #1  
Old 12-06-2002, 11:15 AM
Andy Sjostrom
Pontificator
Join Date: Aug 2006
Posts: 1,177
Christmas Wish: A New Bus For My XScale

As we head into the Holiday Season I enjoy seeing more Pocket PC models based on the XScale processor making their entry into the market. But the new models remind me of the XScale discussions we had this summer, and I can't help re-visiting...

...Jason's post "XScale and the Pocket PC - what's going on?" in which Ed Suwanjindar, from the Microsoft Mobile Devices group, responded to Jason's questions, and Chris De Herrera's article "Improving the Speed of XScale" in which Chris lists some recommendations on what Microsoft, Intel and hardware manufacturers should change.

I still felt I needed more details to understand what is really going on, and turned to Sven Myhre, CEO of Amazing Games. Sven is an extremely talented coder, artist, and 3D modeller, and I asked him the bottom line question: "Why is the XScale at 400 MHz sometimes faster, sometimes slower, and often exactly the same as a 206 MHz StrongARM CPU?" Read on to find out why I wish not for a faster processor or an optimized operating system, but a new bus!





Andy Sjostrom: "Why is the XScale at 400 MHz sometimes faster, sometimes slower, and often exactly the same as a 206 MHz StrongARM CPU?"

Sven Myhre: "In theory, a 400 MHz XScale will always be faster than a 206 MHz StrongARM. In real life, however, CPU performance depends on a lot more than raw MHz. The CPU needs to do something useful with all its raw speed - and that means we need to feed it with code instructions and data to process, and we need to make sure the result of all its processing is stored. This is where the memory bus comes into the picture.

XScale and StrongARM Pocket PC designs use a 16 bit memory bus. Both CPU families are 32 bit RISC processors that use 32 bit code instructions as well as (usually) 32 bit chunks of data. Just to feed the CPU with enough code instructions to keep it running at full speed we really need a memory bus running at twice the speed of the CPU, since the bus needs to transfer two 16 bit chunks to feed the CPU one 32 bit code instruction. And since we want the CPU to live a meaningful life, we also need the memory bus to transfer some data back and forth between the CPU and memory. For most applications, increasing the memory bus speed by another 25% should pretty much cover normal data traffic.

So, to keep our CPU running at full speed, our beloved Pocket PC should have a bus that runs at least 2.5 times as fast as the CPU (twice the CPU clock, plus another 25%). A 400 MHz XScale should have a 1000 MHz bus, a 300 MHz XScale should have a 750 MHz bus, a 200 MHz XScale should have a 500 MHz bus and a 206 MHz StrongARM should have a 515 MHz bus..."
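Sven's rule of thumb can be written down as a few lines of arithmetic. This is only a sketch of the figures quoted above: the function name is made up for illustration, and the 25% data allowance is his rough estimate, not a measured value.

```python
BUS_WIDTH_BITS = 16      # memory bus width quoted in the interview
INSTRUCTION_BITS = 32    # size of one ARM code instruction
DATA_ALLOWANCE = 0.25    # rough extra share for data traffic (Sven's estimate)

def required_bus_mhz(cpu_mhz):
    """Bus clock needed to feed one instruction per CPU cycle, plus data traffic."""
    transfers_per_instruction = INSTRUCTION_BITS // BUS_WIDTH_BITS  # 2 on a 16 bit bus
    return cpu_mhz * transfers_per_instruction * (1 + DATA_ALLOWANCE)

for cpu_mhz in (400, 300, 200, 206):
    print(cpu_mhz, "MHz CPU ->", required_bus_mhz(cpu_mhz), "MHz bus")
```

Changing `BUS_WIDTH_BITS` to 32 halves every result, which is why a wider bus helps as much as a faster one.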

Reality Is Different

Sven continues. "In reality the speed factor between CPU and the memory bus is the opposite of what we just described. XScale Pocket PCs running at 400 MHz, 300 MHz and 200 MHz all use a 100 MHz bus, and the 206 MHz StrongARM uses a 103 MHz bus.

I guess you just spotted the main bottleneck, and the reason why XScale running at 400 MHz, 300 MHz and 200 MHz get almost identical benchmark results for tests that involve shuffling memory around - typical applications are graphics and multimedia. (Note: some Pocket PCs incorporate graphics accelerators that might confuse this picture a little bit.) And since the StrongARM uses a bus that is 3 % faster than the bus used in XScale, we also find a logical explanation for why StrongARM based Pocket PCs are sometimes slightly faster than XScale based Pocket PCs in some tests."
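To put a number on the bottleneck: the peak transfer rate depends only on the bus clock and width, not on the CPU clock. The sketch below assumes one 16 bit transfer per bus cycle and ignores memory latency, so real figures would be lower. The function name is hypothetical.

```python
def peak_bandwidth_mb_per_s(bus_mhz, bus_width_bits=16):
    """Upper bound on bus throughput, assuming one transfer per bus cycle."""
    return bus_mhz * bus_width_bits / 8

xscale = peak_bandwidth_mb_per_s(100)      # same for the 200, 300 and 400 MHz parts
strongarm = peak_bandwidth_mb_per_s(103)

print("XScale bus:   ", xscale, "MB/s")
print("StrongARM bus:", strongarm, "MB/s")
print("StrongARM advantage:", round((strongarm / xscale - 1) * 100), "%")
```

Note that the CPU clock does not appear anywhere in the formula, which matches the near-identical results across the three XScale speeds.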

Now It Gets Complicated

"Hardware designers knew they had to come up with a way to feed the CPU all the code instructions and data it needs, faster than the slow memory bus can provide them. So they added a cache to the CPU. The StrongARM has 8 KB of code cache and 8 KB of data cache, while the XScale uses a 32 KB code cache and a 32 KB data cache. Whenever the CPU tries to load a code instruction or a chunk of data, it will first search its cache to see if it is already loaded. If it finds it in the cache, it can access the code instruction or data chunk at full speed. Bang! Your 400 MHz XScale roars, and will chew up code instructions at a blazing speed - 400 million of them per second.

But what happens if the information the CPU is looking for is not found in cache? This brings us to the flip side of the cache - it becomes a double-edged sword that turns around and hits you hard if you don't pay attention as a coder. Since the CPU needs to search the cache very quickly, the cache is organized into what we call cache lines. The cache line is in fact the smallest unit that can be read from memory under normal conditions. On both XScale and StrongARM, the cache line happens to be 16 words (in the realm of ARM architecture, a word equals 32 bits or 4 bytes). So a cache line is 64 bytes, and even if the CPU just needs to access a single byte, it still has to read 64 bytes from memory to fill an entire cache line before returning with the single byte."
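The line-at-a-time behaviour can be illustrated with a toy model. This is a deliberately simplified sketch: it uses the 64 byte line size given above, has unlimited capacity, and the class and function names are invented for the example.

```python
LINE_BYTES = 64  # 16 words x 4 bytes, as stated in the interview

def line_base(address):
    """Start address of the cache line that contains `address`."""
    return address & ~(LINE_BYTES - 1)

class TinyCache:
    """Toy cache with unlimited capacity; it only tracks which lines are loaded."""

    def __init__(self):
        self.lines = set()
        self.bytes_fetched = 0

    def read_byte(self, address):
        base = line_base(address)
        if base in self.lines:
            return "hit"
        # A miss always pulls in the whole line, even for a single byte.
        self.lines.add(base)
        self.bytes_fetched += LINE_BYTES
        return "miss"

cache = TinyCache()
print(cache.read_byte(0x1003), cache.bytes_fetched)  # first touch of the line
print(cache.read_byte(0x1004), cache.bytes_fetched)  # neighbouring byte, same line
```

The second read costs nothing extra because its neighbour already dragged the whole line in.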

A Real World Example

"Joe Coder decides to make the world's best PIM. He needs to store records (or structures) of all his contacts - and Joe Coder is popular so he has 1000 contacts. For each contact he needs to store their first name, surname and phone number, so he sets aside 64 bytes to store each contact. Then he wants to sort them by their surname and present them nicely on the screen. For each contact he probably just needs to read the first few letters in their surname in order to sort them correctly.

The problem is that even if Joe Coder just reads a few bytes from each contact record, the CPU will read 64 bytes from memory to the cache every time he accesses a new surname. And if Joe Coder was a lazy coder, he might not have bothered to check that each record was aligned on a 64 byte boundary - so a surname might actually span two cache lines, meaning the CPU will read 128 bytes for every access to a new surname. But even if we assume he did his homework and aligned the memory correctly, a StrongARM will have used all of its data cache after reading just 128 surnames (8192 bytes / 64 bytes = 128 cache lines). An XScale would be able to fit 512 surnames (32768 bytes / 64 bytes = 512 cache lines) before it had to start writing over previously read cache lines. But Joe Coder needed to read through the entire list of 1000 contacts before starting over again - so neither the StrongARM nor the XScale would be able to use their cache to their advantage.
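Both the capacity figures and the alignment penalty are simple integer arithmetic. A small sketch, using the cache and line sizes quoted above. The helper names are made up, and the offsets in the last two calls are arbitrary examples.

```python
LINE_BYTES = 64

def lines_in_cache(cache_bytes):
    """How many cache lines fit in a data cache of the given size."""
    return cache_bytes // LINE_BYTES

def lines_touched(start, length):
    """How many cache lines a field of `length` bytes at address `start` spans."""
    first = start // LINE_BYTES
    last = (start + length - 1) // LINE_BYTES
    return last - first + 1

print(lines_in_cache(8 * 1024))    # StrongARM data cache
print(lines_in_cache(32 * 1024))   # XScale data cache

# A 16 byte surname field inside an aligned record stays within one line...
print(lines_touched(start=0, length=16))
# ...but if the records are shifted, the same field can straddle two lines.
print(lines_touched(start=56, length=16))
```

Any field whose start and end fall in different 64 byte blocks costs two line fills instead of one.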

All Joe Coder wanted was to read 4 bytes from each surname, for a total of 4000 bytes. But the CPU ended up transferring a total of 64000 bytes from memory to the cache. A 206 MHz StrongARM would have spent 64000 cycles waiting, while a 400 MHz XScale would have spent 128000 cycles waiting. The deciding factor was the 103 MHz vs. 100 MHz bus, and the StrongARM would have been slightly faster.
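The cycle counts follow from the bus width and the CPU-to-bus clock ratio. The sketch below reproduces them under the same simplification the text makes (one 16 bit transfer per bus cycle, no extra latency). The function names are hypothetical.

```python
def wait_cycles(bytes_transferred, cpu_mhz, bus_mhz, bus_width_bytes=2):
    """CPU cycles spent while the bus moves the given number of bytes."""
    bus_transfers = bytes_transferred / bus_width_bytes
    cpu_cycles_per_bus_cycle = cpu_mhz / bus_mhz
    return bus_transfers * cpu_cycles_per_bus_cycle

def wait_microseconds(bytes_transferred, cpu_mhz, bus_mhz):
    """The same wait expressed in wall-clock time."""
    return wait_cycles(bytes_transferred, cpu_mhz, bus_mhz) / cpu_mhz

TOTAL_BYTES = 1000 * 64  # one 64 byte cache line per contact

for name, cpu_mhz, bus_mhz in (("StrongARM", 206, 103), ("XScale", 400, 100)):
    print(name,
          wait_cycles(TOTAL_BYTES, cpu_mhz, bus_mhz), "cycles,",
          round(wait_microseconds(TOTAL_BYTES, cpu_mhz, bus_mhz), 1), "us")
```

The XScale burns twice as many cycles waiting, but its cycles are about half as long, so in wall-clock terms only the 3 % bus difference remains.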

Joe Coder made the cache design work against him. He forgot that a cycle is a terrible thing to waste. If Joe Coder had been clever, he might have reorganized his data structures. By storing all the surnames in a separate list, he could have made the cache work for him instead. Let us say he thinks 16 bytes are enough for a good surname, so 4 surnames would fit sequentially in a cache line (64 bytes). He would still have the penalty of waiting for the cache line to fill up when he reads the first surname, but when he reads surnames 2, 3 and 4 they would already be present in the cache and he could read them at full speed. So this time around, the CPU ended up transferring just 16000 bytes in total. And if Joe Coder was lucky enough to own a 400 MHz XScale, they would all still be present in his cache when he finished - so he could go over them again, and this time they could all be accessed at full speed. Poor Joe Coder, however, owns a StrongARM - so he still could not fit everything in the cache, and the second run through them would take the same amount of time.
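The two layouts can be compared with a small simulation. This is a sketch under stated assumptions: it models the data cache as fully associative with least-recently-used replacement, which is simpler than the replacement policy of the real chips, and all names are invented for the example.

```python
from collections import OrderedDict

LINE_BYTES = 64
CONTACTS = 1000

def bytes_fetched_per_pass(addresses, cache_bytes, passes=2):
    """Bytes pulled over the bus on each pass, using a simple LRU cache model."""
    capacity = cache_bytes // LINE_BYTES
    cache = OrderedDict()          # line number -> True, oldest entry first
    results = []
    for _ in range(passes):
        fetched = 0
        for address in addresses:
            line = address // LINE_BYTES
            if line in cache:
                cache.move_to_end(line)        # mark as recently used
            else:
                fetched += LINE_BYTES          # a miss fills a whole line
                cache[line] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used line
        results.append(fetched)
    return results

# Layout 1: surname read from inside each 64 byte contact record.
records = [i * 64 for i in range(CONTACTS)]
# Layout 2: surnames packed into their own list, 16 bytes each.
surnames = [i * 16 for i in range(CONTACTS)]

for name, cache_bytes in (("StrongARM", 8 * 1024), ("XScale", 32 * 1024)):
    print(name, "records: ", bytes_fetched_per_pass(records, cache_bytes))
    print(name, "surnames:", bytes_fetched_per_pass(surnames, cache_bytes))
```

In this model only the XScale with the packed surname list gets a free second pass. Every other combination refetches everything, as the text describes.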

Joe Coder is faced with such dilemmas every day, and the decisions he makes have a huge impact on how your Pocket PC performs. Maybe Joe Coder decides that an inefficient memory layout is the best way to go, since the code might be easier to read and maintain - or because it has to be compatible with other versions of the software which run on other platforms with other hardware constraints."

Bottom Line

"The main problem with slow XScales has nothing to do with XScale (which is based upon ARM v5) 'emulating' StrongARM code (which is ARM v4), any more than you would say a Pentium 4 'emulates' a Pentium 3 when running Windows XP.

And it is NOT a question of simply 'optimizing' Windows CE for XScale. Of course it might give you code that is a few percent faster - but it's not worth the trouble of going through the entire Windows CE source code and checking where we could reorganise structures or access patterns to make better use of the 32 KB data cache on the XScale. We would probably end up with a highly unstable version of Windows CE where no one knew the full implications of all the changes they made.

Unless we get a faster and/or wider memory bus, we can increase the internal speed of the CPU to the speed of light (and it would probably be blazingly fast at calculating prime numbers or something) - but our real world applications would not really see the difference. As for purchase decisions, it very much depends on what you want your Pocket PC to do.

If you want to spend most of your time doing stuff that involves shuffling lots of memory around (typical uses are graphics, multimedia, music and some games) you might find that a 300 MHz XScale gives you just as much bang for the buck as a 400 MHz. But please note that this will change from application to application. Sometimes you can blame Joe Coder, but at other times the datasets are just too big to fit in any cache."

The Horizon

"The most exciting news with the launch of the XScale family was an extension called Wireless MMX, which lets the code perform operations commonly used in multimedia processing on several data units simultaneously. Right now there are few (if any) tools available to the developer community to take advantage of this extension. But Intel's upcoming C/C++ compiler for XScale (currently in beta) includes functionality to access Wireless MMX from high-level C/C++ code without resorting to assembler."
 
  #2  
Old 12-06-2002, 11:48 AM
MacBirdie
Ponderer
Join Date: Jun 2003
Posts: 64

A 200MHz bus would do I think. But for now it's overclocking time :twisted: :twisted:
 
  #3  
Old 12-06-2002, 01:06 PM
Mr. Anonymous
Ponderer
Join Date: Jul 2003
Posts: 95

Great article!

One question: would increasing the bus speed increase power usage? I'd think it would...
 
  #4  
Old 12-06-2002, 01:40 PM
Oliver T
Ponderer
Join Date: Jun 2002
Posts: 56

And even IF you get a faster bus to your memory, all access to your CF or SD card is going via a serial interface which means you have to start shuffling around even more (SD or CF -> main memory, then access it there).
 
  #5  
Old 12-06-2002, 01:56 PM
enemy2k2
Theorist
Join Date: Apr 2004
Posts: 268

This was an EXCELLENT article, and just what I would like to see from this site! Thank you Andy! This is also the reason I placed an order for the 300 MHz Axim rather than the 400, even though the clock difference is substantial. A 33% speed increase is unlikely; even half that would be pushing it. I'm still estimating about 10% at most. If anyone thinks I'm wrong I wouldn't mind knowing why. These are just my estimates. Hopefully the next iteration of XScale, or whatever CPU is in fashion at that time, will make provisions for a lower-voltage, higher-speed bus. DDR would be mandatory I should think. A 16 bit bus is pathetic; that's the first thing that needs to be addressed. Second would be speed: the bus should run at least half as fast as the slowest popular processor, or even better, half as fast as the fastest.
 
  #6  
Old 12-06-2002, 02:00 PM
sponge
Philosopher
Join Date: Jul 2003
Posts: 541

It's nice to see someone with a definitive, confident, and extended answer, because no one knows what the real problem is. Many kudos out to Amazing Games/Sven for really getting into this.

With that said, I've only read a little about Wireless MMX. On my past computers, I've found that most of these optimizations really don't do much, with the exception of the SSE in the P4s. Since Pocket PCs are relatively low powered devices, are we actually going to see improvements in performance with Wireless MMX programs?
 
  #7  
Old 12-06-2002, 02:17 PM
jtallon
Neophyte
Join Date: Apr 2002
Posts: 5

Here's a stupid question - why DON'T we have faster bus speeds on the new Pocket PCs? One would think Intel would have a recommended chipset to support their XScale processor, and that the recommended chipset would include a bus fast enough to make the processor look good...
 
  #8  
Old 12-06-2002, 02:20 PM
ChezDoodles
Pupil
Join Date: Jul 2003
Posts: 12
Why not a faster bus?

The memory bus controller is integrated into the XScale and locked to 100 MHz. Nothing we can do about it. However, the controller supports both 16 bit and 32 bit memory.
 
  #9  
Old 12-06-2002, 03:18 PM
Bob Anderson
Thinker
Join Date: Feb 2004
Posts: 338

Andy, thanks for taking the time to get "the" answer from Sven! I'm so pleased that we got a great deal of detail on the subject; it is very helpful.

What I'm struggling with is why Intel / Microsoft would support a transition to a new processor (XScale), with the millions of dollars of R&D both companies undoubtedly invested, if, in the end... nothing will be different. Or said another way, why spend all that money for only mediocre performance :?:

I don't like to be seen as a Microsoft or Intel "crusader", but honestly, from a business perspective it just seems like the companies were totally wasting their money - if there isn't some type of benefit (maybe that's what you can do when you have spare billions lying around :wink: ). Which leads me to my next comment - I know the article stated that "optimizing CE" isn't "the answer" and I agree... but what about a new Pocket PC operating system, say, CE4, that was probably built with XScale in mind?

And my final thought on this is... in order to gain full use of XScale processors, are we going to have to lose backward compatibility with StrongARM? Is the potential next version of the Pocket PC OS going to have to be so different that, while it *may* run faster, we'll lose compatibility with existing apps?
 
  #10  
Old 12-06-2002, 03:33 PM
Pony99CA
Swami
Join Date: May 2004
Posts: 4,396
Overclocking

Quote:
Originally Posted by MacBirdie
A 200MHz bus would do I think. But for now it's overclocking time :twisted: :twisted:
Will overclocking make the bus perform faster, too? (Hey, I'm a software guy, not a hardware guy. :-))

Steve
 