
View Full Version : Christmas Wish: A New Bus For My XScale


Andy Sjostrom
12-06-2002, 11:15 AM
As we head into the Holiday Season I enjoy seeing more Pocket PC models based on the XScale processor making their entry into the market. But the new models remind me of the XScale discussions we had this summer, and I can't help re-visiting...

...Jason's post "XScale and the Pocket PC – what's going on?" (http://www.pocketpcthoughts.com/forums/viewtopic.php?t=1772), in which Ed Suwanjindar, from the Microsoft Mobile Devices group, responded to Jason's questions, and Chris De Herrera's article "Improving the Speed of XScale" (http://www.cewindows.net/commentary/xscale_speed.htm), in which Chris lists some recommendations on what Microsoft, Intel and hardware manufacturers should change.

I still felt I needed more details to understand what is really going on, and turned to Sven Myhre, CEO of Amazing Games (http://amazinggames.com/). Sven is an extremely talented coder, artist, and 3D modeller, and I asked him the bottom-line question: "Why is the XScale at 400 MHz sometimes faster, sometimes slower, and often exactly the same as a 206 MHz StrongARM CPU?" Read on to find out why I wish not for a faster processor or an optimized operating system, but a new bus!

(Image: http://www.pocketpcthoughts.com/images/bus_01.jpg)

Andy Sjostrom: "Why is the XScale at 400 MHz sometimes faster, sometimes slower, and often exactly the same as a 206 MHz StrongARM CPU?"

Sven Myhre: "In theory, a 400 MHz XScale will always be faster than a 206 MHz StrongARM. In real life, however, CPU performance depends on a lot more than raw MHz. The CPU needs to do something useful with all its raw speed - and that means we need to feed it with code instructions and data to process, and we need to make sure the result of all its processing is stored. This is where the memory bus comes into the picture.

XScale and StrongARM Pocket PC designs use a 16-bit memory bus. Both CPU families are 32-bit RISC processors that use 32-bit code instructions as well as (usually) 32-bit chunks of data. Just to feed the CPU with enough code instructions to keep it running at full speed, we really need a memory bus running at twice the speed of the CPU, since the bus needs to transfer two 16-bit chunks to feed the CPU one 32-bit code instruction. And since we want the CPU to live a meaningful life, we also need the memory bus to transfer some data back and forth between the CPU and memory. For most applications, increasing the memory bus speed by another 25% should pretty much cover normal data traffic.

So, to keep our CPU running at full speed, our beloved Pocket PC should have a bus that runs at least 2.5 times as fast as the CPU. A 400 MHz XScale should have a 1000 MHz bus, a 300 MHz XScale should have a 750 MHz bus, a 200 MHz XScale should have a 500 MHz bus, and a 206 MHz StrongARM should have a 515 MHz bus..."

Reality Is Different

Sven continues. "In reality, the speed relationship between the CPU and the memory bus is the opposite of what we just described. XScale Pocket PCs running at 400 MHz, 300 MHz and 200 MHz all use a 100 MHz bus, and the 206 MHz StrongARM uses a 103 MHz bus.
I guess you just spotted the main bottleneck, and the reason why XScales running at 400 MHz, 300 MHz and 200 MHz get almost identical benchmark results for tests that involve shuffling memory around - typical applications are graphics and multimedia. (Note: some Pocket PCs incorporate graphics accelerators that might confuse this picture a little bit.) And since the StrongARM uses a bus that is 3% faster than the bus used in the XScale, we also find a logical explanation for why StrongARM-based Pocket PCs are sometimes slightly faster than XScale-based Pocket PCs in some tests."

Now It Gets Complicated

"Hardware designers knew they had to come up with a way to feed the CPU all the code instructions and data it needs, faster than the slow memory bus can provide them. So they added a cache to the CPU. The StrongARM has 8 KB of code cache and 8 KB of data cache, while the XScale uses a 32 KB code cache and a 32 KB data cache. Whenever the CPU tries to load a code instruction or a chunk of data, it will first search its cache to see if it is already loaded. If it finds it in the cache, it can access the code instruction or data chunk at full speed. Bang! Your 400 MHz XScale roars, and will chew up code instructions at a blazing speed - 400 million of them per second.

But what happens if the information the CPU is looking for is not found in the cache? This brings us to the flipside of the cache - it becomes a double-edged sword that turns around and hits you hard if you don't pay attention as a coder. Since the CPU needs to search the cache very quickly, the cache is organized into what we call cache lines. The cache line is in fact the smallest unit that can be read from memory under normal conditions. On both XScale and StrongARM, the cache line happens to be 16 words (in the realm of the ARM architecture, a word equals 32 bits, or 4 bytes). So a cache line is 64 bytes, and even if the CPU just needs to access a single byte, it still has to read 64 bytes from memory to fill an entire cache line before returning with the single byte."

A Real World Example

"Joe Coder decides to make the world's best PIM. He needs to store records (or structures) for all his contacts - and Joe Coder is popular, so he has 1000 contacts. For each contact he needs to store a first name, surname and phone number, so he sets aside 64 bytes to store each contact. Then he wants to sort the contacts by surname and present them nicely on the screen. For each contact he probably just needs to read the first few letters of the surname in order to sort them correctly.

The problem is that even if Joe Coder just reads a few bytes from each contact record, the CPU will read 64 bytes from memory into the cache every time he accesses a new surname. And if Joe Coder was a lazy coder, he might not have bothered to check that each record was aligned on a 64-byte address - so a surname might actually span two cache lines, meaning the CPU will read 128 bytes for every access to a new surname. But even if we assume he did his homework and aligned the memory correctly, a StrongARM will have used all of its data cache after reading just 128 surnames (8192 bytes / 64 bytes = 128 cache lines). An XScale would be able to fit 512 surnames (32768 bytes / 64 bytes = 512 cache lines) before it had to start writing over previously read cache lines. But Joe Coder needed to read through the entire list of 1000 contacts before starting over again - so neither the StrongARM nor the XScale would be able to use its cache to its advantage.
All Joe Coder wanted was to read 4 bytes from each surname, for a total of 4000 bytes. But the CPU ended up transferring a total of 64000 bytes from memory to the cache. A 206 MHz StrongARM would have spent 64000 cycles waiting, while a 400 MHz XScale would have spent 128000 cycles waiting. The deciding factor was the 103 MHz vs. 100 MHz bus, and the StrongARM would have been slightly faster.

Joe Coder made the cache design work against him. He forgot that a cycle is a terrible thing to waste. If Joe Coder had been clever, he might have reorganized his data structures. By storing all the surnames in a separate list, he could have made the cache work for him instead. Let us say he thinks 16 bytes are enough for a good surname, so 4 surnames would fit sequentially in a cache line (64 bytes). He would still have the penalty of waiting for the cache lines to fill up when he reads the first surname, but when he reads surnames 2, 3 and 4, they would already be present in the cache and he could read them at full speed. So this time around, the CPU ends up transferring just 16000 bytes in total. And, if Joe Coder was lucky enough to own a 400 MHz XScale, they would all still be present in his cache when he finished - so he could go over them again, and this time they could all be accessed at full speed. Poor Joe Coder, however: he owns a StrongARM, so he still could not fit everything in the cache, and a second run through them would take the same amount of time. (A C sketch of the two layouts appears at the end of this article.)

Joe Coder is faced with such dilemmas every day, and the decisions he makes have a huge impact on how your Pocket PC performs. Maybe Joe Coder decides that an inefficient memory layout is the best way to go, since the code might be easier to read and maintain, or because it has to be compatible with other versions of the software which run on other platforms with other hardware constraints."

Bottom Line

"The main problem with slow XScales has nothing to do with the XScale (which is based upon ARM v5) "emulating" StrongARM code (which is ARM v4), any more than you would say a Pentium 4 "emulates" a Pentium III when running Windows XP.

And it is NOT a question of simply "optimizing" Windows CE for XScale. Of course that might give you code that is a few percent faster, but it's not worth the trouble to go through the entire Windows CE source code and check where we could reorganise structures or access patterns to make better use of the 32 KB data cache on the XScale. We would probably end up with a highly unstable version of Windows CE where no one knew the full implications of all the changes they had made.

Unless we get a faster and/or wider memory bus, we can increase the internal speed of the CPU to the speed of light (and it would probably be blazingly fast at calculating prime numbers or something), but our real-world applications would not really see the difference. As for purchase decisions, it is very much up to what you want your Pocket PC to do.
If you want to spend most of your time doing stuff that involves shuffling lots of memory around (typical uses are graphics, multimedia, music and some games), you might find that a 300 MHz XScale gives you just as much bang for the buck as a 400 MHz one. But please note that this will change from application to application. Sometimes you can blame Joe Coder, but at other times the datasets are just too big to fit in any cache."

The Horizon

"The most exciting news with the launch of the XScale family was an extension called Wireless MMX, which lets code perform operations commonly used in multimedia processing on several data units simultaneously. Right now there are few (if any) tools available to the developer community to take advantage of this extension. But Intel's upcoming C/C++ compiler (currently in beta) for XScale includes functionality to access Wireless MMX from high-level C/C++ code without resorting to assembler."
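
To make the Joe Coder example concrete, here is a minimal C sketch of the two data layouts Sven describes (the field names and sizes are illustrative, not taken from any real PIM):

#include <string.h>

#define NUM_CONTACTS 1000

/* Layout 1: one 64-byte record per contact. Reading just the first
   bytes of each surname still pulls a whole 64-byte cache line over
   the 16-bit bus: 64,000 bytes of traffic to compare 4,000 bytes. */
struct contact {
    char first_name[24];
    char surname[24];
    char phone[16];
};                          /* 24 + 24 + 16 = 64 bytes = 1 cache line */

struct contact book[NUM_CONTACTS];

/* Layout 2: surnames packed into their own array. Four 16-byte names
   share each 64-byte cache line, so one bus transfer now serves four
   comparisons, and total traffic drops to 16,000 bytes. */
char surnames[NUM_CONTACTS][16];

/* Compare the first 4 bytes of two surnames during a sort. With
   layout 2, roughly three of every four calls hit the cache. */
int surname_cmp(int a, int b)
{
    return memcmp(surnames[a], surnames[b], 4);
}

The sorting algorithm itself is unchanged; only the layout of the data being compared differs.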

MacBirdie
12-06-2002, 11:48 AM
A 200MHz bus would do I think. But for now it's overclocking time :twisted: :twisted:

Mr. Anonymous
12-06-2002, 01:06 PM
Great article!

One question: would increasing the bus speed increase power usage? I'd think it would...

Oliver T
12-06-2002, 01:40 PM
And even IF you get a faster bus to your memory, all access to your CF or SD card goes via a serial interface, which means you have to start shuffling things around even more (SD or CF -> main memory, then access it there).

enemy2k2
12-06-2002, 01:56 PM
This was an EXCELLENT article, and just what I would like to see from this site! Thank you Andy! This is also the reason I placed an order for the 300 MHz Axim rather than the 400, even though the clock difference is substantial. A 33% speed increase is unlikely; even half that would be pushing it. I'm still estimating about 10% at most. If anyone thinks I'm wrong I wouldn't mind knowing why. These are just my estimates. Hopefully the next iteration of XScale, or whatever CPU is in fashion at that time, will make provisions for a lower-voltage, higher-speed bus. DDR would be mandatory, I should think. A 16-bit bus is pathetic; that's the first thing that needs to be addressed. Second would be speed: it should be at least half the speed of the slowest popular processor, or better yet, half the fastest.

sponge
12-06-2002, 02:00 PM
It's nice to see someone with a definitive, confident, and extended answer, because no one else seems to know what the real problem is. Many kudos out to Amazing Games/Sven for really getting into this.

With that said, I've only read a little about Wireless MMX. On my past computers, I've found that most of these optimizations really don't do much, with the exception of SSE in the P4s. Since Pocket PCs are relatively low-powered devices, are we actually going to see improvements in performance with Wireless MMX programs?

jtallon
12-06-2002, 02:17 PM
Here's a stupid question - why DON'T we have faster bus speeds on the new Pocket PCs? One would think Intel would have a recommended chipset to support their XScale processor, and that the recommended chipset would include a bus fast enough to make the processor look good...

ChezDoodles
12-06-2002, 02:20 PM
The memory bus controller is integrated into the XScale and locked to 100 MHz. Nothing we can do about it. However, the controller supports both 16 bit and 32 bit memory.

Bob Anderson
12-06-2002, 03:18 PM
Andy, thanks for taking the time to get "the" answer from Sven! I'm so pleased that we got a great deal of detail on the subject; it is very helpful.

What I'm struggling with is why Intel / Microsoft would support a transition to a new processor (XScale), with the millions of dollars of R&D both companies undoubtedly invested, if, in the end... nothing will be different. Or said another way, why spend all that money for only mediocre performance :?:

I don't like to be seen as a Microsoft or Intel "crusader", but honestly, from a business perspective it just seems like the companies were totally wasting their money if there isn't some type of benefit (maybe that's what you can do when you have spare billions laying around :wink: ). Which leads me to my next comment - I know the article stated that "optimizing CE" isn't "the answer", and I agree... but what about a new Pocket PC operating system, say, a CE4 that was built with XScale in mind?

And my final thought on this is... in order to gain full use of XScale processors, are we going to have to lose backward compatibility with StrongARM? Is the potential next version of the Pocket PC OS going to have to be so different that, while it *may* run faster, we'll lose compatibility with existing apps?

Pony99CA
12-06-2002, 03:33 PM
A 200MHz bus would do I think. But for now it's overclocking time :twisted: :twisted:
Will overclocking make the bus perform faster, too? (Hey, I'm a software guy, not a hardware guy. :-))

Steve

PhatCohiba
12-06-2002, 03:35 PM
When they rebuild the bus to support faster speeds, how about adding native USB host support rather than just USB client?

Then it would be able to use all those kewl USB peripherals.

-John

Yuta
12-06-2002, 03:59 PM
The memory bus controller is integrated into the XScale and locked to 100 MHz. Nothing we can do about it. However, the controller supports both 16 bit and 32 bit memory.

That's what I was trying to tell people. It's the OEMs' fault!
http://discussion.brighthand.com/showthread.php?s=&threadid=64544

mookie123
12-06-2002, 04:07 PM
This is what the Linux maintainers do to handle the XScale situation.
Apparently there is some sense in dual builds, contrary to the general opinion shaped by Microsoft PR.
----------------------

The xscale is instruction-set compatible with strongarm, but some things that perform well on strongarm will perform badly on the xscale. In particular, multi-word load and store instructions run slowly on xscale. These are used in the standard procedure call prolog and epilog, so all user-mode applications will have to be recompiled in order to perform well on xscale. Both Debian and Handhelds.org have plans to do this for their respective distributions. It is possible that xscale binaries will perform well in both time and space and so we can standardize on binaries optimized for xscale. If not, we will perform dual builds in order to support both strongarm and xscale.

http://www.handhelds.org/projects/h3900.html

Sven Johannsen
12-06-2002, 04:07 PM
First, I'm not that Sven :) All of you who are wondering why the increased processor speed came without an associated bus increase: it is a matter of technology and cost. Take a look at the 1.5 GHz machine sitting on your desk. What is the bus speed on that? 400 MHz? Based on the article, shouldn't it be 4+ GHz? And if you have one of the high-speed bus systems (800 MHz), have you bought memory for it?

There is a constant battle between what could technically be produced, what it can be produced for ($), and what they think they can sell. There are also issues beyond the normal computational considerations when you start getting to these higher frequencies. A 2.4 GHz bus would require a heck of a lot of engineering to ensure you don't kill some guy with a pacemaker sitting next to you on the metro. The stuff we are running PCs at used to be radio, radar, and microwave, you know. Still is, actually.

jngold_me
12-06-2002, 04:09 PM
Guys,

Didn't Asus modify the A600 (Zayo) motherboard so as to make it the fastest PPC out there? I know they probably didn't modify the system bus speed. I think it was something like modifying the speed of memory transfers or something like that.

ChezDoodles
12-06-2002, 04:44 PM
The xscale is instruction-set compatible with strongarm, but some things that perform well on strongarm will perform badly on the xscale...
Sure, there are variations between StrongARM and XScale in how to optimize for maximum performance. Just like the compiler for your desktop PC lets you target 386, 486, 586, etc. for max performance. My point was that these are optimizations that yield performance differences in terms of percentages - but the REAL bottleneck will be related to the bus and the cache no matter what instructions you use.

>Sven

mookie123
12-06-2002, 05:08 PM
The xscale is instruction-set compatible with strongarm, but some things that perform well on strongarm will perform badly on the xscale...
Sure, there are variations between StrongARM and XScale in how to optimize for maximum performance. Just like the compiler for your desktop PC lets you target 386, 486, 586, etc. for max performance. My point was that these are optimizations that yield performance differences in terms of percentages - but the REAL bottleneck will be related to the bus and the cache no matter what instructions you use.

>Sven

Yes, of course Intel botched the XScale design, but in the meantime Microsoft's total refusal to lift a finger optimizing CE is quite dubious too. I would speculate that the performance gain is bigger than the minute advantage they keep citing. But that's my speculation.

It would be very interesting to watch how much of a gain the Linux build for XScale shows.

GO-TRIBE
12-06-2002, 05:31 PM
Yes, of course Intel botched the XScale design, but in the meantime Microsoft's total refusal to lift a finger optimizing CE is quite dubious too. I would speculate that the performance gain is bigger than the minute advantage they keep citing. But that's my speculation.

It would be very interesting to watch how much of a gain the Linux build for XScale shows.
I disagree. Microsoft is keeping the PPC on standard ARM4 code just as they always said they would. This is good: 1.) Because it makes it easier for developers to write apps, knowing there is one instruction set and one version of PPC. 2.) It allows other companies to make processors for use in PPCs, which gives us choices. Hey, if AMD can produce a PPC with a 300 MHz processor and a 32-bit 333 MHz bus, it would be high performance.

mmidgley
12-06-2002, 05:38 PM
Andy Sjostrom wrote:
> "...he still could not fit everything in the cache"

Of course fitting everything into a data cache is a coder's dream (or a memory system designer's dream), but never a reality (unless you are emulating an Atari 2600 game with a 2k image...). The average coder shouldn't have to be concerned with the cache size either.

jtallon wrote:
> One would think Intel would have a recommended chipset to support their Xscale processor, and that the recommended chipset would include a bus fast enough to make the processor look good...

I agree with that. They should have done some research and testing to determine whether their R&D efforts should have gone into that 100 MHz built-in controller. It probably should have been 150 or 200 MHz so the XScale would not starve.

Bob Anderson wrote:
> but what about a new Pocket PC operating system

Since Microsoft decided to go with ARM, they should certainly make it ARM-efficient, but I'd question, from a business standpoint, putting that much effort into a strong XScale bias... But I could be wrong. Does Microsoft do that today with Intel vs. AMD for desktops? I'm wondering about future competitors to the XScale.

Sven wrote:
> Take a look at the 1.5 GHz machine sitting on your desk. What is the bus speed on that? 400 MHz? Based on the article, shouldn't it be 4+ GHz?

The article did not say that was a general formula for all computers. That was a specific analysis based on an XScale being fed 16-bit chunks. Your desktop doesn't have that limitation. (But yes, a bus speed like that on the desktop would make for a killer machine!)

ChezDoodles explained:
> the REAL bottleneck will be related to the bus and the cache no matter what instructions you use

Bus, Cache, and RAM. That's the bottleneck. Coders working at very low levels can manually optimize for a given hardware design, but that makes for expensive code ($$). A well designed compiler/optimizer (which should come from Microsoft/Intel) would keep most coders from having to do this, thus keeping cost down and performance up.

I look forward to when Microsoft, CPU producers, and OEMs combine to make their hardware and compiler work well together instead of just working.


m.

William Yeung
12-06-2002, 05:44 PM
I found a very good reason after some research.
Don't forget some facts:
1. XScale is Intel's own stuff, like MMX
2. ARM is a standard owned by a British company
3. Samsung, TI and many other companies are creating ARM CPUs now, and some of them even have production implementations in Pocket PCs (e.g. HP Jornada 928)

Kati Compton
12-06-2002, 05:47 PM
First, I'm not that Sven :) All of you who are wondering why the increased processor speed came without an associated bus increase: it is a matter of technology and cost. Take a look at the 1.5 GHz machine sitting on your desk. What is the bus speed on that? 400 MHz? Based on the article, shouldn't it be 4+ GHz? And if you have one of the high-speed bus systems (800 MHz), have you bought memory for it?

There is a constant battle between what could technically be produced, what it can be produced for ($), and what they think they can sell. There are also issues beyond the normal computational considerations when you start getting to these higher frequencies. A 2.4 GHz bus would require a heck of a lot of engineering to ensure you don't kill some guy with a pacemaker sitting next to you on the metro. The stuff we are running PCs at used to be radio, radar, and microwave, you know. Still is, actually.

The other issue is that part of the speed of a processor is based on very tiny wires communicating very short distances within the chip. Once you need to go OFF chip, that's when things will always be slower than on-chip communication. The bits have to travel a lot further, and it's more difficult to keep 0's at 0 and 1's at 1 strong enough to register as such over off-chip distances than on-chip distances. That's a big reason why access to on-chip cache is so much faster than off-chip (main) memory.

mookie123
12-06-2002, 06:01 PM
.....
It would be very interesting to watch how much of a gain the Linux build for XScale shows.
I disagree. Microsoft is keeping the PPC on standard ARM4 code just as they always said they would. This is good: 1.) Because it makes it easier for developers to write apps, knowing there is one instruction set and one version of PPC. 2.) It allows other companies to make processors for use in PPCs, which gives us choices. Hey, if AMD can produce a PPC with a 300 MHz processor and a 32-bit 333 MHz bus, it would be high performance.

1. There is no reason to think that optimizing the OS/compiler to compensate for the XScale memory design will break application compatibility. Certainly not in Linux; we are not talking about XScale extensions beyond ARM here. For example, the PocketTV team is working hard to squeeze out gains as small as 5-10%; Microsoft surely can put in the same effort for the same performance gain percentage. People have also been known to pay good money for 3-5% more performance in desktop chips. So I am not sure what to make of this dubious Microsoft excuse of "not enough gain for the sweat".

2. That's a hypothetical future on a two-year horizon, irrelevant to fixing the OS to compensate for the XScale memory design shortcoming. (Even adopting the Intel media extensions could fit comfortably within that time frame. Insisting that XScale be able to run the exact same OS as the 1.2M units of StrongARM PDAs is a pretty weak excuse in my opinion. There are more XScale PDAs by now. So what if programs run slower on the old iPAQ?)

enemy2k2
12-06-2002, 06:08 PM
First, I'm not that Sven :) All of you who are wondering why the increased processor speed came without an associated bus increase: it is a matter of technology and cost. Take a look at the 1.5 GHz machine sitting on your desk. What is the bus speed on that? 400 MHz? Based on the article, shouldn't it be 4+ GHz? And if you have one of the high-speed bus systems (800 MHz), have you bought memory for it?

x86 processors are CISC whereas ARM processors are RISC; the latter require higher memory bandwidth because they execute more operations in comparison.

Jason Dunn
12-06-2002, 06:08 PM
Yes, of course Intel botched the XScale design, but in the meantime Microsoft's total refusal to lift a finger optimizing CE is quite dubious too. I would speculate that the performance gain is bigger than the minute advantage they keep citing. But that's my speculation.

Mookie, I know that by nature you're a very distrusting person, but doesn't Sven seem like an intelligent person? Don't you think that if he says OS optimizations wouldn't have a big impact, perhaps he's right?

Blame Microsoft for what they deserve to be blamed for, not just out of sheer anti-MS loathing.

Jason Dunn
12-06-2002, 06:18 PM
1. There is no reason to think that optimizing the OS/compiler to compensate for the XScale memory design will break application compatibility.

No, but it will break the market. Think about it. Back in the first generation of Pocket PCs, it was very confusing for consumers - which software bundle do I buy or download? SH3? MIPS? ARM? It was made worse by developers who insisted on selling their applications based on the CPU type rather than bundling it all together and letting the installer figure things out. Now you're suggesting that going back to that place, where developers have an ARM version and an XScale version, would be a good thing? 8O

For example, the PocketTV team is working hard to squeeze out gains as small as 5-10%; Microsoft surely can put in the same effort for the same performance gain percentage. People have also been known to pay good money for 3-5% more performance in desktop chips. So I am not sure what to make of this dubious Microsoft excuse of "not enough gain for the sweat".

I for one am glad that you're not in charge of anything at Microsoft. :lol: Look, it comes down to this: Microsoft has to decide between optimizing for XScale or adding in the features and fixing the bugs that we've been asking for. So which do you want? You can't have both. Do you want an OS that's 5% faster but exactly the same as what we have now, or a new OS in 2003 that is actually BETTER? I want the new OS!

Insisting that XScale be able to run the exact same OS as the 1.2M units of StrongARM PDAs is a pretty weak excuse in my opinion. There are more XScale PDAs by now. So what if programs run slower on the old iPAQ?)

Really? Point me to the place that shows the statistics that prove this, please. I'm quite confident in saying that there are still more StrongARM-based units on the market than XScale at this time.

mookie123
12-06-2002, 06:20 PM
Mookie, I know that by nature you're a very distrusting person, but doesn't Sven seem like an intelligent person? Don't you think that if he says OS optimizations wouldn't have a big impact, perhaps he's right?

Blame Microsoft for what they deserve to be blamed for, not just out of sheer anti-MS loathing.

Didn't I post what the Linux maintainers for the iPAQ decided to do about handling the XScale issues? Unless you think they have no clue and Microsoft can do no wrong, I would like to see their performance gain first, thanks, before deciding to trust Microsoft's blurb about "no such gain".

Now then, do you have any other technical information to add for the public besides "let's trust Microsoft PR"?

enemy2k2
12-06-2002, 06:31 PM
Company policy can change any time, so it's not so much a matter of trust as it is business sense. Jason (owner of this site) once mentioned that optimizing for ARM5 would be a good thing but not for XScale; after hearing his argument I tend to agree. There are a lot more processor manufacturers out there willing to give us more for our money than Intel. Let them battle it out; in the end it's us who benefit.

Intel made a very silly decision with its low-speed bus - took it back a step, 3 million steps a second rather :lol: They should have designed it to run at least half the speed the processor runs at. I for one am really interested in what XScale-optimized Linux can achieve as opposed to the StrongARM version. That's the thing about having the source: you can compile it however you wish. Hopefully we'll have results on that soon.

Samsung recently had an announcement about an ARM5 processor of theirs reaching 1.2 GHz; wonder what the power draw would be like though 8O I hope there's a way to hack the XScales to step them down to whatever speed we wish and up to whatever speeds they can achieve. I am willing to bet that any 300 MHz CPU can do 400 MHz, or even beyond - hopefully someone enterprising enough can figure out a way to do this. It would definitely come at the detriment of power draw though.

Jason Dunn
12-06-2002, 06:40 PM
:onfire:

Didn't I post what the Linux maintainers for the iPAQ decided to do about handling the XScale issues? Unless you think they have no clue and Microsoft can do no wrong, I would like to see their performance gain first, thanks, before deciding to trust Microsoft's blurb about "no such gain".

I agree, it will be interesting to see what the Linux guys can do to optimize for XScale, but the question is: just because they can do it on Linux, can it be done with Windows CE? I don't have the developer knowledge needed to answer that, but you're assuming that it's the case. Any reason why that is? Oh, I know, because you're a developer and have a deep knowledge of coding for both Linux and Windows CE, right? :roll:

Now then, do you have any other technical information to add for the public besides "let's trust Microsoft PR"?

No, I don't. No offense, but I'm far more inclined to trust Sven's technical explanation, and his opinion that software optimization will result in only small gains, than I am to trust your rants on the subject.

If you have an equally deep technical explanation for why software optimizations ARE the solution, by all means, share it with us in the forums. And while you're at it, please share your background and education so we know that you have the knowledge to back up your wild claims. :roll:

I really didn't completely trust Microsoft's PR response, because it had no technical details, but after doing some research on this, it seemed evident to me that this was more about hardware than software. After reading Sven's response, that view was reinforced.

I'm just really tired of your senseless rants in the forums - it's one thing to state your opinion, but you seem to think opinion = fact, and you offer nothing to back them up.

mookie123
12-06-2002, 06:56 PM
Can Sven address what the Linux team is doing regarding optimization, and whether the same strategy could be implemented in CE? Thanks.

PS. Jason, loosen up, eh? (Same advice as you gave me.) I am sure MS won't kill you if it turns out optimization is possible and not negligible. I am just inquiring; if I am wrong, then I am wrong. As far as the public is concerned, you are in no position to make technical evaluations either (by the same criteria you apply to me).

lgingell
12-06-2002, 07:05 PM
*Great Article*!

TomB
12-06-2002, 07:54 PM
Sven, Andy and Jason, OUTSTANDING PRESENTATION! Let me say that if Derek Brown from Microsoft had presented this information months ago instead of quoting company policy, he could have deflected the resentment many of us now feel (unjustly) towards Microsoft.

Since we know Sven's background and technical expertise, it would be great for him to do a follow-up to what has been discussed here. In particular I would like to know if there is any hope for XScale users who want multimedia PDAs, or, as it looks to me, is multimedia doomed because of the 100 MHz bus? Is this something that can be fixed? Does the fix come from Intel or the OEMs (is the memory bus part of the XScale chip, or is it a component that can be replaced by OEMs)? Will the new compiler Sven mentioned from Intel actually make any difference with an app like PocketTV, considering the 100 MHz memory bus limitation (I assume this bus is where the "Memory Move" benchmark is tested)? Are there other buses in a Pocket PC, or is the "video bus" the same as the memory bus?

Now here is the question that has been bothering me for months. No OEM is producing StrongARM PPCs any more. Considering what has been said here, and assuming OEM engineering knew what they were getting into - why would any OEM in their right mind move to an inferior overall hardware solution for PPCs? BTW - in the real world, using PocketTV, there are major differences in framerates on a common QVGA 600KB 24fps test file going from a 206 SA (22-24fps) to a 400XS (22fps) to a 300XS (16fps) to a 200XS (12fps). According to Sven, the rates on XScale should be about the same because of the memory bus bottleneck...

enemy2k2
12-06-2002, 07:58 PM
Since we know Sven's background and technical expertise, it would be great for him to do a follow-up to what has been discussed here.

I second that vote! I want more :twisted:

sponge
12-06-2002, 08:25 PM
The Linux projects are simply targeting XScale in the compilers.

Wiggin
12-06-2002, 09:18 PM
Uhoh, I feel another Rant coming on... argh... trying to keep my hands off the keyboard... argh :x ... can't stop em... :grab: .... oh well, let her rip!

First, let's start on a positive!
Andy!!! Job Well done!
This is the best piece of informative content PPCT has put out in a long time. More of this type of content would be GREATLY appreciated! I love learning things :way to go:

Now, it seems to me that as I read the various comments, a few folks seem to be forgetting some fundamentals about computing. So let me cover a few of the more simple ones.

1) Revolution is MUCH more costly than Evolution!
We are where we are because of THOUSANDS of decisions made over the past 10 years by many OEMs and application programmers. You can't back up far enough to find the "source" of the speed evils PPCs suffer. It would be easy in 2002 to start with a blank piece of paper and design an awesome PPC that would dazzle and amaze everyone. But try to build a business case for the TOTAL cost of build, and try getting it to market fast enough to be relevant! Go ahead... I'll be the first to say attaboy if you can.

2) Speed is only as fast as the SLOWEST link in the chain
This is self explanatory... I hope that more detail is not necessary... if so, please get thee to the local book store and buy a Tech 101 book.

3) A machine is made up of MANY moving parts. The BEST machine has the largest number of good moving parts.
To focus on one thing in a complex array is to miss the forest for the trees, and you do so at your OWN peril.

One comment was that programmers should not have to care about bus speed or cache... I love these types of people... they keep my industry alive!! TY TY TY for being in every major company in the world. If companies got rid of all the programmers who didn't "care", they would never have to hire Tech Consultants (who "care" very much) to come in and clean things up.

Form size, Processor, Bus, RAM, ROM, Disk size, Power consumption, Heat, Video, Cache, Data Structure, Code Logic, Sort Algorithms, Error handling, Transfer rate, Cost & Price Points... the list goes on and on and on.

Getting everything right is a goal, but an extremely LOFTY goal. Compromises will always be made. They HAVE to be made… sometimes for noble reasons, sometimes for practical reasons, and occasionally for stupid reasons. To blame one decision by Intel, or MS, or HP, or blah blah blah is shortsighted and naive. Shame on you who fall prey to this behavior.

Intel, MS, and HP/Compaq are some of the largest and most successful companies in the HISTORY of mankind. So, give them credit for understanding a few things about the "BIG PICTURE" that the average consumer doesn't concern themselves with. You can disagree with their decisions and their products ALL YOU WANT. The best way to let them know you disagree is to NOT buy their products. But to bash them as stupid or shortsighted is to make yourself the fool. They will laugh at your foolishness all the way to the bank.

Ok, enough… moving the hands off the keyboard… breathing… ah… everything is normal again.. Rant Off!
:beer:

TomB
12-06-2002, 11:34 PM
Wiggin, interesting comments. As a non-tech I got lost looking at the very impressive trees Sven spoke about. Looking at the forest, I see a two-year-old 206 MHz iPaq 3650 running PocketTV at 320x240 at about 24fps and a brand new 200 MHz iPaq 1910 running the same file at about 12fps. Given that both PPCs are operating with memory buses close to 100 MHz, and XScale has a larger cache, what then is going on here? Yes, I know a PPC is the sum of all of its parts, but what is/are the weakest link(s) in XScale?

baldrik24
12-06-2002, 11:58 PM
To comment on the first paragraph:

The SA-1110 and PXA250 both support 32-bit data buses, not just 16-bit. Most PPC2002 devices use a 32-bit wide bus. I'm not sure where he's getting his information from, but it's not correct.

Seraph1024
12-06-2002, 11:59 PM
If the 100 MHz bus speed is locked into the Intel CPU, can it be unlocked using software? Otherwise, you can overclock all you want; it will not help much. Also, does JS Overclock work on XScale PPCs? I thought that was only for StrongARM processors.

my 2 cents
Lwin

Rirath
12-07-2002, 03:11 AM
Mookie, I know that by nature you're a very distrusting person, but doesn't Sven seem like an intelligent person? Don't you think that if he says OS optimizations wouldn't have a big impact, perhaps he's right?

Intelligence doesn't always make someone right. Do you have any idea how many "intelligent" people have woken up one day, slapped their foreheads and said "Wow, I didn't think of /that/ before..."?

Janak Parekh
12-07-2002, 04:52 AM
Didn't I post what the Linux maintainers for the iPAQ decided to do about handling the XScale issues? Unless you think they have no clue and Microsoft can do no wrong, I would like to see their performance gain first, thanks, before deciding to trust Microsoft's blurb about "no such gain".
Absolutely, we'll be watching. But note that --

1) On x86, targeting specific processors via gcc (or maybe even pgcc) typically gives you a 5-10% performance gain. While this is not completely insignificant, instruction optimization is by far not the biggest bottleneck for most typical application software. If you're going to use SSE-optimized code, then yes, there might be a difference; but evolutionary instruction sets are largely similar.

2) #1 is the reason why most major Linux distributions use 386-optimized code as their default target. This includes RedHat, Debian, and Mandrake. The biggest "optimizable" distribution is Gentoo, and that's largely because Gentoo compiles the entire distribution as a part of installation. Typically that's not for the average end-user.

2a) Windows 2000 and XP are targeted for Pentium and above processors. That is at least one major generation "behind", more like 2 generations IMHO. Windows 98 is optimized for a 486.

3) The biggest performance problems on the XScale are on memory operations, from the stats I've seen. This is very much a bus problem. Instruction mix optimization is going to yield you very little gain, if any.

Quite frankly, I'd rather see bus optimization than a fragmented market. Microsoft Windows 2k, XP, and most Linux distros generally agree with this principle.

Oh, and thanks so much for this article. I now have somewhere to point everyone who says "it's all MS's fault". MS really should have communicated better, but nothing is ever that simple.

--bdj

gpspassion
12-07-2002, 05:15 AM
Something I didn't see mentioned in the article is that the cache is apparently not "completely" enabled. Now, this is just hearsay and I'm in no position to judge whether it's accurate or not, but it comes from a very qualified software engineer who generally knows what he's talking about.
Apparently, fully enabling the cache on the XScale would make the system unstable, due to a design flaw that has already been corrected in newer iterations of the XScale. Anyone care to comment on this?

mookie123
12-07-2002, 09:06 AM
3) The biggest performance problems on the XScale are on memory operations, from the stats I've seen. This is very much a bus problem. Instruction mix optimization is going to yield you very little gain, if any.

Hmmm, I see several problems with this idea that if we "build a better bus" all ills will be gone.

1. Is it just possible that Intel is hitting engineering, if not cost, constraints? Sven's article seems to talk about building a "new bus" system as if it were trivial. Is it just possible that all his suggestions (higher clock, wider size) simply cannot currently be implemented?

2. Even if Intel engineers are total goofballs who don't understand chip design (unlike mister Sven), what are the real-world alternatives in the near future? Does anybody know of any Intel plan to introduce a new bus architecture?

It seems to me we are stuck with the PXA250 for a good while yet.

Now, having speculated as such, let's note that:
1. Intel and Microsoft are building a new compiler for XScale. I am not sure about you, but that seems to be a total, long-term structural change to realign everything to fit XScale hardware behavior better. This is software optimization.

2. Mister Sven himself indicates that algorithmic tricks to adapt to the XScale bus and cache at the application level can improve performance significantly. Could that happen at the kernel level too?

3. Microsoft had no hesitation in making major OS kernel alterations in versions 1, 2 and 3 (threading, memory management, etc.) without creating the alleged "market upset", especially noted during the 2-3 transition.

4. I find it amazing that a bunch of "software guys" start talking about how to design a "better chip". If it were that easy, wouldn't Intel be calling them already? Their view is "if we all drive Lamborghinis, surely we're going to get somewhere faster" (design a better bus); let's not talk about driving a shorter path or avoiding the downtown traffic jam (the bus/cache problem).

----------------
Anyway, if you ask me, we need opinions from chip engineers, kernel gurus and compiler designers about what could be done for short-term optimisation until a new hardware design is actually out.

I wouldn't trust a bunch of app writers and PR guys yapping about designing a "new bus" if they can't even tell their story straight. And I still think Microsoft is not doing enough to rectify this problem, or at least their explanation doesn't make sense to me.

ChezDoodles
12-07-2002, 09:42 AM
Sven's article seems to talk about building a "new bus" system as if it were trivial. Is it just possible that all his suggestions (higher clock, wider size) simply cannot currently be implemented?
The article does not suggest that building a new bus is trivial. And, as others have also suggested, there are other factors that blend into the issue of speed differences between XScale and StrongARM.

The purpose of the article was...
- to shed some light on how factors other than raw processor speed have a huge impact on total system speed
- to pinpoint how a lack of understanding among software developers about basic hardware design issues often leads to slow execution of their applications
- to suggest that the solution to the StrongARM vs. XScale speed issue is not something that could have been easily resolved by Microsoft or other parties with just a quick "optimization" or recompile

As I've already stated in a previous comment, we're pretty much stuck with the current bus limitation, given that the bus controller is an integrated part of both the StrongARM and XScale microprocessors.

Many developers seem to have an ignorant attitude towards hardware issues, thinking "they should not have to know about issues like cache". Such ignorance leads to software design decisions that sometimes fight against the hardware, like Joe Coder's PIM in the example.

I'm not suggesting that we should start developing specifically for XScale either. But issues like memory speed, bus bandwidth and cache lines are inherent in almost every hardware system we might target in our profession, and our software should always be optimized according to these general principles.

Regards,
Sven Myhre

mookie123
12-07-2002, 10:22 AM
The purpose of the article was...
- to shed some light on how factors other than raw processor speed have a huge impact on total system speed
- to pinpoint how a lack of understanding among software developers about basic hardware design issues often leads to slow execution of their applications
- to suggest that the solution to the StrongARM vs. XScale speed issue is not something that could have been easily resolved by Microsoft or other parties with just a quick "optimization" or recompile

As I've already stated in a previous comment, we're pretty much stuck with the current bus limitation, given that the bus controller is an integrated part of both the StrongARM and XScale microprocessors.
...

I'm not suggesting that we should start developing specifically for XScale either. But issues like memory speed, bus bandwidth and cachelines are inherent in almost every hardware system we might target in our profession, and our software should always be optimized according to these general principles.


First off, I have to apologize for all the sentiment shown, but as you know this is a particularly heated debate on the net, and I haven't been known for impeccable decorum. I suppose you'll just have to suffer through it. lol

And yes, your article has unfortunately been sucked into the debate over who to blame for some of the current XScale PPC performance woes, which apparently was not the goal of the article at all, as you mention above. But it is nonetheless an interesting subject to explore, and since you seem to be the only expert speaking, I think I might as well ask more questions:

1. Is it not the kernel's responsibility to manage hardware resources, especially things like virtual memory and memory management, and to adapt to a particular chip design's quirks?

2. How can alterations to such items not affect overall OS performance significantly, knowing that the memory designs of the SA and PXA250 are different and that PPC2K is better on the SA?

3. We also know that WinCE can be targeted to a wide range of processors. How did Microsoft do that adaptation, especially for items that relate to performance optimization? Wouldn't optimizing for XScale be just another such task? How much compatibility divergence would result from such activity if done on the PPC version?

I guess what I am ultimately getting at is: is it true that nothing can be done at the kernel and compiler level that would speed up performance globally? That seems hard to believe, given how people toil day and night tweaking other versions of OSes for exactly such goals. And the view that it all hinges on Intel doing something is terribly depressing, and I still believe it is not true.

cheers.

ChezDoodles
12-07-2002, 10:58 AM
mookie123 wrote:
Is it true that nothing can be done at the kernel and compiler level that would speed up performance globally? It seems to me that is hard to believe, given how people are toiling day and night tweaking other versions of OSes
Your questions are very much relevant :)
I personally doubt that Microsoft makes any significant changes at all between the various hardware versions of Windows CE, except adapting the memory manager, handling interrupts and other necessary low-level stuff. These are adaptations, however, not optimizations.

The question is: Would it matter if they did?

When I run my game application, the OS will normally take up less than 5% of the CPU time. Getting Microsoft to optimize that 5% (the OS) to run maybe 10% faster would yield a 0.5% overall speed improvement.

The rest of the CPU time is spent in my code. So if I can get my code to run 10% faster, it would yield a 9.5% overall speed improvement. In Joe Coder's example, he could make his app work 400% faster when sorting the surnames in his contact list just by reorganizing the data. How many developers talk about optimizing their data structures, or run their software through a performance analyser like Intel's VTune? The ARM architecture doesn't even support simple instructions like division! Every time a coder performs a division, thousands of cycles are spent in a software division routine. And floating point data must be handled by an emulator! There are a dozen ways a 400 MHz XScale speed demon can be reduced to a crippled 0.2 MHz lame brick of silicon, even without messing up the cache or abusing the memory bus.
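
To make the division point concrete, here is a minimal C sketch (the function names are illustrative only):

/* ARM v4/v5 has no divide instruction: '/' and '%' compile to calls
   into a software division routine. Division by a constant power of
   two, however, compiles to a single shift. */

/* Slow on ARM: a runtime divisor forces the library routine. */
unsigned average(unsigned sum, unsigned count)
{
    return sum / count;
}

/* Fast: choosing a power-of-two bucket count lets the compiler
   emit one shift instruction instead. */
unsigned average_of_64(unsigned sum)
{
    return sum >> 6;        /* same as sum / 64, no division call */
}

/* Floating point is emulated too, so 16.16 fixed-point keeps the
   math in fast integer instructions. */
typedef long fixed16;       /* 16 integer bits, 16 fraction bits */

fixed16 fixed_mul(fixed16 a, fixed16 b)
{
    /* A 64-bit intermediate avoids overflow before the shift. */
    return (fixed16)(((long long)a * b) >> 16);
}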

This is what developers do to your Pocket PC - not Intel, not Microsoft!

Many developers think that ignorance is bliss - but their bad habits catch up with them as the chip makers turn to more sophisticated methods of increasing speed. The 400 MHz XScale exposed Joe Coder as a bad, lazy and/or ignorant craftsman. He would have got a significant performance increase by changing the data structures even on his old StrongARM - but with the XScale, the difference was even bigger.

Writing high performance software is a difficult task. Hopefully this article and all the good comments have shed some light on our profession and will help improve our future applications. One thing is for sure: the next generation of mobile CPUs will widen the gap even further :wink:

mookie123
12-07-2002, 11:43 AM
When I run my game application, the OS will normally take up less than 5% of the CPU time. Getting Microsoft to optimize that 5% (the OS) to run maybe 10% faster would yield a 0.5% overall speed improvement.

The rest of the CPU time is spent in my code.

Wait, I don't understand this. Are you saying that your code controls every detail of bit movement within the hardware?

My understanding is, your high-level code is "compiled", and then this compiled code is executed by the hardware in tandem with OS control.

So even if you have an excellent algorithm, if the compiler is ugly and not optimized very well for the target CPU, it will still run very slowly.

And can you explain how "percentage of time" translates to memory transfer path efficiency? In the 95% you are talking about, how much control does the programmer have over how and when each bit is supposed to flow? I thought a lot of it, especially things like graphics, has been relegated automatically to the OS - things described in the API/library. Also, how does the OS's virtual memory management come into play?

Do you have absolute total control over every bit movement in that 95% of the time? I'm starting to get this image of everybody writing in assembler or binary. heh....

Landis
12-07-2002, 01:25 PM
Great article!

I think many GAME developers have known that memory speed on Pocket PCs was the limiting factor, but I like the way you used a database access example to explain how the issue extends to general use.

BTW - in the real world, using PocketTV, there are major differences in framerates on a common QVGA 600KB 24fps test file going from a 206 SA (22-24fps) to a 400XS (22fps) to a 300XS (16fps) to a 200XS (12fps). According to Sven, the rates on XScale should be about the same because of the memory bus bottleneck...

Yes, and this is where we can see that it's not just a memory bus speed issue. It can't be. However, consider that iPAQs and Toshibas with the 400 MHz XScale had very different movie performance out of the box, given the same processor/bus speed/OS kernel/compiler. The Toshiba ROM update greatly improved video performance thanks to a new video driver (among other things?), but there are other hardware drivers and design differences that affect performance.

ChezDoodles
12-07-2002, 02:48 PM
mookie123 wrote:
are you saying that your code controls every detail of bit movement within the hardware?
Yes, that is what I'm saying :D

This is a bit outside the topic of this discussion, but I'll comment on this briefly:
The OS will set up a "sandbox" for my application and then hand over full control of the CPU and related subsystems, like memory access - as long as I stay within the limits of the "sandbox". The limits include a certain timeout after which the CPU will give control back to the OS, so it can check whether other applications with higher priority than mine are scheduled to run. Other limits include direct access to hardware resources (like COM ports), certain code instructions that are classified as restricted to the OS, and cases where my application accesses memory that is currently paged out of physical RAM. In these events, the CPU gives control to the OS so the OS can decide how to respond. But as long as I run along as a nice application, I will have full control over the CPU, cache, memory, etc.

ChezDoodles
12-07-2002, 02:56 PM
consider that iPAQs and Toshibas with the 400 MHz XScale had very different movie performance out of the box, given the same processor/bus speed/OS kernel/compiler. The Toshiba ROM update greatly improved video performance thanks to a new video driver (among other things?), but there are other hardware drivers and design differences that affect performance.

You're right - there are a lot of issues that affect overall performance. What you refer to in your example is very much due to different, dedicated 2D graphics controllers, which off-load the CPU and the memory system to various degrees in performing certain operations.

These are outside the limited scope of my article.

TomB
12-07-2002, 05:19 PM
Sven, thanks for the great discussion and information! The movie example was mine, linked to questions that have not been answered. Also - the ATI video controller is only in the e740 (one out of eight XScale PPCs) and should not be considered in a general discussion of the XScale problem.

QUESTION: Is there any hope for the performance of a 200 MHz XScale to at least match what we had on the 206 MHz StrongARM - or is the memory bus issue a critical flaw that cannot be corrected in XScale (is the memory bus on the XScale chip or a chip in itself - and is this the same as the "video bus" or "frontside bus")?

QUESTION: If the memory bus is the critical limitation and is about the same speed in both processors, why are we seeing half the framerate going from a 206 MHz SA to a 200 MHz XS on PocketTV? You have mentioned "other" bottlenecks, but it seems strange that a performance hit that large is about the same across different brand PPCs and CPU speeds.

QUESTION: If there is no hope of ever bringing XScale real-world performance in line with the StrongARM's, why is no one manufacturing StrongARM PPCs any more? Are there other hidden XScale gains, or should we be asking OEMs to go back to StrongARM? In other words, when (and can) we buy a PPC with the new displays and smaller sizes that will at least perform like the old 3650?

Kati Compton
12-07-2002, 05:36 PM
The ARM architecture doesn't even support simple instructions like division! Every time a coder performs a division, thousands of cycles are spent in a software division routine. And floating point data must be handled by an emulator! There are a dozen ways a 400 MHz XScale speed demon can be reduced to a crippled 0.2 MHz lame brick of silicon, even without messing up the cache or abusing the memory bus.


I didn't realize that there wasn't native "/" support, but it makes some sense. How would one most efficiently generate "random" integers? Normally I'd use rand() and divide by RAND_MAX or something like that to get a value in 0-1, then scale up to the range I need and cast the result to int. Not a significant amount of processing time on a normal desktop, but it sounds like it would be for PPC devices... Especially if doing it a lot, such as when "shuffling" a playlist or card deck.
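
For what it's worth, here is one divide-free way to do that - a rough sketch, assuming rand() yields 15 random bits (i.e. RAND_MAX == 0x7FFF, which is common on eVC runtimes but worth checking on your compiler):

#include <stdlib.h>

/* Sketch: pick a pseudo-random index in [0, n) with no division and
   no floating point. The multiply-and-shift replaces rand() % n,
   which would call a software modulo routine on ARM cores that have
   no divide instruction. Assumes RAND_MAX == 0x7FFF (15 random bits). */
int random_index(int n)
{
    unsigned long r = (unsigned long)rand() & 0x7FFFUL;  /* 15 random bits */
    return (int)((r * (unsigned long)n) >> 15);          /* scales into [0, n) */
}

Shuffling a playlist or a card deck would then call random_index() once per swap, with no divides at all.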

mookie123
12-07-2002, 05:58 PM
mookie123 wrote:
are you saying that your code controls every detail of bit movement within the hardware?
Yes, that is what I'm saying :D

Ah, OK - if most of the time your code doesn't even need the OS, to a certain extent, no wonder you ask for hardware improvements. Not sure why you don't start taking advantage of the full XScale feature set, though, such as the media extensions and V5 instructions - yes, with the caveat that it will only run on XScale.

But anyway, interesting - thanks. Going to read more about OS and hardware architecture. See you later.

mookie123
12-07-2002, 09:11 PM
For those who are interested (about 3/4 of the way down the page):
http://msdn.microsoft.com/chats/embedded/embedded_103102.asp

Host Guest_briankr_ms:
Q: Does 4.1 include XScale optimization?

Host Guest_briankr_ms:
A: Yes. You need to use the /xscale compiler switch. See clarm -?.

-------------
Host Guest_Guyhay_MS:
Q: Does 4.1 include XScale optimization? If not when can we expect it?

Host Guest_Guyhay_MS:
A: 4.1 does include XScale optimisations, and 4.2 will have more. CPU optimisation and performance tuning are an ongoing area. That said, I think a lot of people were expecting a clock-speed-related improvement from a 200 MHz StrongARM to a 400 MHz XScale.

mookie123
12-07-2002, 09:29 PM
Not exactly PPC, but related to PXA250
http://msdn.microsoft.com/chats/embedded/embedded_032202.asp

LarryMorris_MS
Q: Why is it that Windows CE .NET does not support ARM5TE, the core used in the PXA250, but only ARMV4I/T?

LarryMorris_MS
A: Perhaps the issue is what you mean by "support ARM5TE". We do support ARM5, including compiler flags to optimize for the ARM5 pipeline, and we support ConCan. What we do not do is recompile our OS bits specifically for ARM5, since there is negligible performance improvement by doing so. Most ARM5 perf issues come into play when you design multimedia...


-------------
LarryMorris_MS
Q: I am developing a multimedia application on the PXA250. With Windows CE .NET I am getting 1/3 the performance of what I get without the OS. Any idea what the problem could be?

LarryMorris_MS
A: I guess I would need more info on what you mean by 1/3 performance with OS. Under what scenarios. A multithreaded test will typically lose several percent performance due to the overhead of system ticks, etc. However a 70% degradation is unbelievable unless perhaps your standalone test is running purely in cache, while the OS scenario causes the cache to frequently be blown.

Yuta
12-08-2002, 01:58 AM
The memory bus controller is integrated into the XScale and locked to 100 MHz. Nothing we can do about it. However, the controller supports both 16-bit and 32-bit memory.

Can you prove that? A link would do.
:idea: If what you are saying is true, then switching the controller to 32-bit mode should solve our problems. If this is possible, and the hardware is adapted, the OEMs have a chance to solve the problem! :P :P :P

:arrow: SVEN, please post a comment on this. VERY IMPORTANT!

Landis
12-08-2002, 04:19 AM
The movie example was mine, linked to questions that have not been answered. Also - the ATI video controller is only in the e740 (one out of eight XScale PPCs) and should not be considered in a general discussion of the XScale problem.

Yes, I'm sorry. I quoted you, got sidetracked, ran out of time, and then didn't finish addressing the question. :oops:

TomB, your point about very different performance at various core clock speeds is very relevant and perplexing. The bus speed is independent of the core clock and is fixed at 100 MHz (99.5 actually). So if the bus speed is the limiting factor, then movie performance should be nearly identical when the core is adjusted from 200 to 400 MHz. I don't know where you got your numbers, but I've seen somewhat similar results reported when core speed is adjusted using the power-saving control.

As Sven mentioned, coders like the PocketTV team are working very "close to the metal", below the OS level. The only thing left below that is the basic chip drivers - the equivalent of the motherboard BIOS on PCs. There could be some sort of memory-timing issue at this level that is closely associated with core speed.

I suspect that nobody, at least not outside Intel, knows exactly where the problem lies. Intel isn't likely to say until, or unless, they have a solution.

TomB
12-08-2002, 06:49 AM
Landis, don't worry about the quote - I am most interested in answers to the questions in that post. As for backup of the results I mentioned, they come from personal tests on all of the PPCs in question, starting in June at PC Expo and following up on the latest releases from Toshiba and HP. As for the XScale mess, it is becoming clear that responsibility rests equally with the OEMs for migrating from StrongARM in the first place, Intel for releasing the chip prematurely, and Microsoft as owner of a specification it did not enforce. But the burning question now appears to be: are PPCs going to be crippled for the life of XScale, or is there an off-chip solution?

Along those lines I would still like to know: whether the memory bus is on or off the XScale chip; whether the memory, video and frontside buses are all the same, or this is another can of worms; and why the OEMs don't move back to StrongARM while Intel and Microsoft solve the performance problems (has Intel stopped making the StrongARM?).

BTW - this fiasco could become the perfect case study for marketing students in what can go wrong between product development and implementation. That will be especially true if marketing and pricing drive demand to the point where performance is no longer a consumer issue and XScale PPCs succeed in the marketplace!

Landis
12-08-2002, 04:30 PM
TomB, everything you mentioned is on-chip. Here is a link to the Developer's Manual.
ftp://download.intel.com/design/pca/applicationsprocessors/manuals/278522-001.pdf

Say what you want about Intel, but they're good about making technical documentation available to the general public! There is a simple overview of the chip layout, and there are extensive details about the control registers, including dozens of registers dedicated to memory control. These should be accessible through assembler code. This is NOT something you want the average coder accessing, but it would be appropriate for coders of the GapiDraw or DirectX for Pocket PC APIs, or game engine makers such as FatHammer.

There may be little any coder can do about this core-speed-related memory slowdown if there is some flaw in the on-chip architecture or logic. Keep in mind that slowing the core speed also slows the on-core cache speed. This, more than anything, might explain the drop in frame rate relative to core speed. The XScale does allow applications to reconfigure 28 KB of the 32 KB data cache as data RAM, which can be used for tables and often-used variables if one doesn't want to trust the chip's logic to cache these. The details can be found in the Core Developer's Manual here:
ftp://download.intel.com/design/intelxscale/27347301.pdf

As far as marketing is concerned, to anyone but the most serious nerd :wink: 400 MHz is 400 MHz. Those cycles ARE available, even though it may be hard to keep them fed. XScales certainly haven't really increased the performance of Pocket PCs yet - though, as we've seen, at least they haven't increased the price either.

Pony99CA
12-08-2002, 06:42 PM
This is what the Linux maintainers do to handle the XScale situation.
Apparently there is some sense in dual builds, contrary to the general opinion put out by Microsoft PR.
----------------------

The xscale is instruction-set compatible with strongarm, but some things that perform well on strongarm will perform badly on the xscale. In particular, multi-word load and store instructions run slowly on xscale. These are used in the standard procedure call prolog and epilog, so all user-mode applications will have to be recompiled in order to perform well on xscale. Both Debian and Handhelds.org have plans to do this for their respective distributions. It is possible that xscale binaries will perform well in both time and space and so we can standardize on binaries optimized for xscale. If not, we will perform dual builds in order to support both strongarm and xscale.

http://www.handhelds.org/projects/h3900.html
First, they did not say they would do dual builds; they said they might if another approach didn't work.

Second, Linux already supports multiple processor architectures, so perhaps one more isn't a big deal. While Windows CE targets multiple processors, the Pocket PC part may not.

Third, what Linux programmers as part of a freeware project do is a different matter from what Microsoft does as part of a commercial enterprise.

Fourth, maintaining dual code bases (even if the appropriate sections of code are inline) is more difficult than maintaining a single code base. I'm a software developer and have done it. It's something you want to avoid if you can.

Steve

Pony99CA
12-08-2002, 06:52 PM
PS. Jason, loosen up, eh? (same advice as you gave me) I am sure MS won't kill you if it turns out optimization is possible and not negligible. I am just inquiring. If I am wrong, then I am wrong.

<RANT>
:onfire:

You are not "just inquiring". You're speculating and making statements. As proof, here is one of your posts:


1. There is no reason to think optimizing the OS/compiler to compensate for the XScale memory design will break application compatibility. Certainly not in Linux; we are not talking about XScale extensions beyond ARM here. For e.g., the PocketTV team is working hard to squeeze out gains as small as 5-10%; Microsoft surely can put in the same effort for the same performance gain percentage. People have also been known to pay good money for 3-5% performance in desktop chips. So, not sure what to make of this dubious Microsoft excuse of "not enough gain for the sweat".

2. It's a hypothetical future on a two-year horizon, irrelevant to fixing the OS to compensate for the XScale memory design shortcoming. (Even adopting the Intel media extensions can fit comfortably within that time frame. Insisting that XScale be able to run the exact same OS as the 1.2M units of StrongARM PDAs is a pretty weak excuse in my opinion. There are more XScale PDAs by now. So what if programs run slower on the old iPAQ.)

There is not one question mark in that post. If you want to ask questions, ask them, but Jason has asked you to stop making technical pronouncements.

I'm a software developer with two degrees and 18 years of programming experience, and I wouldn't try to say what is possible in Windows CE. I might express my informed opinion, but not having worked on OS design (other than an operating systems class in college) nor having programmed on the CE platform, I would be guessing. How much operating system design have you done, or how much CE programming have you done, that makes you think you can do better? In fact, what programming experience of any kind do you have?

OK, I'd better stop before I :microwave:
</RANT>

Steve

Janak Parekh
12-08-2002, 07:15 PM
First off - Sven - great answers to the questions on this board. Now, let me see if I can shed more light here...

My understanding is, your high-level code is "compiled" and then this compiled code is executed by the hardware in tandem with OS control.
Not exactly. Your code runs on the CPU. The OS does some management, and only "takes over" in two main situations:
- During a context/process switch, when the OS tells the CPU to jump to other running code;
- When the application program requests an OS service, like opening and reading from a file (i.e., a "system call").

The OS's impact in these situations varies greatly. However, most "modern" OS environments have been doing their best to get the OS out of the way so that applications can run faster. Ergo, tools like DirectX or GAPI that bypass standard GDI calls and let you write "directly to the screen".
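
To give a feel for what "directly to the screen" means, here's a rough GAPI sketch (illustrative only - it assumes a 16-bit display, so real code should check the ffFormat field returned by GXGetDisplayProperties() before poking pixels):

#include <windows.h>
#include <gx.h>   /* GAPI - link with gx.lib */

/* Sketch: fill the screen by writing pixel values straight into the
   frame buffer, with no GDI calls in between. Error handling omitted. */
void FillScreen(HWND hwnd, unsigned short color)
{
    GXDisplayProperties gx;
    unsigned char *base, *row;
    int x, y;

    if (!GXOpenDisplay(hwnd, GX_FULLSCREEN))   /* take over the display */
        return;
    gx = GXGetDisplayProperties();

    base = (unsigned char *)GXBeginDraw();     /* pointer into video memory */
    for (y = 0; y < (int)gx.cyHeight; y++) {
        row = base + y * gx.cbyPitch;          /* cbyPitch: bytes between rows */
        for (x = 0; x < (int)gx.cxWidth; x++)
            *(unsigned short *)(row + x * gx.cbxPitch) = color;
    }
    GXEndDraw();
    GXCloseDisplay();
}

No window messages, no device contexts - just your code and the memory bus.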

do you have absolute total control over every bit movement in that 95% of the time? I'm starting to get this image of everybody writing in assembler or binary. heh...
As long as you're the scheduled process, you can do whatever you want, within limits. As Sven put it, there's some sandboxing, but other than that, sure, go ahead and execute raw machine instructions. In fact, there's an annual competition called "Assembly" (www.assembly.org) where people do hand-write machine instructions to squeeze the performance out of their target platform. Guess what these resulting binaries run on? For PCs: Windows and Linux.

--bdj

Pony99CA
12-08-2002, 07:16 PM
Many developers think that ignorance is bliss - but their bad habits catch up with them as the hardware chip makers turn to more sophisticated methods of increasing speed. The 400 MHz XScale exposed Joe Coder as a bad, lazy and/or ignorant craftsman. He would have got a significant performance increase by changing his data structures even on his old StrongARM - but with the XScale, the difference was even bigger.

OK, I'm getting a bit tired of the "bad, lazy and/or ignorant" talk. Not all programmers can know everything. That's why we have different job descriptions and different layers of application development.

I'm one of those programmers who tries to avoid hardware issues, but that's by choice. If I wanted to worry about hardware issues, I'd be writing operating systems, compiler backends or games. I'd probably also be working in Assembler. :-)

Like everything else in software development, you can always make improvements to performance. What most programmers try to do is ship a program when it's "good enough", not "perfect". If a new processor causes our program to become less than "good enough", it may be our fault for the design assumptions we made, but that doesn't make us bad, lazy or ignorant. It means we have to learn enough to make our program good enough again.

As you yourself said, there may be reasons to keep data structures in a non-optimal state (performance-wise) -- maintainability, for example. Each application may require those decisions to be made differently, and calling programmers names isn't really helping.

Didn't Intel tout the XScale as compatible with the StrongARM and running at 400 MHz (for the most commonly shipped one in Pocket PCs)? I think it's not an unreasonable assumption on most programmers' parts to believe that Intel wouldn't have done something to make their programs actually run slower. I suppose that could be lazy or ignorant on their part, but I think it says more about Intel's design and/or PR.


Writing high-performance software is a difficult task. Hopefully this article and all the good comments have shed some light on our profession and will help improve our future applications. One thing is for sure: the next generation of mobile CPUs will widen the gap even further :wink:

Yes, writing high-performance software is difficult. Writing any good software is difficult, but not all software has to be high-performance. If a processor causes software that was running well to suddenly not run well, it says as much about the processor as the program, I think.

All that said, I do think that was a good article, and I appreciate you putting that information out. I hope you've put something similar on the Pocket PC developer sites to let all the Joe Coders out there know about this. :-)

Steve

Janak Parekh
12-08-2002, 07:44 PM
OK, I'm getting a bit tired of the "bad, lazy and/or ignorant" talk. Not all programmers can know everything. That's why we have different job descriptions and different layers of application development.
Not that you're a bad programmer, but there are a lot of "bad, lazy and/or ignorant" programmers. Witness the number of C++ programmers who don't use the STL and still implement linked lists by hand where a vector would do. :)

BTW - we should be careful about what an "OS" is: in modern OSes, we have to keep kernel issues somewhat distinct from bundled-application issues.

--bdj

mookie123
12-08-2002, 08:43 PM
As long as you're the scheduled process, you can do whatever you want, within limits. As Sven put it, there's some sandboxing, but other than that, sure, go ahead and execute raw machine instructions. In fact, there's an annual competition called "Assembly" (www.assembly.org) where people do hand-write machine instructions to squeeze the performance out of their target platform. Guess what these resulting binaries run on? For PCs: Windows and Linux.

Right - I didn't understand the extent of the freedom compiled code has within the confines of each process/thread.

mookie123
12-08-2002, 09:00 PM
I'm a software developer with two degrees and 18 years of programming experience, and I wouldn't try to say what is possible in Windows CE. I might express my informed opinion, but not having worked on OS design (other than an operating systems class in college) nor having programmed on the CE platform, I would be guessing. How much operating system design have you done, or how much CE programming have you done, that makes you think you can do better? In fact, what programming experience of any kind do you have?


See the developer chat transcript above regarding OS optimization before making your informed opinion. Thanks.

Daniel
12-08-2002, 09:19 PM
Pony99CA wrote about the whole "lazy" programmer thing - that not all programmers can know all things. I have to say that I agree with this. There is so much to know in programming, and optimizing code is one of those really in-depth jobs that not everyone can do. Ideally the optimizer would know everything and perhaps just do that. Realistically, everyone that writes code is expected to know how to optimize their code to at least a certain extent.

I will say this, though: there are two sides to this. There are lazy programmers out there, as well as diligent programmers who don't have all the information they need to do a perfect job. Let's be honest - we (programmers) have all probably worked with at least one other programmer who just doesn't try; I call that lazy. I've also worked on projects where programmers were specifically told not to optimize anything. This scares me a bit, but what do you do? That doesn't make me lazy.

Daniel

Janak Parekh
12-08-2002, 11:33 PM
I've also worked on projects where programmers were specifically told not to optimize anything. This scares me a bit, but what do you do?
Huh? If it's that vague something is very wrong. Every programmer wants to write reasonably optimal code, at least within the boundaries of their knowledge.

--bdj

Pony99CA
12-09-2002, 12:16 AM
I'm a software developer with two degrees and 18 years of programming experience, and I wouldn't try to say what is possible in Windows CE. I might express my informed opinion, but not having worked on OS design (other than an operating systems class in college) nor having programmed on the CE platform, I would be guessing. How much operating system design have you done, or how much CE programming have you done, that makes you think you can do better? In fact, what programming experience of any kind do you have?

See the developer chat transcript above regarding OS optimization before making your informed opinion. Thanks.
I had read the whole thread (but hadn't followed all of the links) before making those last posts, so you're welcome. For future reference, I generally read an entire thread before posting to it.

I also haven't expressed any opinion about whether optimizing for XScale would work or not, unlike you, so you're welcome again. See, sometimes, I actually know when to keep my mouth shut. :-) (And, no, this isn't one of those times. :lol:)

You can post all of the transcripts you want, but if you don't have much programming background, how much of them do you understand? I tried to explain to you why dual-pathing code or, worse, keeping multiple code bases was a bad idea. Are you disagreeing with that?

Now let's hear about your programming background; you've been asked at least twice, but for some reason seem to be avoiding that question....

Thanks.

Steve

TomB
12-09-2002, 05:19 PM
Getting back to the thread - I think we can say that no matter how fast they become or how far they are overclocked, PDAs with the XScale chips we have now are NEVER going to outperform 206 MHz StrongARM PPCs. There are many reasons for this, but the PRIME reason is that the 100 MHz memory bus is locked as part of the XScale chip and cannot be optimized off-chip. In other words, XScale PDA customers will be buying an inferior device until Intel physically changes the bus on their chips (if ever). Please correct this if I am wrong. And Jason, if you can get someone from Intel engineering to review this thread and comment - that would be incredible!

Finally, if this is true, WHY ARE OEMs USING THIS CHIP INSTEAD OF THE STRONGARM?

ChezDoodles
12-09-2002, 05:57 PM
I think we can say that no matter how fast they become or how far they are overclocked, PDAs with the XScale chips we have now are NEVER going to outperform 206 MHz StrongARM PPCs
I don't think you can be that definitive, TomB. The fact that XScale and StrongARM use buses with almost identical bandwidth (approx. 100 MHz and 16- or 32-bit width) does not mean that an XScale can never outperform a StrongARM. The larger cache, higher internal processing speed and new extensions to the instruction set mean that applications do have the potential to run much faster if developers learn to design code and data access patterns that benefit from these improvements.
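
To make "code and data access patterns" concrete, here is the classic illustration (a sketch of my own for this discussion, not from the article). Both functions sum the same image buffer, but only the first makes full use of each cache line it drags across the 100 MHz bus:

#define ROWS 240
#define COLS 320

/* Sequential walk: every byte of each cache line fetched over the
   slow memory bus is used before the line is evicted. */
long sum_row_major(const short img[ROWS][COLS])
{
    long total = 0;
    int r, c;
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            total += img[r][c];
    return total;
}

/* Strided walk: consecutive accesses land 640 bytes apart, so the bus
   hauls in a whole cache line for every 2-byte value actually used. */
long sum_column_major(const short img[ROWS][COLS])
{
    long total = 0;
    int r, c;
    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            total += img[r][c];
    return total;
}

Both versions execute exactly the same number of instructions, yet on a cache-starved handheld the second can run several times slower - that is the kind of headroom developers can claw back.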

There are other reasons as well why the current crop of XScale Pocket PCs does not perform up to its potential (and our expectations) - but the scope of the article was to explain how a single subsystem can sometimes become a bottleneck - especially if developers are not aware of the issues at hand - and why memory-intensive operations will never be able to run faster than they do on StrongARM.

Don't treat this as "the one and only" explanation - but rather as one important piece of the overall puzzle :lol:

enemy2k2
12-09-2002, 07:37 PM
I'm pretty sure that if you had an XScale overclocked to 1 GHz, it would definitely beat a StrongARM, even with the 99.5 MHz bus. :P I wouldn't want to be the one to try it, though :lol:

TomB
12-09-2002, 08:53 PM
Sven, I understand where you are coming from. As a businessman who has been waiting to invest in multimedia for PPCs, I need an overall statement on XScale to figure out my next move. So let's try this again: can we say that given the hardware we have today, what is PRACTICAL to implement as a fix, and the severe limitations built into this chip, this generation of XScale will probably never perform at the level of the StrongARM? If this is wrong, what can we say about the media capability of XScale PPCs?

Yuta
12-11-2002, 06:00 PM
Can someone clarify for me whether OEMs could have done something to use the bus in 32-bit transfer mode?
That is, if ChezDoodles is right that it can be done.