
April 19, 2008

The shortcomings of benchmarks

We do a lot of system reviews here on jkOnTheRun, and invariably when we publish one there is a contingent of readers who want to see raw benchmarks of the hardware components to try to determine how fast a system is.  I understand why some folks find them interesting for comparing different computer systems, but I also feel that they are not nearly as useful as some believe.  That is the reason why I neither run nor publish benchmarks in my system reviews, to the chagrin of some.

I find through experience with many different systems, both desktop and mobile, that benchmarks never paint an accurate picture of how well a given device will perform in the real world.  There are too many factors in play at a given time that affect the user's experience for a single benchmark to tell the story well.  It really doesn't matter if a given system can compute the value of pi to a million decimal places quickly if it takes too long to open a window for the user.  Today's complicated systems are affected by many factors: CPU power, hard disk speed, memory and HDD caches, graphics subsystems, total installed memory, operating system version, and other internal components all play a role in how well a system performs overall.  Individual benchmarks of raw power don't reflect this, in my opinion, and this is why I don't publish them.

CPU benchmarks aren't the only ones that are often bandied about when a new system hits the market; usually we'll see Battery Eater statistics too.  These can be interesting but again, real world usage is what really matters, not how quickly a given battery can be run dry.  There is a reason why modern systems have sophisticated power management systems in place, and these are the single biggest factor affecting total battery life, aside from how many cells a given battery contains.  This is why I always try to give a real world indication of how long a given battery lasts for me during an evaluation and not a simple benchmark.  It has more meaning in my experience.

Another statistic that is often requested is how long a given system takes to perform a cold boot.  While this can also be interesting, I don't put much value in this statistic.  I have come to realize that how long a given device takes to boot is no real indication of how well a system will perform once running.  There are so many factors affecting boot time, especially how many and what utilities are executed during the boot process, that how long it takes to boot is not an accurate indicator of anything.  The speed of the hard drive also comes into play a great deal, and this is never considered in these boot time statistics.  Devices that boot slowly can often run quickly and vice versa, so I never consider how long the boot time is on one of these devices.  You can't even compare a given system's boot time with other systems because of the factors I mentioned; the numbers don't mean anything if different factors are affecting the boot times.  I find there usually are such factors, so that's why I take boot times with a grain of salt.  I almost never boot a system cold anyway; these days I use standby/sleep and resume exclusively.

I'm not putting down those who want to see benchmarks, I understand that with a new system people are often clamoring for any information they can get.  I'm just pointing out why I believe these varied benchmarks have little use to me and that they should be taken with a grain of salt, not as hard facts that provide a real indication of how well a system will perform for the user.


Comments

Bravo. Ultimately, what's important is real-world performance. Lacking first hand experience, benchmarks may serve as a proxy for comparison purposes. However, we shouldn't fool ourselves into thinking that benchmarks can replace first hand experience, or that benchmarks are, somehow, better than first hand experience.

>>>That is the reason why I never run nor publish benchmarks in my system reviews, to the chagrin of some.

Bah! Just send the Mini Note to me and I'll do the damned benchmarks and post them.

Of course, you can expect the Mini Note to get lost in transit on the way back. But you can't have everything -- even though you surely try! Heh-heh.

I agree with you that benchmarks don't reflect everything because of the many underlying factors. In my opinion, benchmark results should be published with an explanation of the test situation, in which the researcher has ruled out as many factors as possible. Even then they should indeed still be treated as an indication.

But I also think that a well-performed and understated benchmark will be an addition to a review, especially because it is objective & repeatable. A writer's review often reflects only his own subjective view/experience. In that case you can only agree or disagree with him. That also applies to the user.

Btw, improving systems is a bit hard without measuring something.

Performance benchmarks help to balance and support a reviewer's opinions by providing quantifiable measures from standardized tests. I like to look at the individual component scores (CPU, graphics, memory, HDD) to understand a machine's relative strengths and weaknesses. This provides quantifiable data to either support or spur discussion regarding the qualitative perceptions presented in a review.

The "grain of salt" philosophy should be applied to reviews in general -- whether it's computers, cars, restaurants, wine, etc. Consumers should first try to understand their own needs/preferences and then look to a variety of trusted sources to determine whether a product meets those requirements.

Aside: Boot time and standby recovery time are important benchmarks when using an ultra-portable device for PDA functions. I've had to wait up to 3 minutes for my mobile PC to wake up from standby mode just so I could check my calendar and book an appointment. IMO device makers and Microsoft could do a lot more to improve the 'real world' responsiveness of portable computers.

While I mostly agree with you James, there is one area that I'm forced to disagree on: battery life. I fully agree that synthetic benchmarks rarely, if ever, provide any indication of how a system will perform in the real world.

That said, I have to disagree with you when it comes to battery life benchmarks. Why? It's the same mantra that you guys repeat over and over around here: "everyone's usage model is different." What that means here is that your "real world" battery life for a given computer will be different from mine, because we use our computers differently. The advantage of something like a Battery Eater benchmark is that it allows you to say "I get __ hours of use in the real world, but regardless of how you use the system it won't be worse than ___." That means that when I'm considering a new computer, I can say that I'll expect to get about the battery life that you do, because our usage model is probably pretty similar, but also that if I want to do heavy computing tasks (which I occasionally do while mobile) or I want to watch a movie on an airplane (if I didn't have time to put it on the Zune), then I also know the battery won't die halfway through my movie because I know the worst-case battery life. So, when I'm reading reviews, I find it greatly advantageous to have both the subjective "real world" battery life and the "worst case" Battery Eater benchmark so I can get an accurate picture of what to expect from a given system.

Just an idea:

Maybe if you feel like a particular benchmark is overstating or understating real-world performance in a given area, a quick video of the device performing some sample task is great to show how fast/slow things really are.

I remember when Tom's Hardware posted a video of side-by-side P4 3.2 GHz and 3.0 GHz HT systems booting XP, starting Photoshop, then starting Word, then loading the same file. Seeing the applications take different amounts of time to launch was a great demonstration of the dramatic speed increase offered by hyperthreading technology, and was helpful as I was looking to purchase a desktop PC at the time.

I agree with the general opinion that James takes. But I will also say that I don't frequent jkOnTheRun for benchmarks. There are other websites that do that and do that well. I come to jkOnTheRun to get a sense of the overall user experience. For the most part, James handles his equipment for an extended period of time. I look forward to the long term experiences that he talks about. These focal points can be far different from what we read in a one-shot review that looks at how A compares to B. Even knowing what James is looking at matters because it coincides so often with what I am looking at. Constructing a mobile life is not an easy task. Even the discussions on carry bags are useful.

But I'll never disregard the value of benchmarks. They do matter. The best websites that detail benchmarks are consistent in their application of the benchmarks and they give a good general comparison. But the beauty of the Internet is that I'm not limited to one source of information. Long live JKOTR!

While benchmarks might not tell you how a computer will operate in a real world situation, they do offer an important view of how powerful the machine is. In a desktop world, that isn't much of a problem because most modern desktop processors will handle any app that a home user would come across. However when people want to run Photoshop or play games on their UMPC, having an idea of how powerful the system is becomes very important.

Since we are talking about benchmarks ... anyone already seen the benchmarks from eee PC News where the new Atom gets creamed by the Isaiah?

http://translate.google.com/translate?u=http%3A%2F%2Fwww.eeepcnews.de%2F2008%2F04%2F18%2Fintel-atom-benchmarks-via-isaiah-vergleich%2F&langpair=de%7Cen&hl=en&ie=UTF8

Philosophically I agree with you, but practically I'm desperate to see more benchmarking of different laptops, particularly wrt graphics performance.

One of the reasons we're not seeing any Tablets or "Ultraportables" with decent graphics cards is no-one is reporting on a regular basis how underpowered the current options are. Benchmarking can at least help with this.

Should it matter? Possibly not, but my own experience is that Vista with one of those ageing Intel chipsets isn't all fun. Now you are free to disagree, but at the very least we need the information to help us select a machine based on our needs.

What gets me is that every so often people grumble about benchmarks, but nobody ever seems to question these awful unboxing videos and photo sets that are so popular. At least a benchmark gives you SOME indication of a system's relative performance. An unboxing is all about a person who has not used something documenting themselves not using it. That's oh-so-better than any old benchmark and really helps me make informed purchases.

i am not at all surprised to see this comment coming from James, especially considering the nature of this website as a casual mobile tech blog. this is not a hardcore tech site, not even close.

but James, in the professional world when benchmarking is done by professionals (like Toms Hardware) benchmarks can be extremely telling. why? because they do benchmarking scientifically with completely controlled environments/variables, give exact machine specs, use multiple machine comparisons, discuss results in 20-page explanations. i agree that the benchmarks shown on these amateur UMPC websites are absolutely laughable & accomplish nothing but misinforming the public. they toss up the CPU type & a bar graph and say "look, Intel's better!". but...

what are the full specs of the winning machine?
what are the specs of the other machines?
what OS?
what were the environment settings?
what other applications were running?
what is the power consumption?
on & on...

personally, i really get annoyed at these ghz-to-ghz CPU comparisons that DON'T list power consumption. design choices make ghz comparisons almost useless. all that matters is actual performance & at how many watts.
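The "performance at how many watts" idea above can be sketched as a simple ratio. This is only an illustration: the `BenchmarkResult` record, its field names, and the numbers are all made up here, and a real comparison would also need the full specs and test conditions asked for in the list above.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """Hypothetical record of one benchmark run (not from any real tool)."""
    system: str
    score: float      # benchmark score, higher is better (arbitrary units)
    avg_watts: float  # average power draw during the run, in watts

    def score_per_watt(self) -> float:
        """Return performance per watt, the figure the comment asks for."""
        if self.avg_watts <= 0:
            raise ValueError("avg_watts must be positive")
        return self.score / self.avg_watts


def rank_by_efficiency(results):
    """Sort results from most to least efficient (score per watt)."""
    return sorted(results, key=lambda r: r.score_per_watt(), reverse=True)


# Made-up numbers purely for illustration.
results = [
    BenchmarkResult("System A", score=1200.0, avg_watts=8.0),
    BenchmarkResult("System B", score=1500.0, avg_watts=15.0),
]
ranked = rank_by_efficiency(results)

for r in ranked:
    print(f"{r.system}: {r.score_per_watt():.1f} points per watt")
```

In this made-up case System B has the higher raw score, yet System A comes out ahead once power draw is taken into account, which is why a bare score or clock speed can mislead.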


Copyright Notice


  • Copyright 2004 - 2008 by Giga Omni Media, Inc. All rights reserved. The content in this RSS feed, as well as the content presented on the web pages of the blog, is provided for your personal non-commercial use only and may not be republished in whole or in part without the express written or verbal consent of the publisher. All rights are reserved.