Sunday, March 25, 2012

How a web application can download and store over 2GB without you even knowing it

I have been experimenting with the HTML5 offline application cache some more over the last few days, doing boundary tests in an attempt to learn more about browser behaviour in edge cases.

One of these experiments was testing the cache quota.

Two weeks ago, I blogged about generating and serving an offline application manifest using ASP.NET MVC. I reused that code to add hundreds of 7MB PDF files to the cache.

public ActionResult Manifest()
{
    var cacheResources = new List<string>();
    var n = 300; // Play with this number

    // Each entry gets a unique query string, so the browser treats
    // the same PDF as n separate resources and caches every one of them.
    for (var i = 0; i < n; i++)
        cacheResources.Add("Content/" + Url.Content("book.pdf?" + i));

    // ManifestResult comes from the code in the earlier post.
    var manifestResult = new ManifestResult("1")
    {
        NetworkResources = new string[] { "*" },
        CacheResources = cacheResources
    };

    return manifestResult;
}

I initially tried adding 1000 PDF files to the cache, but this threw an error: Chrome failed to commit the new cache to storage because the quota would be exceeded.

After lowering the number of files several times, I hit the sweet spot. I could add 300 PDF files to the cache without breaking it.

Looking into chrome://appcache-internals/, I can see that the cache now takes up a whopping 2.2GB for a single web application.
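
The 2.2GB figure is roughly what the arithmetic predicts. Here is a quick TypeScript sketch, assuming each PDF is exactly 7MB counted in binary megabytes (the real file size is only known as "7MB"). The name `estimateCacheBytes` is made up for this example.

```typescript
const BYTES_PER_MB = 1024 * 1024; // assumption: binary megabytes

// Hypothetical helper: total size of fileCount cached copies of one file.
function estimateCacheBytes(fileCount: number, fileSizeMb: number): number {
  return fileCount * fileSizeMb * BYTES_PER_MB;
}

const totalBytes = estimateCacheBytes(300, 7);
console.log((totalBytes / 1e9).toFixed(1) + " GB"); // prints "2.2 GB"
```

By the same estimate, the 1000 files from my first attempt would have needed about 7.3 GB.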


As a user, I had no idea that the website I'm browsing is downloading a suspicious amount of data in the background. Neither Chrome (17.0.963.83) nor any other desktop browser that I know of warns me. I would expect the browser to ask for my permission when a website wants to download and store such an excessive amount of data on my machine.

Something else I noticed is that other sites now fail to commit anything to the application cache, because the browser-wide quota has been exceeded. I'm pretty sure this 'first browsed, first reserved' approach will be a source of frustration in the future.

To handle this scenario, we could use the applicationCache API to listen for quota errors and tell the user to browse to chrome://appcache-internals/ and remove other caches in favor of the new one. This feels sketchy though; shouldn't the browser intervene in a more elegant way here?
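
A minimal sketch of such a listener, in TypeScript. It assumes the `error` event on `window.applicationCache` is the only failure signal the page gets and that the event does not say why the update failed, so a full quota can only be suggested as a likely cause. `CacheEvents`, `watchForCacheFailure` and the message text are names made up for this example.

```typescript
// Minimal shape of the object we listen on. window.applicationCache fits
// this shape; the interface keeps the sketch runnable outside a browser.
interface CacheEvents {
  addEventListener(type: string, listener: () => void): void;
}

// Hypothetical helper: calls notify(message) whenever a cache update fails.
function watchForCacheFailure(
  cache: CacheEvents,
  notify: (message: string) => void
): void {
  cache.addEventListener("error", () => {
    // Assumption: the event carries no reason, so an exceeded quota is
    // only one possible cause (a resource that fails to download is another).
    notify(
      "This site could not be stored for offline use. " +
        "If the browser's quota is full, you can free up space at " +
        "chrome://appcache-internals/."
    );
  });
}

// In a page: watchForCacheFailure(window.applicationCache, (msg) => alert(msg));
```

Passing the cache object in as a parameter, instead of reaching for `window` directly, is only there so the helper can be tried out with a stand-in object.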


What are your thoughts? What would you want your browser to do in these scenarios?

44 comments:

  1. I want my web browser to have some subtle indication of high network I/O, in each tab. It shows synchronous network activity in the toolbar already -- why can't it show asynchronous network activity somehow?

    I also want browser tabs to show me which one is playing sound, BTW, which browser makers seem to not care about, either.

    We really have very little insight into what a webpage is doing in a modern 2012-era web browser.

    Replies
    1. The 2012 era? This didn't happen in the 2011 era!! Don't make me get out black and white photos of the 2010 era!

    2. There are good technical reasons why nobody has built a what-tab-is-making-that-sound indicator; the makers are aware it would be nice. But usually the culprit is Flash, and playing sound is between the Flash plugin and the OS. It doesn't go through the browser.

    3. I'm pretty sure Flash exposes that information through its JavaScript API, and seeing as how everything in Firefox is JavaScript, I don't think it should be that hard to do.

    4. I would assume that the 7MB PDF file is only downloaded once, so in this case it doesn't have a large effect on bandwidth.

    5. I've heard the Flash excuse before. I don't buy it. They also claim that plugins run in their own process (which is why Flash crashing doesn't take down my whole browser). There's lots of third-party apps that let me record audio from any process, and have a handy volume meter to let me find the right one. If a third-party app can do this, then the browser should be able to. It's not that hard.

      The only thing extra the browser would need to do is map the process ID to the tab that contains that plugin instance.

      It doesn't have to go "through" the browser for the browser to be able to see it. It just has to have a way to identify it through the operating system. Coordinating hardware access for multiple processes is what operating systems are for!

  2. Great point about sound! Back to the topic, why isn't there a per-site quota on size? Would it have accepted 300 PDFs of any size??

    Replies
    1. No, it was about the total size, not the number of files, so a single 2.2GB file could have been used as well.

      As for a per-site quota, that makes the problem even worse:
      1.evilsite.com: 2.2GB
      2.evilsite.com: 2.2GB
      3.evilsite.com: 2.2GB
      total: 6.6GB
      Even counting by TLD rather than simply by domain (most browser functions handle a site as a domain, so cookies, for instance, are per domain, as is XHR), we get:
      evilsite.com: 2.2GB
      123.4.567: 2.2GB
      notEvilSite: 2.2GB
      789.0.123: 2.2GB
      total: 8.8GB (needing only 2 TLDs and using the fact that browsers rarely recognize that a named domain is the same as the IP domain)

      This opens up the possibility of completely filling someone's HDD, which is worse than the potential DoS described in the article.

  3. You're right, this is definitely an issue. What I really dislike about this is that apps which abuse this will slow down browsing and download speeds significantly. When you have 50+ tabs open and you don't know which one is the culprit, it becomes a serious pain.

    Replies
    1. That's what the task manager in Chrome is for - lets you easily see which tabs are doing what.

  4. It doesn't make sense to have a 4KB limit for cookies and a 2GB cache...

    Replies
    1. Cookies are sent back to the server on every request. Given that, it does make sense to have a low maximum size.

  5. The site could download an infinite number of images in the background, correct? 40GB? More?

    I like the idea of making an indicator in the tab for asynchronous downloads. Maybe I could create a Firefox extension for that...

    Replies
    1. Wouldn't these show up in the status bar 'waiting for..'?

  6. "As a user, I had no idea that the website I'm browsing is downloading a suspicious amount of data in the background. Chrome (17.0.963.83), nor any other desktop browser that I know of, warns me."

    Firefox asks you to confirm: http://cl.ly/FHjV

    Replies
    1. Which version of FF are you on? I tried with version 10, and no permission was asked.

    2. FF version 11 does ask for me.

  7. This appears to be a Chrome problem. iOS, for instance, limits the amount to 5MB; you must request more. And Firefox asks for permission no matter the amount.

  8. I thought Chrome would ask permission to store more than 5 or 20MB locally. Maybe it has to do with using localhost?

    Replies
    1. Hmm, that would be odd. Geolocation asks for permission even if it's localhost. Also when I download Angry Birds (30MB), no permission is asked.

    2. I guess it is all a question of trusted sources. Geolocation as an API is supposed to respect the user settings and is a straight browser issue. As for downloading Angry Birds, well, that is part of the Google Chrome App structure, and is explicitly a data-heavy in-browser app that Google has supposedly vetted enough to let it into the Chrome app store.

      I just looked up what the cache issues are on mobile devices, and they seem to me a lot stricter. This article might be worth referencing in your future tests, especially since Mobile Safari allows using PDF as an image format: http://www.yuiblog.com/blog/2010/06/28/mobile-browser-cache-limits/

    3. I pushed the web application to a remote server, and this makes no difference. It's a lot slower though ;)

  9. Browsers should provide an activity bar whose job is to give a quick visual idea of storage, CPU and memory utilization. We could then set browser-wide and site-specific limits from this bar, and the browser could explicitly allow us to increase the quota temporarily.

    This could be a browser extension. However, it is important enough that it should be part of all browsers.

  10. I often have one tab instance consume a lot of CPU due to something like Flash (advert, video, etc.) pegging one or more CPUs, so yes in general, for any user, I think it would be very helpful to optionally be able to see which tab/browser window is consuming a lot of any resource. Otherwise you need to close a tab/window until you get to the one that is consuming a lot of resources. With HTML5 this will probably be even more useful.

    With more browsers sandboxing each tab, it would seem to me that this would be something that a browser should be able to do. I also agree that it is something that should be built into the browser - how I imagine it implemented would maybe be some small status indications for each tab/window, and a more detailed dedicated tab showing the status/resources used/etc. for each tab/window for that browser instance (or shared instances).

    Replies
    1. Have you tried Shift+Esc in Chrome?

  11. What is Shift-Esc supposed to do? I see no reaction...

    Replies
    1. It should show the Chrome task manager (http://support.google.com/chrome/bin/answer.py?hl=en&answer=95672)

    2. The shortcut is not supported in Mac OS X.

  12. Looks like we are replacing the operating system with the browser without learning the lessons of the past. HTML5 has advantages, but with it come security/privacy issues which are even less under the control of the user. With old school applications the operating system at least had a say!

  13. So this is a handy new feature. The "user-upper-wireless-data-quota-in-no-time-er"

  14. False witness if you're talking about me.

  15. The browser should absolutely handle this. It's the browser's responsibility to maintain boundaries on this cache and provide the user with the tools to be aware of it and manage it.

  16. In Chrome, there is sometimes a warning when files try to use local storage. They should just add that warning when a page includes a manifest.

  17. Hi, in Firefox 11.0 (Ubuntu) you can see the amount of cache currently in use and even set a limit on it. It will also clear itself once it reaches the limit.

    Ohh Yeahh! Firefox Rules!!!

  18. Heyy,, not talking about browsers... .Net Sucksssssssssssssssss you are a looser, so many good languages out there!!!

    Replies
    1. Thank you for this helpful comment.

    2. Haha, I can see why you approved it, Jef ;) The insight is just... well... indescribable.

  19. Web Application is a dynamic extension of a web or application server that provides a way for the marketers to get to know the people visiting their sites.
    Web Application

  20. This comment has been removed by a blog administrator.

  21. This comment has been removed by a blog administrator.

  22. Interesting blog. It would be great if you could provide more details about it. Thank you.


    J2ME Application Development

  23. I think I read that Windows 8 will allow the user to see how much internet traffic each app uses. Will this make users more vigilant about the amount of traffic applications use? Will browsers follow this trend and make individual web site stats more visible? Could this lead to the shaming of "data hog" applications?

    In remote countries like New Zealand, the coming faster local internet could make fixed-speed international traffic much more apparent, further highlighting data hog apps to users.
