banksco.de blog

The State of Real-Time Web in 2016

Paul Banks — Mon, 11 Jan 2016 00:00:00 GMT

I've been working on infrastructure for real-time notifications for a high-traffic site on and off for a few years, and have recently been contributing to Centrifuge.

This post is an attempt to sum up how I see the state of the relevant technologies at the start of 2016.

I'll walk through the various techniques for delivering real-time messages to browsers. There are good resources for details of each, so I'll instead focus on the gotchas and incompatibilities I've come across that need to be accounted for in the wild.

This information is a mixture of first-hand experience and second-hand reading, mostly of well-tested libraries such as SockJS, socket.io and MessageBus.

WebSockets

It's 2016. We are officially in the future. WebSockets are a real standard and are supported in all recent major browsers.

That should really be the end of the article but, as always, it isn't.

Sam Saffron (the author of MessageBus and Co-Founder of Discourse) recently blogged about why WebSockets are not necessarily the future. I found his post truly refreshing as I've run into almost all of the pain points he describes.

That said, Sam's post is focusing on the case where your WebSocket/streaming/polling traffic is served by the same application and same servers as regular HTTP traffic.

There are many reasons I've experienced which suggest this might not be the best approach at scale. Sam even mentions this in his article. I can't say it's always a bad one - Discourse itself is proof that his model can work at scale - but I've found that:

1. Long-lived requests are very different to regular HTTP traffic, whether they are WebSockets, HTTP/1.1 chunked streams or just boring long-polls. For one real-life test we increased the number of sockets open on our load balancer by a factor of 5 or more in steady state, with orders-of-magnitude higher peaks during errors causing mass-reconnects. For most websites, real-time notifications are a secondary feature; failure in a socket server or overload due to a client bug really shouldn't be able to take out your main website, and the best way to ensure that is to have the traffic routed to a totally different load balancer at DNS level (i.e. on a separate subdomain).
2. If your web application isn't already an efficient event-driven daemon (or doesn't have equivalent functionality like Rack Hijack), long-lived connections in the main app are clearly a bad choice. In our case our app is PHP on Apache, so handling long-lived connections must occur in separate processes (and in practice on separate servers) with suitable technology for that job.
3. Scaling real-time servers and load balancing independently of your main application servers is probably a good thing. While load balancing tens or hundreds of thousands of open connections might be a huge burden to your main load balancer as in point 1, you can probably handle that load with an order of magnitude or two fewer socket servers than are in your web server cluster if you are at that scale.

But with those points aside, the main thrust of Sam's argument that resonates strongly with my experience is that most apps don't need bidirectional sockets, so the cons of using WebSockets listed below can be a high price for a technology you don't really need. Sam's article goes into more detail on some of the issues and includes others that are not as relevant to my overview here, so it is worth a read.

WebSocket Pros

Now supported in all modern browsers.
Efficient low-latency and high-throughput transport.
If you need low-latency, high-throughput messaging back to the server they can do it.
Super easy API - can make a toy app in an hour.

WebSocket Cons

Despite wide browser support, they are still not perfect: IE 8 and 9 and some older mobile browsers need fallbacks anyway if you care about wide compatibility.
There were many revisions and false starts in the history of the WebSocket protocol, so in practice you still have to support old, quirky protocol versions on the server for wide browser coverage. Mostly this is handled for you by libraries but it's unpleasant baggage nonetheless.
Even when supported in browser, there are many restrictive proxies and similar in the wild that don't support WebSockets or which close connections after some short time regardless of ping activity. If you use SSL things improve a lot as proxies don't get to mangle the actual protocol, but still not perfect.
Due to the two issues above, you almost certainly need to implement fallbacks to other methods anyway, probably using a well-tested library as discussed below.
They work against HTTP/2. Most notably, multiple tabs will always cause multiple sockets with WebSocket, whereas the older fallbacks can all benefit from sharing a single HTTP/2 connection even between tabs. In the next few years this will become more and more significant.
You have to choose a protocol over the WebSocket transport. If you do write data back with them, this can end up duplicating your existing REST API.

WebSocket Polyfills

One of the big problems with WebSockets then is the need to support fallbacks. The sensible choice is to reach for a tried and tested library to handle those intricate browser quirks for you.

The most popular options are SockJS and socket.io.

These are both fantastic pieces of engineering, but once you start digging into the details, there are plenty of (mostly well-documented) gotchas and quirks you might still have to think about.

My biggest issue with these options though is that they aim to transparently provide full WebSocket functionality which we've already decided isn't actually what we need most of the time. In doing so, they often make design choices that are far from optimal when all you really want is server to client notifications. Of course if you actually do need bi-directional messaging then there is not a lot to complain about.

For example, it is possible to implement subscription to channel-based notifications with a single long-poll request:

Client sends request for /subscribe?channels=foo,bar
You wait until there is data then request returns (or times out)
Authentication and resuming on reconnect can be handled by passing headers or query params in a stateless way
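To make the single-request idea concrete, here is a minimal sketch of the server-side logic, written in Python purely for illustration. The ChannelLog class, the handle_subscribe function and the since query parameter are all hypothetical names I've made up for the example (the /subscribe?channels=foo,bar URL is the only part taken from the steps above), and a real server would use an event loop rather than blocking threads.

```python
import threading
from urllib.parse import urlparse, parse_qs


class ChannelLog:
    """In-memory, append-only event log shared by all channels (illustrative only)."""

    def __init__(self):
        self._events = []  # list of (cursor, channel, payload)
        self._cond = threading.Condition()

    def publish(self, channel, payload):
        with self._cond:
            cursor = len(self._events) + 1
            self._events.append((cursor, channel, payload))
            self._cond.notify_all()  # wake any waiting long-polls
            return cursor

    def _since(self, channels, cursor):
        return [e for e in self._events if e[0] > cursor and e[1] in channels]

    def wait(self, channels, cursor, timeout):
        """Block until an event newer than `cursor` exists, or the timeout passes."""
        with self._cond:
            self._cond.wait_for(lambda: self._since(channels, cursor), timeout)
            return self._since(channels, cursor)


def handle_subscribe(log, url, timeout=25.0):
    """Serve one long-poll; everything needed is carried in the URL itself."""
    query = parse_qs(urlparse(url).query)
    channels = set(query["channels"][0].split(","))
    # "since" is an assumed parameter name for the resume cursor.
    cursor = int(query.get("since", ["0"])[0])
    events = log.wait(channels, cursor, timeout)
    next_cursor = events[-1][0] if events else cursor
    return {"events": [(ch, p) for _, ch, p in events], "since": next_cursor}
```

Because the cursor travels with each request, any server that can read the log can answer any poll, which is why no sticky sessions are needed in this model.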

Yet if you are using a WebSocket polyfill, it's likely that you use some sort of PubSub protocol on top of the abstracted WebSocket-like transport. Usually that means you connect, then send some sort of handshake to establish authentication, then one or more subscribe requests and then wait for messages. This is how most open source projects I've seen work (e.g. Bayeux protocol).

All is fine on a real WebSocket but when the transport transparently reverts to plain old long-polls, this starts to get significantly more complicated than the optimal, simple long-poll described above. Each of the handshake and subscribe messages might need to be sent in separate requests. SockJS handles sending on a separate connection to listening.

Worse is that many require that you have sticky-sessions enabled for the polling fallback to work at all since they are trying to model a stateful socket connection over stateless HTTP/1.1 requests.

The worst part is the combination of poor WebSocket support in most popular load balancers and the need for sticky sessions. That means you may be forced to use Layer 4 (TCP/TLS) balancing for WebSockets but you can't ensure session stickiness if you do. So SockJS and the like just can't work behind this kind of load balancer. HAProxy is the only one of the most popular load balancing solutions I know of that can handle Layer 7 WebSocket balancing right now, which is a pain in AWS where ELBs give you auto-scaling and bypass the need to mess with keepalived or other HA mechanisms for your load balancer.

To be clear, the benefits of not reinventing the wheel and getting on with dev work probably outweigh these issues for many applications, even if you don't strictly need bi-directional communication. But when you are working at scale the inefficiencies and lack of control can be a big deal.

WebSocket Polyfill Pros

For the most part they just work, almost everywhere.
The most widely used ones are now battle hardened.
Leave you to write your app and not think about the annoying transport quirks described here.

WebSocket Polyfill Cons

Usually require sticky sessions for fallbacks to work.
Usually less efficient and/or far more complex than needed for simple notification-only applications when falling back due to emulating bi-directional API.
Can run into issues like exhausting connections to one domain in older browsers and deadlocking if you make other XMLHttpRequests to the same domain.
Usually don't give you good control of reconnect timeouts and jitter, which can limit your ability to prevent thundering herds of reconnections during incidents.
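For reference, the kind of control I mean in that last point is small. Below is a sketch of exponential back-off with "full jitter", shown in Python for brevity even though in practice it would live in your browser client. The function name and the default base and cap values are my own assumptions, not taken from any of the libraries mentioned.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect number `attempt` (0-based), with full jitter."""
    # The ceiling doubles with every failed attempt, up to `cap` seconds.
    ceiling = min(cap, base * (2 ** attempt))
    # Picking uniformly in [0, ceiling] spreads clients out instead of
    # having them all retry at the same instant after an outage.
    return rng.uniform(0.0, ceiling)
```

The point is that the spread and the cap are tunable by you, so during an incident you can decide how quickly a large population of clients comes back.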

Server Sent Events/EventSource

The EventSource API has been around a while now and enjoys decent browser support - on par with WebSockets. It interacts with a server-protocol named Server Sent Events. I'll just refer to both as "EventSource" from now on.

At first glance it looks ideal for the website notification use-case I see being so prevalent. It's not a bidirectional stream; it uses only HTTP/1.1 under the hood so works with most proxies and load balancers; a long-lived connection can send multiple events with low latency; it has a mechanism for assigning message ids and sending a cursor on reconnect; browser implementations transparently perform reconnects for you.

What more can you want? Well...

EventSource Pros

Plain HTTP/1.1 is easy to loadbalance at Layer 7.
Not stateful, no need for sticky sessions (if you design your protocol right).
Efficient for low-latency and high-throughput messages.
Built in message delimiting, ids, and replay on reconnect.
No need for the workarounds for quirks with plain chunked encoding.
Can automatically take advantage of HTTP/2 and share a connection with any other requests streaming or otherwise to the same domain (even from different tabs).
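To show how little there is to the built-in delimiting, ids and replay, here is a rough server-side sketch of the text/event-stream wire format. The function names are hypothetical, and treating ids as integers is my own assumption (the format itself only requires a string).

```python
def format_sse(data, event_id=None, event=None, retry_ms=None):
    """Serialise one message in the text/event-stream wire format."""
    lines = []
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")  # reconnection delay hint, in ms
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload gets its own "data:" field.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"  # a blank line terminates the message


def resume_cursor(headers):
    """Read the cursor a browser sends when it reconnects, if any."""
    # Assumes `headers` is a plain dict with this exact key casing.
    value = headers.get("Last-Event-ID")
    return int(value) if value is not None else 0
```

On reconnect the browser sends the last id it saw in the Last-Event-ID request header, so the server can replay from that point without holding any per-connection state.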

EventSource Cons

No IE support, not even IE 11 or Edge.
Never made it past a draft proposal, is no longer being worked on and remains only in the WHATWG living standard. While it should still work for some time in supported browsers, this doesn't feel like a tech that people are betting on for the future.
Browser reconnect jitter/back-off policy is not under your control which could limit your ability to mitigate outages at scale just as with WebSocket Polyfills above.
Some older browser versions have incorrect implementations that look the same but don't support reconnecting or CORS.
Long-lived connections can still be closed early by restrictive proxies.
Streaming fallbacks below are essentially the same (with additional work to implement reconnect, data framing and message ids) with better browser support - you probably need them anyway for IE.

XMLHttpRequest/XDomainRequest Streaming

Uses the same underlying mechanism as EventSource above: HTTP/1.1 chunked encoding on a long-lived connection, but without the browser handling the connection directly.

Instead XMLHttpRequest is used to make the connection. For cross-domain connections CORS must be used, or in IE 8 and 9 that don't have CORS support, the non-standard XDomainRequest is used instead.

These techniques are often referred to as "XHR/XDR Streaming".

XHR/XDR Streaming Pros

Plain HTTP/1.1 is easy to loadbalance at Layer 7.
Not stateful, no need for sticky sessions (if you design your protocol right).
Efficient for low-latency and high-throughput messages.
Can automatically take advantage of HTTP/2 and share a connection with any other requests streaming or otherwise to the same domain (even from different tabs).
Works in vast majority of browsers right back to IE 8 and relatively early versions of other major browsers.

XHR/XDR Streaming Cons/Gotchas

Still doesn't work (cross domain) in really old browsers like IE 7.
XDomainRequest used as fallback for IE 8 and 9 doesn't support cookies so you can't use this if you require both cross domain connection and cookies for auth or sticky sessions.
XDomainRequest also doesn't support custom headers. Possibly not a big deal but notable if you are trying to emulate EventSource which uses custom headers for retry cursor.
Long-lived connections can still be closed early by restrictive proxies.
Even well-behaved proxies will close idle connections, so your server needs to send a "ping" or heartbeat packet. Usually this is sent every 25-29 seconds to defeat 30 second timeouts reliably.
There is a subtle memory leak: the body text of the HTTP response grows and grows as new messages come in consuming more memory over time. To mitigate this, you need to set some limit on this body size and when it's passed close and re-open connection.
In order to defeat load balancer and proxy connection idle timeouts, server needs to periodically send something. Generally 25 seconds is recommended interval to work around proxies with a 30 second timeout.
IE 8 and 9 don't fire the XDR progress handler until they reach some threshold response size. This means you might need to send 2KB of junk when the request starts to avoid blocking your real events. Some other older browsers had similar issues with XMLHttpRequest but these are so rare they probably aren't worth supporting. In practice you can check whether the browser supports XMLHttpRequest.onprogress as an indicator that it will work without padding.
Some browsers require X-Content-Type-Options: nosniff header otherwise they delay delivering messages until they have enough data to sniff (I've seen reports this is 256 bytes for Chrome).
You may have to encapsulate messages in some custom delimiters since proxies might re-chunk the stream and you need to be able to recover original message boundaries to parse them. Google seems to use HTTP/1.1 chunked encoding scheme inside the response text to delimit "messages" (i.e. chunked encoding inside chunked encoding).
The big one: some proxies buffer chunked responses, delaying your events uncontrollably. You may want to send a ping message immediately on a new connection and then revert to non-streaming in the client if you don't get that initial message through soon enough.
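Several of these gotchas (custom delimiters, heartbeats, initial padding) can be handled by one small framing layer. The sketch below uses a hex length prefix in the style of chunked encoding, as mentioned above, but the exact format, the function names and the choice to send heartbeats as empty frames and padding as a frame of spaces are all my own assumptions for illustration. It is shown in Python; the decoding half would really run in the browser.

```python
def frame(message):
    """Prefix a message with its byte length in hex, as chunked encoding does."""
    body = message.encode("utf-8")
    return b"%x\r\n%s\r\n" % (len(body), body)


def heartbeat_frame():
    """An empty frame, sent periodically to keep idle connections open."""
    return frame("")


def padding_frame(size=2048):
    """A frame of spaces sent first, for browsers that buffer early bytes."""
    return frame(" " * size)


class FrameDecoder:
    """Rebuild whole messages from a stream that may have been re-chunked."""

    def __init__(self):
        self._buffer = b""

    def feed(self, chunk):
        """Add received bytes and return every message that is now complete."""
        self._buffer += chunk
        messages = []
        while True:
            header_end = self._buffer.find(b"\r\n")
            if header_end < 0:
                break  # length line not complete yet
            size = int(self._buffer[:header_end], 16)
            start = header_end + 2
            end = start + size
            if len(self._buffer) < end + 2:
                break  # body (plus trailing CRLF) not complete yet
            messages.append(self._buffer[start:end].decode("utf-8"))
            self._buffer = self._buffer[end + 2:]  # drop consumed bytes
        return messages
```

The client would simply discard empty or whitespace-only messages. Note that this decoder only holds unconsumed bytes, whereas a browser keeps the whole response text, which is why the periodic close and re-open described above is still needed there.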

XMLHttpRequest/XDomainRequest Long-polling

Same as XHR/XDR streaming except without chunked encoding on response. Each connection is held open by server as long as there is no message to send, or until the long-poll timeout (usually 25-60 seconds).

When an event arrives at the server that the user is interested in, a complete HTTP response is sent and the connection closed (assuming no HTTP keepalive).

XHR/XDR Long-polling Pros

Plain HTTP/1.0 (no chunked encoding) is easy to load balance at Layer 7.
Can automatically take advantage of HTTP/2.
Works even when proxies don't like long-lived connection, or buffer chunked responses for too long.
No need for server pings - natural long poll timeout is set short enough.

XHR/XDR Long-polling Cons

Same cross-domain/browser support issues as XHR/XDR streaming.
Overhead of whole new HTTP request and possibly TCP connection every 25 seconds or so.
To achieve high-throughput you have to start batching heavily either on server or by not reconnecting right away in the client to allow bigger batch of events to queue. If you have lots of clients all listening to high-throughput channel this adds up to a huge amount of HTTP requests unless you ramp up batching to be on order of 20-30 seconds. But then you are trading off latency - is 30 seconds latency acceptable for your "real-time" app?
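The trade-off in that last point is simple arithmetic. As a back-of-envelope sketch (the function name and the figure of 100,000 clients are hypothetical, chosen only to make the numbers concrete):

```python
def poll_requests_per_second(clients, batch_interval_s):
    """Steady-state request rate when every client re-polls once per batch."""
    return clients / batch_interval_s


if __name__ == "__main__":
    clients = 100_000  # hypothetical audience of one busy channel
    for interval in (1, 5, 25):
        rate = poll_requests_per_second(clients, interval)
        # Worst-case added latency is roughly one full batch interval.
        print(f"batch every {interval:>2}s -> {rate:>9,.0f} req/s")
```

With these assumed numbers, batching once a second means 100,000 requests per second, and stretching the batch to 25 seconds brings that down to 4,000 at the cost of up to 25 seconds of added latency.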

JSONP Long-polling

The most widely supported cross-domain long-polling technique is JSONP or "script tag" long-polling. This is just like XHR/XDR long-polling except that we are using JSONP to achieve cross-domain requests instead of relying on CORS or XDR support. This works in virtually every browser you could reasonably want to support.

JSONP Long-polling Pros

All the pros of XHR/XDR long-polling.
Works in ancient browsers too.
Supports cookies everywhere (although if your long poll server is on a totally separate root domain, browsers' third-party cookie restrictions might prevent that).

JSONP Long-polling Cons

Largely the same as XHR/XDR Long-polling Cons, minus the cross-domain issues.

Polling

Periodically issuing a plain old XHR (or XDR/JSONP) request to a backend which returns immediately.

Polling Pros

Very simple, always works, no proxy or load balancing issues
Can benefit from HTTP/2

Polling Cons

Not really "real-time" with an average of half the poll interval latency per message. That might be 5 or 10 seconds.
Perception is that this is expensive if you service the requests via your regular web app - it can easily create orders of magnitude more HTTP requests on your web servers. In practice you can use a highly optimised path on a separate service to mitigate this.

Others

There are many other variants I'm missing out as this is already fairly long. Most of them involve using a hidden iframe. Inside the iframe, HTML files with individual script blocks served with chunked encoding, or one of the above transports, receive events and call postMessage or a fallback method to notify the parent frame.

These variants are generally only needed if, for example, you have a requirement to support both streaming and a cookie-enabled transport in older browsers. I won't consider them further.

The Future(?)

You may have noticed if you use Chrome that Facebook can now send you notifications even when you have no tab open. This is a Chrome feature that uses a new Web Push standard currently in draft status.

This standard allows browsers to subscribe to any compliant push service and monitor for updates even when your site isn't loaded. When they come in, service workers can be called to handle the notification.

Great! Soon we won't have to worry about this transport stuff at all. All browsers will support this and we'll all have lovely open-source libraries to easily implement that backend. (But see update below.)

But that's some way off. Currently Chrome only supports a modified version that doesn't follow the standard because it uses their existing proprietary Google Cloud Messaging platform (although they claim to be working with Mozilla on a standards-compliant version).

Firefox is working on an implementation (in Nightlies) but it's going to be some years yet before there is enough browser support for this to replace any of the other options for the majority of users.

I came across this standard after writing most of the rest of this post and I would like to pick out a few points that reinforce my main points here:

It's a push-only technology.
The transport is HTTP/2 server push which can fall back to regular HTTP poll. No WebSockets. No custom TCP protocol. Presumably if you have push enabled over HTTP/2 in your browser, then your actual site requests could be made over it too, meaning that in some cases it might even cut down on connection overhead for your main page loads... That's pure speculation though.
The spec explicitly recommends against application-level fallbacks although clearly they will be needed until this spec is supported virtually everywhere which will be at least a few years away.

Update 13th Jan 2016

After reading the spec closely and trying to think about how to use this technology it became clear that it might not be a good fit for general purpose in-page updates.

I clarified with the authors on the mailing list (resulting in this issue). The tl;dr: this is designed similarly to native mobile push - it's device-centric rather than general pub/sub and is intended for infrequent messages that are relevant to a user outside of a page context. Right now implementations limit or forbid its use for anything that doesn't display browser notifications. If that's all you need, you may be able to use it in-page too, but for live-updating comment threads in your app where you only care about updates for the thread visible on page, it won't be the solution.

Do you need bi-directional sockets?

My thoughts here have a bias towards real-time notifications on websites which really don't require bi-directional low-latency sockets.

Even applications like "real-time" comment threads probably don't - submitting content as normal via POST and then getting updates via streaming push works well for Discourse.

It's also worth noting that GMail uses XHR Streaming and Facebook uses boring XHR long-polls even on modern browsers. Twitter uses even more unsexy short polls every 10 seconds (over HTTP/2 if available). These sites for me are perfect examples of the most common uses for "real-time" updates in web apps and support my conclusion that most of us don't need WebSockets or full-fidelity fallbacks - yet we have to pay the cost of their downsides just to get something working easily.

Sam Saffron's MessageBus is a notable exception which follows this line of thinking; however, it's only aimed at Ruby/Rack server apps.

I find myself wishing for a generalisation of MessageBus' transport that can be made portable across other applications, something like SockJS or Socket.io but without the goal of bi-directional WebSocket emulation. Eventually it could support Web Push where available and pave the way for adopting that in the interim before browsers support it. Perhaps an open-source project in the making.

Thanks to Sam Saffron, Alexandr Emelin and Micah Goulart who read through a draft of this very long post and offered comments. Any mistakes are wholly my own - please set me straight in the comments!

Understanding Distributed System Guarantees

Paul Banks — Mon, 24 Aug 2015 00:00:00 GMT

Tyler Treat just published another great article about Distributed Systems and the limited value of strong guarantees they might claim to provide.

I'll start with a word of thanks to Tyler - his blog is a great read and well recommended for his articulation and clarity on many computer science subjects that are often muddled by others.

Tyler's article focuses specifically on distributed messaging guarantees but at least some of the discussion is relevant to or even intimately tied to other distributed problems like data consistency and consensus.

I agree with all of his points and I hope this article is a complement to the discussion on trade-offs in the distributed design space.

The article got me thinking about the inverse question - when is it sensible to incur the overhead of working around reduced guarantees, assuming doing so is non-trivial?

When is Strong Consistency worth it?

Let's consider Google's database evolution (of which I know nothing more than you can read for yourself in these papers).

In 2006 Google published a paper on BigTable. Unlike Dynamo and others, BigTable made some attempt to guarantee strong consistency per row. But it stopped there; no multi-row atomic updates, certainly no cross-node transactions. Five years later a paper on their MegaStore database was published. The motivation includes the fact that "[NoSQL stores like BigTable's] limited API and loose consistency models complicate application development".

A year later details of Spanner emerged, and in the introduction we discover that despite the high performance cost, engineers inside Google were tending to prefer to use MegaStore over BigTable since it allowed them to get on with writing their apps rather than re-inventing solutions for consistent geographical replication, multi-row transactions, and secondary indexing (my gloss on the wording there).

Google's cream-of-the-crop engineers with all the resources available to them chose to trade performance for abstractions with stronger guarantees.

That doesn't mean that MegaStore and Spanner can never fail. Nor (I guess) that Google's internal users of those systems are blindly assuming those guarantees hold in all cases in application code. But, at least for Google, providing stronger guarantees at the shared datastore level was a win for their engineering teams.

Just because it's Google doesn't make it right, but it is worth noting nonetheless.

Commutativity and Idempotence

A sound conclusion of Tyler's post is that you are better served if you can shape your problems into commutative and idempotent actions which can naturally tolerate relaxed guarantees correctly.

This is undoubtedly true.
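To illustrate the distinction with a toy example of my own (the function names are hypothetical and nothing here comes from Tyler's article): adding an item to a set tolerates both re-ordering and duplicate delivery, while incrementing a counter tolerates re-ordering but not duplicates.

```python
from functools import reduce
from itertools import permutations


def apply_all(state, updates, op):
    """Apply each update to the state in the order it was delivered."""
    return reduce(op, updates, state)


def add_to_set(state, item):
    """Commutative and idempotent: order and duplicates don't matter."""
    return state | {item}


def increment(state, amount):
    """Commutative but not idempotent: a duplicate delivery changes the result."""
    return state + amount


if __name__ == "__main__":
    deliveries = ["x", "y", "x"]  # "x" was delivered twice
    results = {apply_all(frozenset(), order, add_to_set)
               for order in permutations(deliveries)}
    print(results)  # a single distinct result, whatever the order
    print(apply_all(0, [1, 2, 1], increment))  # the duplicated 1 is counted twice
```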

But in cases where idempotence or commutativity are not naturally available, at least some of the possible solutions vary little between applications.

For example de-duplicating events based on a unique ID is a common requirement. It is equivalent to attempting to provide exactly-once delivery. Tyler points out this is impossible to do perfectly, nonetheless some applications require at least some attempt at de-duplicating event streams.

Isn't it better, given a case that requires it, to have an infrastructure that has been specifically designed and tested by distributed systems experts that provides clear and limited guarantees about de-duplicated delivery? Certainly if the alternative is to re-solve that hard problem (and possibly incur the significant storage cost) in every system we build that is not trivially idempotent. The centralised version won't be perfect, but in terms of development cost it might well be a cheaper path to "good enough".
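As a sketch of why this is both commonly needed and inherently imperfect, here is a bounded de-duplicator that remembers only the most recent IDs. The class name and the capacity parameter are hypothetical; the capacity is a stand-in for the storage cost mentioned above.

```python
from collections import OrderedDict


class Deduplicator:
    """Drop events whose ID is among the last `capacity` distinct IDs seen."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._seen = OrderedDict()  # insertion-ordered set of recent IDs

    def accept(self, event_id):
        """Return True the first time an ID is seen, False for a remembered duplicate."""
        if event_id in self._seen:
            return False
        self._seen[event_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # forget the oldest ID
        return True
```

A duplicate that arrives after its ID has been evicted is accepted again, so the guarantee is "no duplicates within the window" rather than exactly-once. That is the kind of clear and limited guarantee I would rather have stated once by shared infrastructure than rediscovered in every application.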

Trade-offs

The title of Tyler's article is about "Understanding Trade-Offs". This really is key.

To me the conclusion of the article is spot on: it's better to consider up-front the real guarantees required and what cost they come at. I just want to add the (probably obvious) idea that application code complexity, re-solving similar problems and the extent to which every application developer needs to be a distributed systems expert are real costs to throw into the trade-off.

The Google school of thought argues that it's better to favour simplicity for application developers and pay the performance cost of that simpler model until the application really needs the increased performance - at this point the additional application complexity can be justified.

This is orthogonal to Tyler's point that the application developer needs to have clarity on where and how the claimed guarantees of a system break down and how that affects the correctness or availability requirements of their own application. To me that's a given, but I don't think it devalues systems that do attempt to provide higher-level guarantees, provided they are properly understood.

Google's AdSense/DFP PII Privacy Gotcha

Paul Banks — Tue, 07 Oct 2014 00:00:00 GMT

Google AdSense (and other advertising products) appear to have turned on a new detection system for violations of their PII policy. Here are a couple of easy steps to fall foul of it without meaning to.

What is a PII (Personally Identifiable Information) Violation?

Google's support document for their privacy policy describes the issue well.

Specifically the line:

In particular, please make sure that pages that show ads by Google do not contain your visitors' usernames, passwords, email addresses or other personally-identifiable information (PII) in their URLs.

Easy right? Just don't be dumb and pass the visitor's info in the URL.

Here are a couple of ways to fail at that automatically - you may well be doing them yourself right now.

Fail #1: Your Search Results Page

If, like every other site on the Internet, you have a search feature, and like every other search engine (copying Google themselves) you present the search results at a URL with the user's query in a GET parameter e.g. search/?q=ponies, then you will probably violate AdSense policy eventually.

All but one of the handful of breach notifications I've come across from Google are due to people searching for an email address. That obviously results in a URL that looks like search/?q=user@example.com and Google's apparently new automated detection of PII violations flags that.

As an aside, it's quite likely that they are not searching for their own email address and so it's not really PII, but who can know? Google will see an email address in the URL and call it a breach.

But that's pretty standard behavior for a search box right? We'll see what we can do in a bit.

Fail #2: Users Saving Your Content For "Offline Use"

Less obviously, you may have users who like to "save" pages on your site for "offline" use. Of course if they really are offline when they view it they will not make any ad requests to Google and you'll be fine.

But in at least one case I've come across, a user saved a page on a site which uses Google AdSense to a folder on their machine like /Users/name@example.com/stuff/your-page.htm. They must have loaded this in their browser while still online and the resulting ad calls from the JS embedded in the page when they saved it fire off to Google with ?url=file:///Users/name@example.com/... in the URL, violating their policy.

Anti-Solution

Actually in both of these cases it's a little hard to find a good solution. For the first you could perhaps have a special encoding of email addresses in search JS but that goes against the spirit of the policy - the email info is still there if it is in fact PII (probably not if visitor is searching for it but there you go). Not to mention relying on JS instead of regular GET form submission.

It's a nasty hack and misses many other cases. The second example is one of many possible causes you would probably never think of, none of which would be covered by sticking plasters like that.

Solution

It would seem the most pragmatic solution is to adjust your ad serving logic to scan the URL and referrer for email addresses and just opt to not show ads on that page if you find any. Hopefully it's rare enough not to lose you too much revenue and the alternatives are likely more expensive.

One thing to note though: checking for email addresses on server side is probably not enough. It wouldn't have caught that second case above and there are many other cases where it might not work.

So if you rely totally on Google's provided libraries to display your ads, you may need to write a small wrapper to handle this case on the client side. It seems like this should be something Google's JS does automatically or at least something you can opt-in to.
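The detection logic itself is short. Here is a rough sketch in Python; the same check would need porting to JavaScript for the client-side wrapper. The function names are hypothetical and the regular expression is a deliberately loose approximation of an email address, not something taken from Google's policy or libraries.

```python
import re
from urllib.parse import unquote

# Loose pattern: something@something.tld. It will have false positives
# and false negatives; erring towards not showing ads is the safe side.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_email(url):
    """True if the URL appears to contain an email address, even if percent-encoded."""
    # Decode once so that "%40" is seen as "@".
    return EMAIL_RE.search(unquote(url)) is not None


def should_show_ads(page_url, referrer=""):
    """Skip ads when either the page URL or the referrer looks like it holds an email."""
    return not (contains_email(page_url) or contains_email(referrer))
```

This catches both failure cases described earlier (the search query and the file:/// path), but only decodes one level of percent-encoding, so doubly-encoded addresses would slip through.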

LMDB: The Leveldb Killer?

Paul Banks — Thu, 15 Aug 2013 00:00:00 GMT

I've been quiet for a while on this blog, busy with many projects, but I just had to comment on my recent discovery of Lightning Memory-Mapped Database (LMDB). It's very impressive, but left me with some questions.

Disclaimer

Let me start out with this full acknowledgement that I have not yet had a chance to compile and test LMDB (although I certainly will). This post is based on my initial response to the literature and discussion I've read, and a quick read through the source code.

I'm also very keen to acknowledge the author Howard Chu as a software giant compared to my own humble experience. I've seen other, clearly inexperienced developers online criticising his code style and I do not mean to do the same here. I certainly hope my intent is clear and my respect for him and this project is understood throughout this discussion. I humbly submit these issues for discussion for the benefit of all.

Understanding the Trade-offs

First up, with my previous statement about humility in mind, the biggest issue I ran up against when reviewing LMDB is partly to do with presentation. The slides and documentation I've read do a good job of explaining the design, but not once in what I've read was there any more than a passing mention of anything resembling a trade-off in the design of LMDB.

My engineering experience tells me that no software, especially when attempting to claim "high performance" comes without some set of assumptions and some trade-offs. So far everything I have read about LMDB has been so positive I'm left with a slight (emphasis important) feel of the "silver bullet" marketing hype I'd expect from commercial database vendors and which I've come to ignore.

Please don't get me wrong, I don't think the material I've reviewed is bad, just seems to lack any real discussion of the downsides - the areas where LMDB might not be the best solution out there.

On a personal note, I've found the apparent attitude towards leveldb and Google engineers a little off-putting too. I respect the author's opinion that an LSM tree is a bad design for this purpose, but the lack of respect toward it and its authors that comes across in some presentations seems detrimental to the discussion of the engineering.

So to sum up the slight gripe here: engineers don't buy silver-bullet presentations. A little more clarity on the trade-offs is important to convince us to take the extraordinary benchmark results seriously.

[edit] On reflection the previous statement goes too far - I do take the results seriously - my point was more that they may seem "too good to be true" without a little more clarity on the limitations. [/edit]

My Questions

I have a number of questions that I feel the literature about LMDB doesn't cover adequately. Many of these are things I can and will find out for myself through experimentation but I'd like to make them public so anyone with experience might weigh in on them and further the community understanding.

Most of these are not really phrased as questions, more thoughts I had that literature does not address. Assume I'm asking the author or anyone with insight their thoughts on the issues discussed.

To reiterate, I don't claim to be an expert. Some of my assumptions or understanding that lead to the issues below may be wrong - please correct me. Some of these issues may not be at all important in many use cases either. But I'm interested to understand these areas more, so please let me know if you have thoughts or, preferably, experience with any of this.

Write Amplification

It seems somewhat skimmed over in LMDB literature that the COW B-tree design writes multiple whole pages to disk for every single row update. That means that if you store a counter in each entry then an increment operation (i.e. changing 1 or 2 bits) will result in some number of pages (each 4 KB by default) of the DB being written to disk. I've not worked out the branching factor given page size for a certain average record size, but I guess in realistic large DBs that could be in the order of 3-10 4 KB pages written for a single bit change in the data.
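To put rough numbers on that guess, here is a back-of-the-envelope sketch. It assumes a perfectly balanced tree, ignores page headers and fill factor, and uses the same entry size for branch and leaf pages, none of which is taken from LMDB itself. The function name and the input figures are made up for illustration.

```javascript
// Rough estimate of how many bytes a copy-on-write B-tree writes per update.
// All inputs are illustrative assumptions, not LMDB measurements.
function cowWriteCost({ pageSize, avgEntrySize, entryCount, changedBytes }) {
  // Entries that fit in one page (ignores page headers and fill factor).
  const fanout = Math.floor(pageSize / avgEntrySize);
  // Smallest number of levels such that fanout^depth >= entryCount.
  let depth = 1;
  let capacity = fanout;
  while (capacity < entryCount) {
    capacity *= fanout;
    depth += 1;
  }
  // A COW update copies every page on the root-to-leaf path.
  const bytesWritten = depth * pageSize;
  return { fanout, depth, bytesWritten, amplification: bytesWritten / changedBytes };
}

const estimate = cowWriteCost({
  pageSize: 4096,      // 4 KB pages
  avgEntrySize: 256,   // assumed average key + value size
  entryCount: 100e6,   // assumed 100 million entries
  changedBytes: 8,     // one 64-bit counter
});
console.log(estimate);
```

With these made-up inputs the path is seven pages deep, so an 8 byte counter update turns into 28 KB of page writes. Smaller records give a higher fanout and a shallower tree, so the real figure depends heavily on the data.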

All that is said is that "it's sequential IO so it's fast". I understand that but I'd like to understand more of the qualifiers. For leveldb in synchronous mode you only need to wait for the WAL to have the single update record appended. Writing 10s of bytes vs 10s or 100s of kbytes for every update surely deserves a little more acknowledgement.

In fact if you just skimmed the benchmarks you might have missed it but in all write configurations (sync, async, random, sequential, batched) except for batched-sequential writes, leveldb performs better, occasionally significantly better.

Given that high update throughput is a strong selling point for leveldb, and that LMDB was designed initially for a read-heavy use case, I feel that although the numbers are present in the stats, the rest of the literature seems to ignore this trade-off as if it wasn't there at all.

File Fragmentation

The free-list design for reclaiming disk space without costly garbage collection or compaction is probably the most important advance here over other COW B-tree designs. But it seems to me that the resulting fragmentation of data is also overlooked in discussion.

It's primarily a problem for sequential reads (i.e. large range scans). In a large DB that has been heavily updated, presumably a sequential read will on average end up having to seek backwards and forwards for each 4k page as they will be fragmented on disk.

One of the big benefits of LSM Tree and other compacting designs is that over time the majority of the data ends up in higher level files which are large and sorted. Admittedly, with leveldb, range scans require a reasonable amount of non-sequential IO as you need to switch between the files in different levels as you scan.

I've not done any thorough reasoning about it, but my intuition is that with leveldb the relative amount of non-sequential IO needed will at least remain somewhat linear as more and more data ends up in higher levels where it is actually sequential on disk. With LMDB it seems to me that large range scans are bound to perform increasingly poorly over the life of the DB even if the data doesn't grow at all and is just updated regularly.

But also, beyond the somewhat specialist case of large range scans, it seems to be an issue for writes. The argument given above is that large writes are OK because they are sequential IO but surely once you start re-using pages from the free list this stops being the case. What if blocks 5, 21 and 45 are next free ones and you need to write 3 tree pages for your update? I'm aware there is some attention paid to trying to find contiguous free pages but this seems like it can only be a partial solution.
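The difference between appending pages and reusing scattered free pages can be made concrete by counting contiguous runs. This is only a sketch of the reasoning: the `countRuns` name is mine, and it says nothing about how LMDB actually picks pages from its free list.

```javascript
// Count how many separate contiguous runs a set of page numbers forms.
// One run can be written sequentially; every extra run implies another seek.
function countRuns(pageNumbers) {
  const sorted = [...pageNumbers].sort((a, b) => a - b);
  let runs = 0;
  for (let i = 0; i < sorted.length; i++) {
    // A new run starts at the first page or wherever the sequence breaks.
    if (i === 0 || sorted[i] !== sorted[i - 1] + 1) runs += 1;
  }
  return runs;
}

console.log(countRuns([100, 101, 102])); // pages appended together: 1
console.log(countRuns([5, 21, 45]));     // pages reused from the free list: 3
```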

The micro benchmarks show writes are already slower than leveldb but I'd be very interested to see a long-running more realistic benchmark that shows the performance over a much longer time where fragmentation effects might become more significant.

Compression

The LMDB benchmarks simply state that "Compression support was disabled in the libraries that support it". I understand why but in my opinion it's a misleading step.

The author states "any compression library could easily be used to handle compression for any database using simple wrappers around their put and get APIs". But that is totally missing the point. Compressing each individual value is a totally different thing to compressing whole blocks on disk.

Consider a trivial example: each value might look like {"id": 1234567, "referers": ["http://example.com/foo", "https://othersite.org/bar"] }. On its own, gzipping that value is unlikely to give any real saving (the repetition of 'http' possibly, but the gzip header is more than the saving there). Whereas compressing a 4k block of such values is likely to give a significant reduction, even if it is only in the JSON field names repeated each time.

This is a trivial example I won't pursue and better serialisation could fix that but in my real-world experience most data even with highly optimised binary serialisation often ends up with a lot of redundancy between records - even if it's just in the keys. Block compression is MUCH more effective for the vast majority of data types than the LMDB author implies with that comment.
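The per-value versus per-block difference is easy to try with Node's built-in zlib module. The sketch below invents 40 records shaped like the example value (the varying ids and paths are made up) so that together they are roughly one 4k block. It uses gzip rather than snappy purely because gzip ships with Node, so the exact ratios will differ from what leveldb achieves.

```javascript
const zlib = require('zlib');

// Build records shaped like the example value; ids and paths are made up.
const records = [];
for (let i = 0; i < 40; i++) {
  records.push(JSON.stringify({
    id: 1234567 + i,
    referers: [`http://example.com/foo${i}`, `https://othersite.org/bar${i}`],
  }));
}

const rawBytes = records.reduce((n, r) => n + Buffer.byteLength(r), 0);
// Per-value compression: every record pays the gzip header and trailer on its own.
const perValueBytes = records.reduce((n, r) => n + zlib.gzipSync(r).length, 0);
// Block compression: all records share one stream, so repeated field names collapse.
const blockBytes = zlib.gzipSync(records.join('')).length;

console.log({ rawBytes, perValueBytes, blockBytes });
```

Running it prints the three sizes side by side; the interesting comparison is how far `blockBytes` falls below both of the others.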

Leveldb's file format is specially designed in such a way that compression is possible and effective and it seems Google's intent is to use it as a key part of the performance of the data structure. Their own benchmarks show performance gains of over 40% with compression enabled. And that is ignoring totally the size on-disk which for many will be a fairly crucial part of the equation especially if relatively expensive SSD space is required.

One argument might be that you could apply compression at block level to LMDB too, but I don't think it would be easy at all. It seems like it relies on fixed block size for its addressing, and compressing contents and leaving blanks gives no disk space saving and probably no IO saving either since all 4k is likely still read from disk.

I'm pretty wary of the benchmarks where leveldb has compression off since I see it as a fairly fundamental feature of leveldb that it is very compression friendly. Any real implementation would surely have compression on since there are essentially no downsides due to the design. It's also baked in (provided you have the snappy lib) and on by default for leveldb so it's not like it's an advanced bit of tuning/modification from basic implementation to use compression for leveldb.

Maybe I'm wrong and it's trivial to add effective compression to LMDB, but if so, and if doing it would give a ~40% performance increase, why is it not already done and compared?

I'd like to see the benchmarks re-run with compression on for leveldb. Given writes are already quicker for leveldb this more realistic real-world comparison might well give a better insight into the tradeoffs of the two designs. If I get a chance I will try this myself.

Large Transactions Amplify Writes Even Further

LMDB makes a big play of being fully transactional. It's a great feature and implemented really well. My (theoretical) problem is to do with write performance - we've already seen how writes can be slower due to COW design but how about the case when you update many rows in one transaction?

Consider the worst case where you modify 1 row in every leaf node: the transaction commit will then re-write every block in the database file. I realise that there is currently a limit on how many dirty pages can be accumulated by a single transaction, but I've also read there are plans to remove this.
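A small model shows how quickly a transaction's dirty page count approaches the whole file. It assumes a perfectly balanced tree with a fixed fanout and numbers the leaves from zero; both are simplifications of mine rather than anything from the LMDB source.

```javascript
// Count the distinct pages a COW B-tree commit rewrites for a set of touched leaves.
// Model: balanced tree, fixed fanout, `depth` levels including leaves and root.
function dirtyPages(touchedLeaves, fanout, depth) {
  let total = 0;
  let level = new Set(touchedLeaves);
  for (let d = 0; d < depth; d++) {
    total += level.size;
    // The parent of node n on the next level up is floor(n / fanout).
    level = new Set([...level].map((n) => Math.floor(n / fanout)));
  }
  return total;
}

// A 3-level tree with fanout 16 has 256 leaves + 16 branches + 1 root = 273 pages.
const allLeaves = Array.from({ length: 256 }, (_, i) => i);
console.log(dirtyPages([37], 16, 3));      // one row changed: 3
console.log(dirtyPages(allLeaves, 16, 3)); // one row in every leaf: 273
```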

Leveldb by contrast can do an equivalent atomic batch write without anywhere near the same disk IO in the commit path. It would seem this is a key reason leveldb is so much better in random batch write mode. Again I'd love to see the test repeated with leveldb compression on too. [edit] On reflection, probably not such a big deal - writes to the WAL in leveldb won't be affected by compression. [/edit]

It may not be a problem for your workload, but it might. Having certain writes use so much IO could cause you some real latency issues and, given the single writer lock, could give you IO-based stalls similar to those leveldb is known for due to its background compaction.

I'll repeat this is all theoretical but I'd want to understand a lot more detail like this before I used LMDB in a critical application.

Disk Reclamation

Deleting a large part of the DB does not free any disk space for other DBs or applications in LMDB. Indeed there is no built in feature or any tools I've seen that will help you re-optimise the DB after a major change, nor help migrate one DB to another to reclaim the space.

This may be a moot point for many but for practical systems, having to solve these issues in the application might add significant work for the developer and operations teams where others (leveldb) would eventually reclaim the space naturally with no effort.

Summary

I feel to counter the potentially negative tone I may have struck here, I should sum up by saying LMDB looks like a great project. I'm genuinely interested in the design and performance of all the options in this space.

I would suggest that a real understanding of the strengths and weaknesses of each option is an important factor in making real progress in the field. I'd humbly suggest that, if the author of LMDB was so inclined, including at least some discussion of some of these issues in the docs and benchmarks would benefit all.

I'll say it again: if Howard or anyone else who's played with LMDB would like to comment on any of these issues, I'm looking forward to learning more.

So is LMDB a leveldb killer? I'd say it seems good, but more data required.

Meet Handlebars.js

Paul Banks — Mon, 31 Dec 2012 00:00:00 GMT

In making this blog I ended up using Yehuda Katz' Handlebars.js for templating. It has some interesting features I'll introduce here, but arguably dilutes Mustache's basic philosophy somewhat.

I found Handlebars to be a powerful extension to Mustache but I want to note up-front that it quite possibly isn't the best option in every case. Certainly if you need implementations outside of Javascript it's not (yet) for you, however I'm also aware that the extra power added comes with a potential cost: you can certainly undo many of the benefits of separating logic and template.

With that note in place, I'll introduce the library.

Why Handlebars?

Yehuda has already outlined his rationale for creating Handlebars so I won't go into too much detail here. The important goals can be summed up as:

Global, contextual helpers: Mustache allows helper methods in views but they must be defined in the view object (Yehuda calls this the "context" object so I'll keep that terminology from now on). Further, there is intentionally no way to pass arguments to these methods so even if they are defined globally and "mixed-in" or inherited into each view, they are fairly limited in scope.
More flexibility with accessing data from parent contexts: inside blocks, Mustache makes it tricky to access properties of the parent scope outside the block.
Precompilation support: you can pre-compile templates into native JS code. In a browser context this saves the client from the string parsing overhead.

I encourage you to read his article for a lot more detail and explanation of those points but we'll crack on for now.

I won't cover all the features here. You can read them in the documentation. For now I want to highlight the power (and possible danger) of helpers.

In the case of my static site generation system, my main goal was to have a very thin layer of logic on top of simple content-with-meta-data files with some simple naming conventions. I wanted flexibility in the templating system so that I could generate menus or listings of content without writing extra code for each case.

With Mustache, this flexibility had to happen in the view layer, and it became a little clumsy to express the data sets required for any page in a general and extensible way.

It turned out to be much neater and require a lot less "magic" code to be able to make the templates a little more expressive. Helpers were the key.

Helpers

Handlebars adds to Mustache the ability to register helpers that can accept contextual arguments. Helpers are simply callbacks that are used to render {{mustaches}} or {{#blocks}}{{/blocks}}. They can be registered globally or locally in a specific view. We'll use global registration here to keep examples clearer.

Here's a basic example of a block helper that could be used for rendering list markup.

{{title}}
{{#list links}}
    <a href="{{url}}">{{name}}</a>
{{/list}}

Here's the context used

{
    title: 'An example',
    links: [
        {url: 'http://example.com/one', name: 'First one'},
        {url: 'http://example.com/two', name: 'Second one'}
    ]
}

And here is the list helper definition:

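Handlebars.registerHelper('list', function(links, options){
    // Wrap the rendered items in list markup
    var html = "<ul>\n";
    for (var i = 0; i < links.length; i++) {
        // options.fn renders the block content with links[i] as context
        html += "\t<li>" + options.fn(links[i]) + "</li>\n";
    }
    return html + "</ul>\n";
});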

When you compile this and render with the context data above, you would get the following output:

An example

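<ul>
    <li><a href="http://example.com/one">First one</a></li>
    <li><a href="http://example.com/two">Second one</a></li>
</ul>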

You can read similar examples in the documentation which have much more complete explanations of the details here but the basics should be clear:

The helper is called like a regular Mustache expression — in this case it's a block but non-blocks work too.
The links array from the context data is passed as an argument. Handlebars allows passing more than one argument or no arguments. The options arg is always present as the last one; any others before it are positional, passed through from the template expression. (You can also use non-positional hash arguments.)
The options arg is passed a hash containing a few things. Most significant here is options.fn, which is the compiled template function for the block's content. That means you can call it with either the current context this or some other data context. In this example we are passing links[i], which means the inner block can use {{url}} and {{name}} directly from the current link's context.
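To see what options.fn is doing without involving Handlebars at all, the helper body can be called as a plain function with a hand-written stand-in for the compiled block. In this sketch `listHelper` and `fakeOptions` are names I've made up, and the `<ul>`/`<li>` wrapping is assumed from the "list markup" description; with the real library the function would be passed to Handlebars.registerHelper instead.

```javascript
// The list helper body as a plain function, so it can be called directly.
function listHelper(links, options) {
  let html = '<ul>\n';
  for (const link of links) {
    // options.fn renders the block's inner template against the given context.
    html += '\t<li>' + options.fn(link) + '</li>\n';
  }
  return html + '</ul>\n';
}

// Hypothetical stand-in for the compiled block `<a href="{{url}}">{{name}}</a>`.
const fakeOptions = {
  fn: (ctx) => `<a href="${ctx.url}">${ctx.name}</a>`,
};

const listHtml = listHelper(
  [{ url: 'http://example.com/one', name: 'First one' }],
  fakeOptions
);
console.log(listHtml);
```

The helper never looks inside the link objects itself; it only decides which context each pass of the inner block gets.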

With that very brief overview example, I want to move on to more interesting examples. If you want to read more about the specifics about what happened there then I'd encourage reading the block helpers documentation.

Helpers for Content Selection

Before I continue, I need to acknowledge that what follows breaks everything you know about MVC separation of concerns. I know. Bear with me for now.

My site generation system builds the site files based on filesystem naming conventions. For things like the blog home page I wanted to show the 5 most recent blog posts.

Internally the system reads the whole content file structure and builds an in-memory model of the content. Each directory has two indices: one for all articles with a date in the file name (most recent first) and an index of all other article files in alphabetical order. You can then get the object representing that directory and list the articles in either the date-based or name-based index.
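As a rough illustration of those two indices, the sketch below splits a directory listing by whether the file name starts with a YYYY-MM-DD date. The function name, the returned shape and the sample file names are assumptions for the example; the real generator's internals are not shown in this post.

```javascript
// Split a directory's file names into a date index (newest first)
// and a name index (alphabetical). Shapes and names here are assumed.
const DATE_PREFIX = /^(\d{4}-\d{2}-\d{2})/;

function buildIndices(fileNames) {
  const dated = [];
  const named = [];
  for (const name of fileNames) {
    const match = DATE_PREFIX.exec(name);
    if (match) dated.push({ name, date: match[1] });
    else named.push({ name });
  }
  // ISO dates sort correctly as strings, so reverse string order is newest first.
  dated.sort((a, b) => (a.date < b.date ? 1 : a.date > b.date ? -1 : 0));
  named.sort((a, b) => a.name.localeCompare(b.name));
  return { date: dated, name: named };
}

const indices = buildIndices([
  'contact.md',
  '2012-12-29-fancy-new-blog.md',
  'about.md',
  '2012-12-31-meet-handlebars.md',
]);
console.log(indices.date.map((p) => p.name));
console.log(indices.name.map((p) => p.name));
```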

For convenience, I developed an internal API that made this easy using "content URLs" for example Content.get('/p/?type=date&limit=5') which will return the most recent 5 dated articles in the /p/ directory.

From there it is pretty simple to be able to make a block helper that allows templates like this:


{{#pages '/p/?type=date&limit=5'}}
    <a href="{{url}}">{{title}}</a>
{{/pages}}

The pages helper accepts a string argument (the internal content URL) and uses it to fetch the relevant page objects from the content model.
In this case options.fn is passed the page object itself so can render any property of the page.
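A minimal sketch of how such a helper could hang together is below. The `model` object, `getContent` and `pagesHelper` are hypothetical stand-ins for the internal Content.get API mentioned above, and the page URLs are invented. The helper is written as a plain function; in the real system it would be registered with Handlebars.registerHelper.

```javascript
// Hypothetical in-memory content model: directory path -> indices of page objects.
const model = {
  '/p/': {
    date: [
      { url: '/p/2012-12-31-meet-handlebars', title: 'Meet Handlebars.js' },
      { url: '/p/2012-12-29-fancy-new-blog', title: 'Fancy New Blog' },
      { url: '/p/2012-11-15-php-arrays-again', title: 'PHP Arrays (Again)' },
    ],
    name: [],
  },
};

// Resolve a content URL such as '/p/?type=date&limit=5' against the model.
function getContent(contentUrl) {
  // The base is only there so the URL parser accepts a relative path.
  const parsed = new URL(contentUrl, 'http://content.invalid');
  const dir = model[parsed.pathname];
  if (!dir) return [];
  const type = parsed.searchParams.get('type') || 'name';
  const limit = Number(parsed.searchParams.get('limit')) || Infinity;
  return (dir[type] || []).slice(0, limit);
}

// Block helper body: render the inner template once per matching page.
function pagesHelper(contentUrl, options) {
  return getContent(contentUrl).map((page) => options.fn(page)).join('');
}

const listing = pagesHelper('/p/?type=date&limit=2', {
  fn: (page) => `<a href="${page.url}">${page.title}</a>\n`,
});
console.log(listing);
```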

Next and Previous

But listings aren't the only case where this is useful. At the bottom of each blog article I have links to next/previous articles (if they exist) and these need the URL and title of the neighbouring items in the dated index.

I did this with another couple of block helpers. The blog template looks a bit like this:

{{title}}
{{{content}}}

    {{#prev_page}}
        <a href="{{url}}">« {{title}}</a>
    {{/prev_page}}
    {{#next_page}}
        <a href="{{url}}">{{title}} »</a>
    {{/next_page}}

The helper itself uses this which is the current context (in this case the main blog article being displayed). It then looks up in the content index the article's parent directory, and locates the previous or next item in the index relative to the current one. It then calls options.fn with the neighbouring article object as context.
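The lookup described above might be sketched like this. The `dateIndex` array and the `neighbourHelper` factory are invented for the example, and it matches pages by `url`, which is an assumption about how articles are identified.

```javascript
// Hypothetical date index for one directory, newest first.
const dateIndex = [
  { url: '/p/c', title: 'Newest' },
  { url: '/p/b', title: 'Middle' },
  { url: '/p/a', title: 'Oldest' },
];

// Build a block helper that renders the neighbour `offset` places away in the index.
function neighbourHelper(index, offset) {
  return function (options) {
    // Handlebars calls helpers with `this` bound to the current context.
    const position = index.findIndex((page) => page.url === this.url);
    const neighbour = position === -1 ? undefined : index[position + offset];
    // Render nothing when there is no neighbour (first or last article).
    return neighbour ? options.fn(neighbour) : '';
  };
}

// With newest-first ordering, the previous (older) article is one place later.
const prevPage = neighbourHelper(dateIndex, +1);
const nextPage = neighbourHelper(dateIndex, -1);

const linkBlock = { fn: (page) => `<a href="${page.url}">${page.title}</a>` };
console.log(prevPage.call(dateIndex[1], linkBlock)); // link to 'Oldest'
console.log(nextPage.call(dateIndex[0], linkBlock)); // empty string, no newer article
```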

Pushing the Boundaries

From here there are a lot of grey areas you could probe with this powerful construct. For example, let's assume you have different modules of your app rendering themselves and then being combined by some layout controller and rendered into a layout.

What if you wanted to have the module's external CSS or JS requirements actually defined in the template that really has the dependency? Right off the bat, I'll say I can't think of a real reason you'd want this and not have it taken care of outside of the templating layer, but…

You could have a helper for ensuring the correct CSS is loaded up-stream in the template like:

{{add_css 'widget.css'}}
<div class="widget">
    ...
</div>

And then have the helper defined such that it adds the arguments passed to the layout controller and returns nothing to be rendered.

Then the layout rendering might link those CSS assets in the head.
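Purely to show the mechanics of a helper with a side effect, here is a sketch. The `layout` object and `addCssHelper` are hypothetical, and the helper is a plain function rather than one registered with Handlebars.

```javascript
// Hypothetical layout controller that collects stylesheet requests during rendering.
const layout = {
  css: new Set(),
  head() {
    return [...this.css]
      .map((href) => `<link rel="stylesheet" href="${href}">`)
      .join('\n');
  },
};

// Helper with a side effect: records the dependency and renders nothing.
function addCssHelper(href) {
  layout.css.add(href);
  return '';
}

addCssHelper('widget.css');
addCssHelper('widget.css'); // a second widget on the page asks for the same file
console.log(layout.head());
```

The hidden coupling is visible even at this size: the template output now depends on the order in which modules happen to render.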

You're right. This is almost certainly a bad idea. I mention it because it was something that occurred to me for a second before I recognised that it was an example of probably dangerous usage. When you get the hang of a powerful concept like this it's easy to start seeing every problem that can possibly be solved with it as a good candidate.

As with all powerful programming concepts and libraries, there are many things you can do with Handlebars helpers that are really bad ideas. Hence my note of caution at the start.

Conclusion

I'm quite happy with the extra power Handlebars has given me in this context. But I'm certain that with the extra power comes the inevitable responsibility. It is certainly possible to write crazy and unmaintainable code if you get too creative with helpers without thought.

The examples here are probably not best practice for an MVC web-app context. But here in a site generation script with an already in-memory content model, it allowed me to extend the expressiveness of the system without hard-coding a lot of specific logic for different cases in the model layer.

Handlebars.js has many more features than I have touched on here. Check it out. It may just be what you are looking for if you really like Mustache's philosophy but have a need (and the discipline) to make more expressive helpers.

Fancy New Blog

Paul Banks — Sat, 29 Dec 2012 00:00:00 GMT

Same poor content, new styling (and backend)

I made my blog a few years ago as a way to learn Rails and after a year in which I posted only two new articles of very low value, I got inspired to give it a revamp.

This article is a long-and-yet-brief overview of the changes.

Design

The style just suits my tastes better. When I made the last version, I was much more focussed on learning Rails and the aesthetics of the site became somewhat secondary. I was inspired by some beautiful and elegant sites I've seen recently and this design was the result (for now). I have dreams of adding beautiful imagery and other fancy things to some articles too, but we'll see.

Fonts are from Google's Webfonts rather than TypeKit's free plan because there is much more freedom without fees. I also get to download the fonts I use here for offline use.

Technology

I went back to basics for this. Since I made the last version of this blog, my tastes in tech have changed a bit. I've come to value simplicity and efficiency more and more. Having a full Rails stack, web servers, proxies, database, user authentication, SSL certificates etc. suddenly seems like a really ugly solution for what is essentially a simple, static site only updated by me.

So this site is a static site.

After I finished a lot of the work on this new system described below, I came across an article from a friend of mine about his CMS solution. It turns out he had a lot of the same ideas and he does a great job of expressing his rationale for moving away from Wordpress-like apps for CMS. I link to that now to save you from more clumsy words from me repeating many of the same things less eloquently.

Managing Static Content

So this site is just HTML files served by good old Apache. Nginx probably would be my first choice on a dedicated server but I'm enjoying my current stay on Webfaction and this is the most appropriate configuration here.

Managing a static site by hand is so 1990s, clearly we can do better than that.

There are actually a bunch of great static CMSs out there that would have been great, Jekyll (Github Pages), Statamic (Commercial) and Kirby being the main ones I came across. Typically, I ended up building my own for no terribly good reason other than it being a good excuse to learn something and end up with exactly the features I need.

The site generation is done by a Node.js app. The content is managed through the file system with a simple naming scheme allowing for articles to participate in ordered indices. For example, if the file name begins with a YYYY-MM-DD date format then it will be added to a newest-first by-date index for that directory. More on these indices later.

The content files themselves are then simply Markdown files with yaml-front-matter to add some meta-data to each. Meta-data typically includes the title (so it can be re-used for page title and in listings/RSS) and a template file to use to render that page.

Templating uses Handlebars which is Mustache with a little more flexibility. This extra flexibility becomes really useful in conjunction with the content indices I mentioned before. For example, all the posts on this blog are dated files in the /p/ directory. To generate the listing on the front page of the site I just need to make a static page called /index.md with meta-data assigning a template that does something like:

{{#pages '/p/?type=date&limit=5'}}
    
        {{> blog_post_body.mu }}
    
{{/pages}}

And my custom pages helper can go and find the date index for the /p/ directory and pull out the most recent five articles.

As well as being defined per file in YAML front matter, meta-data defaults for a whole directory can be set in a defaults.yml file (e.g. all blog posts use the same template so it is declared once in /p/defaults.yml) and these defaults are inherited through the content directory hierarchy.
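The inheritance rule can be sketched as a merge from the content root down to the file's own directory, with the file's front matter applied last. Here `defaultsByDir` stands in for already-parsed defaults.yml files, and the template names and `author` key are invented for the example.

```javascript
// Hypothetical parsed defaults.yml contents, keyed by directory path.
const defaultsByDir = {
  '/': { template: 'page.mu', author: 'Paul Banks' },
  '/p/': { template: 'blog_post.mu' },
};

// Merge defaults from the root down to the file's directory,
// then apply the file's own front matter on top.
function resolveMeta(filePath, frontMatter) {
  const parts = filePath.split('/').slice(1, -1); // directory names only
  let dir = '/';
  let meta = { ...defaultsByDir[dir] };
  for (const part of parts) {
    dir += part + '/';
    // Directories without a defaults file contribute nothing.
    meta = { ...meta, ...defaultsByDir[dir] };
  }
  return { ...meta, ...frontMatter };
}

console.log(resolveMeta('/p/2012-12-29-fancy-new-blog.md', { title: 'Fancy New Blog' }));
console.log(resolveMeta('/index.md', { title: 'Home', template: 'home.mu' }));
```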

There is another special content file naming convention for specifying RSS feeds (e.g. /blog.rss.yml -> /blog.rss.xml) where the feed meta-data and an internal "content URL" like the one in the template example above are used to generate an XML RSS feed.

Finally, any files that are not .md or .yml in the content directory are copied directly (symlinked) into the final public document root, so that all static assets like images, JS and CSS can be kept versioned with the rest of the content and the entire document root is managed by the generator script.

Publishing

Content is edited through file system and kept version-controlled in a git repository along with the templates and (currently) the node app that generates the site.

I installed a post-update hook in the git repo on the web-server that automatically checks out HEAD and re-runs the generation script. So I can deploy changes by editing files locally, committing, and then running git push production.

I have toyed with the idea of building a web interface for editing. In fact I did have a working prototype using EpicEditor and a node.js REST API (using restify) for editing in an earlier version of the system. But, having settled on the simplicity of fully version-controlled content and no daemons or security to worry about on the server, I'm sticking with local edit and git-push deployment for now.

I am using Mou to write this right now with instant, correctly-styled preview. It works really well, especially when tied into Sublime Text 2 which I am using to edit the rest of the templates and js files.

Conclusion

I like it. It's been fun to think about and build and has lots of potential for future tinkering.

I may even stick a skeleton version of the site with generation scripts etc. on github although I doubt anyone could have a real desire to use this over one of the more widely used and much better-tested options I listed above.

Now I just need to try to focus on producing some interesting content…

PHP Arrays (Again)

Paul Banks — Thu, 15 Nov 2012 00:00:00 GMT

I have mentioned PHP array inefficiency a few times on this blog.

Discussing it at work today someone linked me to a much more thorough review of the topic that is interesting and readable.

I'm finding myself so much more interested in this level of stuff than the typical PHP programmer which I guess is why I spend my free time playing with other, generally statically typed, languages...

Moving Data and Telling People About It

Paul Banks — Sat, 25 Feb 2012 00:00:00 GMT

I published an article for work on our recent database migration that I was involved with.

It was an unorthodox approach which seemed to work well for this particular dataset and hardware/time constraints, but it was certainly not perfect. Some interesting discussion followed.

I'm also secretly quite proud I got a mention on highscalability.com's weekly roundup. Thanks Todd!

Catch Up

Paul Banks — Wed, 09 Nov 2011 00:00:00 GMT

I've not posted for ages. So here is a summary of a bunch of stuff I've been looking at for fun.

Machine Learning

First up, after finishing the MIT Introduction to Algorithms lectures, I was excited to hear about Stanford's free computer science courses. They are full, taught and (machine) assessed university modules for free! I'm studying Machine Learning and I am really impressed with the quality of the teaching. Thanks Stanford.

There is of course speculation that this is a trial for a new paid remote service. To be honest I feel the quality of the course I've done would be worth paying for if they could find a way to accredit something as a real qualification without proper human assessment.

C++ Experiments

Following on from my experiments with LevelDB, I have played around with creating a C++ gossip implementation based on Cassandra's using ZeroMQ. I spent a lot of time getting a really basic grasp on the intricacies of threading vs event driven style + message passing etc. Ended up with multiple processes on the same machine (different ports) gossiping and effectively sharing cluster state. Didn't get around to implementing the full phi accrual failure detection for machine up/down inference and I'm sure the code would need to be torn apart and re-written for anything resembling real use, but a good learning exercise.

I've now moved on to fiddling about with on-disk data structures. So far I'm mostly just learning. I've read through the specs for SQLite's db file and some articles on CouchDB's Copy-on-write B-tree (not to mention LevelDB/Cassandra's LSM trees). I've also read Acunu's paper on Stratified B-Trees, which is all really interesting stuff. Not quite sure what I want to implement now but I may start with trying to get a basic block and free-list allocator working. Just the experience of actually working with C++ and "real" algorithms is fascinating for me, a lowly PHP developer.

In summary then, I'm still doing loads of geeky computer stuff, just forgetting to write about any of it.

LevelDB Fun

Paul Banks — Sat, 17 Sep 2011 00:00:00 GMT

Google recently open-sourced LevelDB which is "a fast key-value storage library". I've used it as an excuse to play about in C++.

There is nothing new or exciting to report to the tech world here - just that I've enjoyed playing about in a language I've not worked much with in the past.

So far I have hooked up libevent to LevelDB and made my own little Key-Value database server that can accept multiple clients.

I've also written a C++ client library to talk to it and made up my own ASCII-based data transfer format.

None of this is useful to anyone other than me - it's great to actually play around with a language like this and to get a feel for it. Much more productive than code tutorials or algorithm exercises.

More Efficient PHP Arrays

Paul Banks — Wed, 24 Aug 2011 00:00:00 GMT

One of my first posts here was about how surprisingly inefficient PHP arrays can get. Today I learned of a solution that is probably a lot better than my PHP string serialisation. It's an extension called intarray.

The extension exposes integer-only arrays as strings to PHP but provides several useful methods for interacting with them such as sort, slice and binary search. This means if you are using PHP arrays to store sets of integers, you will likely see a very large improvement in speed and memory usage using this extension.

I've yet to do any real benchmarking but I thought I'd post this as a follow-up from my original post. I know at least one very large site who has used this extension in production with no issue although I obviously urge anyone to evaluate stability etc of any software themselves before deploying.

PHPUnit's Expensive SetUp

Paul Banks — Tue, 09 Aug 2011 00:00:00 GMT

I've been working a lot with PHPUnit 3.5 recently. It's good in many ways but it is not fast. That's understandable perhaps given the feature set but there is one apparently obvious oversight which totally ruins the experience.

The problem I'm talking about I've reported as a possible bug and yet it has gotten zero attention in over two months. I'll describe it again here.

The Problem

PHPUnit has a whole multitude of ways to construct a test suite and pick which test to run. Using the command line runner, you can specify specific test case files or dirs and you can use filter and group options to further restrict.

The problem is that, whatever you pass as filter or group arguments, all setUp() and setUpBeforeClass() methods in all test cases loaded will be run. That's because filtering is applied after the setup methods are called. I really don't see the rationale behind that decision.

At work we have a large test suite. One part of it is for our database layer and as such has some very expensive setup routines which set up an entire test environment in our test db. Even when you limit the runner to a specific test case, that may mean this very expensive setup operation has to run for every test in the file - even when you are just trying to work with a single test method.

But we shouldn't have to fiddle about with specifying specific test cases. The filter and group options are powerful and (should be) very useful for cherry picking from a suite. This seemingly obvious error totally ruins them and makes working with big test suites decidedly awkward.

Even more confusing is the fact that no-one else I've seen online seems to think this is a problem. I've found no other mention of the behaviour and zero interest in my ticket. Did I miss something here? Is there an obvious reason that setup should be run all the time even when filtering tests? All my colleagues and other PHP developers I've mentioned this to personally seem to agree it is very odd behaviour. I'd have expected many people to be using PHPUnit with large suites. Does no one else wonder why running a single simple test can take minutes?

I hope I can update this post when something changes, but I've not been encouraged by the response to my ticket so far.

Opera's fixed position problem

Paul Banks — Tue, 17 May 2011 00:00:00 GMT

Opera has a bug with node.offsetTop when the node has a fixed position ancestor. That has been known about for a while. I didn't think it would take me three separate days of pain to get a handle on it.

It turns out that it's quite a lot more complex than the page linked above suggests. At work we have an old utility library called Ruler.js which does a lot of things relating to measuring element positions relative to all sorts of things. While trying to fix a JS positioning issue caused by this Opera bug I thought I could simply correct it by compensating for Opera's extra scroll pixels for fixed position nodes in the offset hierarchy.

The problem is that Opera isn't even consistently wrong. It only breaks for elements with display: inline or inline-block (as far as I can tell) and only if they are positioned relatively (either explicitly or implicitly).

What is more, if there is any other element in the offset hierarchy between the element you are measuring and the fixed position one, then the results change. In some cases an explicit position:relative fixes the behaviour completely.

A totally non-exhaustive demo shows how odd some results can be. Note that this doesn't really demo the extent of the problems when trying to walk up the offset tree and correct for scroll position etc which was what made the real diagnostics so much more complex.

Here is the output in Safari 5 and Opera 11.10 side by side:

So there turned out to be no sensible way to even detect in JS whether the position had been mangled by Opera. I resorted to adding position:relative to a parent element, where it had no other effect, to resolve this case. That also means I'll probably run into this again in the future so I thought I'd document it here!

Trying to make JS go OO

Paul Banks — Thu, 12 May 2011 00:00:00 GMT

At work, we use a version of Base.js by Dean Edwards to standardise object inheritance and make our JavaScript somewhat more Object Oriented. Today I came across a quirk.

Now the problem I found isn't with Base.js really - it's an inherent feature of JS's prototyping and object model, however it was made more confusing by Base.js apparently giving you 'class' like definition of objects. After discussing some unintuitive behaviour with my colleagues it became clear that this is basic JavaScript behaviour despite being somewhat confusing at first.

I'm sure this issue has been brought up in other Base.js discussions before, in fact Dean's latest version may even have solved it - we are using a somewhat legacy version. But the underlying JS issue was interesting to me so I thought I'd write it up for future reference.

Instance properties

If you don't know what Base.js does, read about it on Dean's blog. This basic example shows the creation of a 'Class' with a property which behaves as you would expect:

::javascript::
var TestClass = Base.extend({
    prop: null,
});
var a = new TestClass, b = new TestClass;
a.prop = 'A';
console.log(a.prop, b.prop); // Log: A null

But what if your property was an array?

::javascript::
var TestClass = Base.extend({
    prop: [],
});
var a = new TestClass, b = new TestClass;
a.prop.push('A');
b.prop.push('B');
console.log(a.prop, b.prop); // Log: ["A", "B"] ["A", "B"]

Wait, what happened there?

Well if you think about what Base.js is doing, it is kind of obvious. JavaScript variables and properties never hold objects directly, only references to them. That is, any value that is not a primitive type is accessed through a reference.

Consider the non-Base.js equivalent of the above (roughly the same as what Base.js is doing under the hood):

::javascript::
var TestClass = function(){};
TestClass.prototype.prop = []; // Assigning a pointer to this specific empty array
var a = new TestClass, b = new TestClass;
a.prop.push('A');
b.prop.push('B');
console.log(a.prop, b.prop); // Log: ["A", "B"] ["A", "B"]

Here it is somewhat more obvious what happens. Each instance 'inherits' the prop property from its prototype, but that isn't a fresh empty array: it is a pointer to the specific array the prototype was initialised with.
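One way to convince yourself of this is to check identity and ownership directly. The following is a minimal sketch in plain JavaScript using the same TestClass shape as above; the `label` property is a hypothetical addition, included only for contrast. It also suggests why the earlier string example behaved as expected: assignment creates a new property on the instance, whereas push() mutates the one array that lives on the prototype.

```javascript
var TestClass = function(){};
TestClass.prototype.prop = [];    // one array, owned by the prototype
TestClass.prototype.label = null; // hypothetical primitive property, for contrast

var a = new TestClass, b = new TestClass;

// Both instances resolve prop through the prototype chain to the same array
var sharedBefore = (a.prop === b.prop);    // true
var ownBefore = a.hasOwnProperty('prop');  // false: it lives on the prototype

// Assignment never touches the prototype: it creates an own property on a
a.label = 'A';
var labelOnB = b.label;                    // still null

// Likewise, assigning a fresh array shadows the shared one for a only
a.prop = [];
a.prop.push('A');
var sharedAfter = (a.prop === b.prop);     // false
var bLength = b.prop.length;               // 0: b still sees the untouched prototype array
```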

So the solution is to do this initial assignment in the constructor:

::javascript::
var TestClass = Base.extend({
    prop: [],
    constructor: function(){
        this.prop = []; // this is now this instance, creating a new array just for this instance
    }
});
var a = new TestClass, b = new TestClass;
a.prop.push('A');
b.prop.push('B');
console.log(a.prop, b.prop); // Log: ["A"] ["B"]

Or the equivalent non-base code:

::javascript::
var TestClass = function(){
    this.prop = [];
};
var a = new TestClass, b = new TestClass;
a.prop.push('A');
b.prop.push('B');
console.log(a.prop, b.prop); // Log: ["A"] ["B"]

Back to basics

This is really very basic JavaScript but pseudo-OO frameworks like Base.js (which are great in many ways) can make this behaviour seem even more unintuitive. My take-away is that understanding JavaScript properly is as important as ever despite the great abstractions and frameworks that let us ignore many of the details of how it works most of the time.

Learning to program properly

Paul Banks — Mon, 21 Feb 2011 00:00:00 GMT

Being inspired by MIT's introduction to algorithms I've decided to put some of my newly learnt stuff into practice. And there is not a lot of point in implementing this stuff in a language like PHP or JS.

So I'm re-learning C. I learnt some very basics as part of my degree and have worked with Objective-C a fair bit so it's not completely alien. It's taken me a lot longer than it was suggested it should, but I've finally got a working implementation of skip lists.

Getting my head back around pointer arithmetic and memory management is a good exercise.

I feel like despite all the cool new languages around, most real infrastructure and interesting technology is still written in real languages like C.

jQuery 1.5 Beats Monster Callbacks Into Shape

Paul Banks — Tue, 01 Feb 2011 00:00:00 GMT

This is a shameless re-blog of Eric Hynds' article on jQuery deferreds. It's a great read.

jQuery 1.5 was out yesterday and includes several changes as one might expect. Deferreds are a new concept for me although reading Eric's great article above reveals a powerful and elegant new paradigm for handling callbacks in jQuery.

Essentially jQuery $.ajax functions (and most other functions with observable results) now return a deferred object which contains a promise. You can then hook callbacks to the success or failure of that promise and they will all be triggered when the promise is fulfilled. That means you can manage multiple bits of code that depend on an AJAX fetch separately and if you hook up a callback to the request after it has completed, it will be fired immediately.
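To make the "fires immediately if attached late" behaviour concrete, here is a small self-contained sketch. `MiniDeferred` is a hypothetical, stripped-down illustration written for this post's purposes; it borrows the method names done, fail, resolve and reject from jQuery's API but is not jQuery's implementation and omits most of what the real thing does.

```javascript
// MiniDeferred: hypothetical illustration only, not jQuery's code
function MiniDeferred() {
    this.state = 'pending';   // 'pending', 'resolved' or 'rejected'
    this.value = undefined;
    this.doneCallbacks = [];
    this.failCallbacks = [];
}

// Register a success callback; run it straight away if already resolved
MiniDeferred.prototype.done = function (fn) {
    if (this.state === 'resolved') {
        fn(this.value);
    } else if (this.state === 'pending') {
        this.doneCallbacks.push(fn);
    }
    return this; // allow chaining
};

// Register a failure callback; run it straight away if already rejected
MiniDeferred.prototype.fail = function (fn) {
    if (this.state === 'rejected') {
        fn(this.value);
    } else if (this.state === 'pending') {
        this.failCallbacks.push(fn);
    }
    return this;
};

// Internal helper: settle at most once, then notify the matching callbacks
MiniDeferred.prototype.settle = function (state, value, callbacks) {
    if (this.state !== 'pending') { return this; }
    this.state = state;
    this.value = value;
    for (var i = 0; i < callbacks.length; i++) {
        callbacks[i](value);
    }
    return this;
};

MiniDeferred.prototype.resolve = function (value) {
    return this.settle('resolved', value, this.doneCallbacks);
};

MiniDeferred.prototype.reject = function (value) {
    return this.settle('rejected', value, this.failCallbacks);
};

// Usage: independent bits of code hook onto the same pending result
var log = [];
var request = new MiniDeferred();
request.done(function (data) { log.push('render ' + data); });
request.done(function (data) { log.push('cache ' + data); });
request.fail(function (err) { log.push('error ' + err); });

request.resolve('payload'); // pretend the AJAX response has arrived

// Attached after completion, so it runs immediately
request.done(function (data) { log.push('late ' + data); });
// log is now ['render payload', 'cache payload', 'late payload']
```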

Moreover, the API is very clean and simple with good semantic verbs for hooking things together. That makes the concept arguably easier to understand than plain functions being passed callbacks, despite the extra power and decoupling.

There is quite a bit more to it than I've described though and Eric does a great job of explaining how and why you might want to use this powerful technique.

More from MIT: Red-Black Trees are Cool

Paul Banks — Sat, 29 Jan 2011 00:00:00 GMT

I mentioned a while back that I had found the lecture videos and notes for MIT's Introduction to algorithms course on Peter Krumins' blog. I'm still watching them, and they keep getting better.

I don't have anything particularly sophisticated to say about them other than being really impressed by Red-Black trees.

I've found the lectures have not only taught me a lot of things my formal education lacked regarding algorithms, but they have also helped change the way I view problems and enhanced my analytical skills, even without the problem sets, recitation, and "quizzes". You can probably see that from my over-analysis of a simple JS algorithm in my last post.

If you are interested in programming and you didn't learn this stuff at Uni (or even if you did), I'd highly recommend the lectures once again.

Quick benchmarks with jsFiddle

Paul Banks — Sat, 29 Jan 2011 00:00:00 GMT

At work this week a colleague asked if anyone could think of an optimisation for extracting a rectangular subset of pixel data from an HTML 5 CanvasPixelArray. I tried a few things with jsFiddle.

My main idea was that rather than iterating through every pixel and comparing coordinates (O(n) running time), you could loop through rows of pixels and remove just the selected ones with Array.slice().

I put together a quick test case with a simplified integer array and tried this. Turns out that although there are fewer iterations, using JS array slice() and concat() is much, much slower, probably due to the multiple memory copies needed to satisfy them.

You can see and play with my test case on jsFiddle.

Note that you can make significant savings over the completely naive O(n) case by selecting loop boundaries such that only the required rows are iterated and then only the required x values are extracted with a nested loop. The running time of this case becomes something like O(k), where k is the number of pixels within the required selection, which is always <= n but probably significantly smaller in most use-cases.
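As a rough sketch of that bounded nested loop (not the code from the jsFiddle test case), the function below works on a simplified flat array with one integer per pixel in row-major order. The name `extractRect` and its parameters are assumptions made for illustration; real canvas pixel data stores four values (RGBA) per pixel, so the index arithmetic would need scaling accordingly.

```javascript
// Copy the w-by-h rectangle whose top-left corner is (x, y) out of a flat,
// row-major array that has `width` values per row (one value per pixel here).
function extractRect(data, width, x, y, w, h) {
    var out = new Array(w * h);
    var k = 0;
    // Only the required rows are visited...
    for (var row = y; row < y + h; row++) {
        var rowStart = row * width;
        // ...and within each row only the required columns
        for (var col = x; col < x + w; col++) {
            out[k++] = data[rowStart + col];
        }
    }
    return out;
}

// A 4x3 grid numbered 0..11, row by row
var grid = [0, 1, 2, 3,
            4, 5, 6, 7,
            8, 9, 10, 11];
var sub = extractRect(grid, 4, 1, 1, 2, 2); // [5, 6, 9, 10]
```

The loop body runs exactly w * h times, which is the k in the O(k) estimate above.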

In algorithmic terms this is not a particularly surprising or ground-breaking result. My colleague has probably already found a better solution anyway. My take away was: jsFiddle is great for quick benchmarking and prototyping of solutions.

It makes it trivial to construct test cases and share and develop them with others. I did have one glitch triggered by my JS code getting too long for the textarea and causing the UI to 'scroll' but with no way to get it back. But even then a cut, refresh and paste got me back on my feet. And given its 'alpha' status this is a relatively small complaint.

Learning Common Lisp

Paul Banks — Sun, 02 Jan 2011 00:00:00 GMT

I love to learn new programming languages. Common Lisp is a great language to learn just to broaden one's horizons.

It may not be particularly cool or popular right now and you may find the syntax ugly but there are so many ideas that really don't come up in other languages. You'll soon appreciate why the syntax is like it is and why that is so powerful.

Especially if you are from a fairly rigid OO background, the emphasis on a more functional approach and the concepts of treating code as data are fascinating and may offer a new way to look at problems. But Common Lisp isn't truly functional and has a powerful object system too.

You do almost inevitably have to get over using Emacs as an editor but don't let that stop you from just learning and playing. Just the theory is enough to make me think in whole new ways about problems.

There are many resources at different levels but by far the best I've found is Practical Common Lisp which has the added bonus of being available free online!

MacBook Pro: a cooling tip to ignore

Paul Banks — Sun, 02 Jan 2011 00:00:00 GMT

My main machine these days is still my trusty four-year-old MacBook Pro. I've used it with the battery removed for a while to reduce heat and fan noise when in desktop use. Turns out to be a terrible idea.

It may not be a surprise to you but after spending hours last week trying to work out why my machine was so much slower than my colleagues', I stumbled across the fact that MacBook Pros throttle their CPU to 1GHz when the battery is removed. This seems extraordinary. Turns out the 90W power supply isn't deemed enough to run at full speed so with the battery unavailable to provide burst power, you get crippled CPU performance.

I originally removed the battery as it appeared to aid cooling when my machine was in desktop mode. Now it is hard to tell if it actually was a thermal benefit or whether the cooler temperatures were only because of the crippled performance.

Either way here's a tip: don't work without a battery on an Apple notebook unless severely reduced performance is enough for you.

Living it LArge

Paul Banks — Sun, 19 Dec 2010 00:00:00 GMT

I've just got back (well technically not back yet) from my first trip to LA to meet my co-workers in person!

I really enjoyed it. The hospitality, food and party were great but actually I also really valued being able to talk work face-to-face with my colleagues for the first time. Subtleties such as facial expression and body language have given me a much clearer idea of what people are like and how they are likely to respond. I'm sure it will make future online discussion significantly easier.

So there isn't much to say here other than I had a great time.

Go Back To Uni at Google

Paul Banks — Sun, 28 Nov 2010 00:00:00 GMT

If you read this blog you'll be aware that I am a geek and love finding new resources to learn more about geeky things online for free. I've lately found two great resources which will keep me interested for a while.

First is a set of lectures for an entire module on Analysis of Algorithms from MIT which is available from Peteris Krumins' blog (the videos were released under CC license). Fascinating and a great resource - made me realise how much I missed learning real subjects from real academics!

Second resource is probably more well-known but new to me. It's Google's Code University which seems to be a great resource listing many courses and materials on a wide range of computer science subjects.

It's like Christmas come early!

PHP 'Mixins' Coming Soon!

Paul Banks — Mon, 22 Nov 2010 00:00:00 GMT

I posted a while back about PHP's lack of decent support for multiple inheritance, and concluded that Mixin-like behaviour just wasn't natural to PHP as it stands. PHP 5.4 looks like it will change that with the addition of traits.

Simas Toleikis introduces traits in a great blog post. I'm excited: it looks like it could solve many of the shortcomings I bemoaned in PHP.

Since I have no first hand experience, I'll regurgitate this example from Simas' post above to whet your appetite.

::php::
trait Singleton {
    public static function getInstance() { ... }
}

class A {
    use Singleton;
    // ...
}

class B extends ArrayObject {
    use Singleton;
    // ...
}

// Singleton method is now available for both classes
A::getInstance();
B::getInstance();

This is a major win for PHP.

Facebook ditches Cassandra for HBase

Paul Banks — Sat, 20 Nov 2010 00:00:00 GMT

Cassandra is an open source distributed database implementation that started life at Facebook as a solution to their message inbox search and storage. Facebook announced the next generation of messaging this week, and it's powered by HBase.

Highscalability.com have a good article about the announcement and technicalities generally. I was particularly interested having recently read Bradford's comparison of the two.

What I take from this is a firm reminder that different NoSQL solutions have different engineering trade-offs, and that picking the right tool for any one application is more important than brand loyalty.

CDE: very cool

Paul Banks — Mon, 15 Nov 2010 00:00:00 GMT

Just got pointed to a very cool project at Stanford to allow Linux command executions to be trivially packaged up with all dependencies on one machine and executed on another with no install/dependency issues.

Not tried it yet but it looks like a great and thoroughly useful little tool!

When Buzzwords Can't Save You

Paul Banks — Mon, 15 Nov 2010 00:00:00 GMT

Ooops! Github was down yesterday for several hours and I was expecting one of those "some complex as-yet-unidentified quirk of replication caused our sharded NoSQL cluster to drop every record with exactly 13 words in the title" type incident reports. Turns out a developer just deleted their production DB accidentally.

Fair play to them for the honest post though. This sort of thing does happen to everyone to a lesser or greater extent and I feel for the guy responsible. It does go to show though that Continuous Integration, Test Driven Development, Rails and all the other associated buzzwords don't always save you from the inevitable!

Lesson to learn: don't allow write access to production databases from dev environments. I'd have thought that with all their infrastructure and expertise, that should never have happened.

Google: Infrastructure Challenges Lecture

Paul Banks — Sun, 14 Nov 2010 00:00:00 GMT

Found a great link for a lecture by Jeffrey Dean from Google on the challenges of scaling their search product.

Some fascinating details including their byte encoding scheme for their index and many other wonderful bits of info!

PHP Gotcha: Strings are Arrays Too

Paul Banks — Wed, 10 Nov 2010 00:00:00 GMT

Actually the title is misleading - but they can at times seem like arrays. This just came up at work and although it is one of those things which seems obvious afterwards, it highlights a potentially dangerous and error-prone design pattern.

Basically we had a function that looked essentially like this:

::php::
function error_prone($options) {
    if ( ! isset($options['required_key'])) {
        // throw error
    }

    // Manipulate $options['keys'] and return a result
}

At first glance this looks fair enough - while not completely robust, if a badly formed array (without required key) or a non-array is passed it should throw an error right?

Wrong, because PHP doubles up its square bracket syntax to also provide character access on strings.

If you don't know what I mean, try this:

::php::
$string = 'Hello World!';
echo $string[0]; // prints: H
echo $string['1']; // prints: e

Now the really unintuitive bit is highlighted on the last line. When you are specifying your character offset, PHP, in all its type-mangling wisdom, allows any variable type and casts it to an int.

What does this mean for our function above? Well, in some bad cases, $options was being passed a string instead of an array. This was due to another error - but at first glance it seems our error checking should have caught that. What actually happens when a string is passed is:

$options['required_key'] returns the first character of the $options string, since (int) 'required_key' is 0
isset($options['required_key']) therefore returns true!
The rest of the function then mangles together some terrible return value based on characters in the string (mostly the first one) rather than actual options.
The final result is baffling and the actual source of the error is obfuscated.

Solution

It really isn't hard to fix: either type hint the function declaration:

::php::
function not_so_error_prone(array $options)

And handle the errors/exceptions nicely, or actually explicitly test arrayness with is_array().

This is a silly mistake but one that can be easily overlooked when reviewing code unless you pay close attention.

CoffeeScript: another language to not learn

Paul Banks — Tue, 09 Nov 2010 00:00:00 GMT

My attention was drawn to CoffeeScript recently. I really like it. I think the author has made some changes that, if they were part of JavaScript proper, would make it somewhat nicer to work with. But it is completely pointless.

I mean no disrespect - if it solves a problem for the author then I can't argue. All I can say is that I completely missed the point. It isn't even fundamentally changing the way JavaScript works: there is no original programming paradigm and there are no domain-specific problem solving tools.

What it's ended up as is some syntax sugar taken variously from several popular languages around at the moment. It doesn't technically improve your JavaScript app at all and it adds a learning curve and complicates development and deployment by adding a compile phase.

CoffeeScript is just for fun

Fair enough. I can't slam anyone for doing something for fun. And no-one has claimed this to be the new best thing or a must-use tool so I'm not going to start my flamethrower without real provocation!

To the author: I think the syntax is nice, the features I can see being nice to have and you've done a really excellent job on the site/docs. In general though, this seems like a huge amount of effort for basically no gain.

Yet Another Facebook MySQL Tech Talk Re-post

Paul Banks — Tue, 09 Nov 2010 00:00:00 GMT

You've probably read it already but Facebook released a MySQL Tech Talk with loads of juicy database porn for those of us fascinated by web scalability.

I've not got a lot to add other than it's pretty interesting.

PHP and the XOR swap trick

Paul Banks — Thu, 30 Sep 2010 00:00:00 GMT

An exercise most programmers are shown when first being introduced to bit arithmetic is how to swap the values of variables without using a third. The answer is the XOR (eXclusive OR) trick.

For some reason this came to mind this morning and I wondered - what happens in PHP if you try to XOR non-integer data types? I could probably have looked it up but a 2 minute script showed me the answer: PHP casts the operands to int before performing the operation. (The one exception is when both operands are strings, in which case PHP XORs them byte by byte and returns a string.) So no, you can't neatly swap two arrays or objects without a temporary variable, and strings only survive the trick when they are the same length. Oh well.

For those who've not come across it before. Here is the magic I'm talking about. If you don't believe it works at first, grab a pencil and paper and work through the binary maths for yourself...

::php::
$a = 1;
$b = 2;

echo "a: ".$a.", b: ".$b."\n";

// Do the swap
$a ^= $b;
$b ^= $a;
$a ^= $b;

// All done
echo "a: ".$a.", b: ".$b."\n";

Result:

a: 1, b: 2
a: 2, b: 1
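If you'd rather not reach for the pencil, here is the same swap traced step by step. This sketch is in JavaScript rather than PHP (the ^= operator behaves the same way for small non-negative integers like these), and the `trace` array is only there to record the intermediate values.

```javascript
var a = 1; // binary 01
var b = 2; // binary 10
var trace = [];

a ^= b; // a = 01 ^ 10 = 11 (3): a now holds "original a XOR original b"
trace.push([a, b]);

b ^= a; // b = 10 ^ 11 = 01 (1): XORing out original b leaves original a
trace.push([a, b]);

a ^= b; // a = 11 ^ 01 = 10 (2): XORing out original a leaves original b
trace.push([a, b]);

// trace is [[3, 2], [3, 1], [2, 1]]
```

The trick relies on XOR being its own inverse: applying the same value twice cancels it out.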

A Change of Scenery

Paul Banks — Fri, 24 Sep 2010 00:00:00 GMT

My wife Chloe and I are soon going to be moving to pastures new. We have decided to escape city life and see how we fare in a more rural setting in Devon. That has meant me moving on to work for a different company too.

It was with some reluctance that I said goodbye to Ents24.com two weeks ago. It has been a great couple of years for me working there and I got on really well with all the staff.

My new challenge involves working remotely for the US-based art community deviantART. Working with a bigger team, remotely and across timezones are all new challenges for me but after two weeks I seem to be getting into the swing of things surprisingly easily.

Technically it is very different with many more visitors and a much more write-intensive and interactive emphasis. Lots more servers! The code base is also bigger, changes quicker and is no better documented than previous ones I've worked with. This is a little daunting at first although I'm already beginning to feel like I understand broadly how a lot of things work.

+1 to Textmate

Paul Banks — Thu, 23 Sep 2010 00:00:00 GMT

I posted a little while back about my trialling of Komodo edit as an IDE. It's good, but just a couple of things bugged me enough that I thought I'd see if I could live with the major change that is switching to TextMate.

Now to be completely fair, most of my gripes with Komodo edit were minor and were more to do with what I'm used to than it being a bad product. Since starting a new job (more to come soon) I discovered a few really useful tools had been written for our codebase for the latest beta of Komodo Edit (v6) so I switched to that.

The thing which made me try TextMate instead was actually that Komodo Edit 6 kept eating my CPU time for no real reason. After using it for a bit, even when I was doing nothing and there was no indication of background activity, its CPU usage would sit at around 30% which over a while slowed my Mac right down and got it really hot (and loud). If I quit and restarted it would be fine again for a while but eventually would come back up.

This isn't all that surprising especially considering it is a beta version but it was annoying enough to make me consider alternatives again. Especially since TextMate is also favoured by other devs at my new place of work.

Actually I like it. It is less limited than I first thought and writing bundles is powerful and could allow me to reproduce many of the simpler features I miss from other editors. Full code completion isn't there but actually it does do basic PHP and same-file completion which covers a relatively large part of my needs. And I really love the speed and OSXiness.

So the Jury's still out. I'll see how I get on.

Xeround: no to NoSQL

Paul Banks — Wed, 15 Sep 2010 00:00:00 GMT

Just a quick note to point out an interesting development in the 'distributed database' field. Xeround are developing a MySQL storage engine that has all the elasticity, redundancy and scalability of some of the popular NoSQL solutions with a 100% compatible MySQL interface.

This could be really interesting if it works as well as advertised since any MySQL app can migrate to it with no code change.

I do feel though that their technical whitepaper reads more like a marketing brochure than an academic discussion of the technology - there are no mentions of any downsides or tradeoffs in the design. Specifically, there is no mention of how much slower distributed joins and aggregation are than normal MySQL, just allusions to 'low latency'.

Essentially they have written a front end that does all the complex stuff your application would have to do with another NoSQL solution and then put a MySQL interface on it. If it works and really is fast then it is a very compelling solution. In the absence of benchmarks or real-world discussions though, I'm somewhat sceptical about whether this will really work well for complex queries on actually big data sets. I unfortunately don't have time (or data) to try it for myself but I will keep an eye out...

Why NoSQL is great and geek fights aren't

Paul Banks — Sun, 12 Sep 2010 00:00:00 GMT

I've been reading a lot about the recent stream of RDBMS alternatives that are getting a lot of attention at the moment. I find the subject fascinating and many of the solutions and technologies coming out make me want to go and summon hundreds of EC2 instances just to distribute some random data over.

My 30 second NoSQL overview: relational databases can become unwieldy once you need to scale beyond the capacity of a single server. Some applications can easily take advantage of multiple read slaves but with enough write traffic things need to get more exotic. Enter sharding. If you want to read more about sharding, go ahead. Suffice it to say that once your data is split over separate physical machines, a substantial portion of what makes relational databases and SQL great goes out the window. No more joins, aggregate queries etc.

Given that the relational part is now severely impaired, people like Google and Amazon have come up with massively distributed systems that are effectively just glorified key-value stores or hash tables. They purposely don't support these more exotic relational features but they can handle petabytes of data and millions of users. The Google BigTable and Amazon Dynamo papers are a great read.

NoSQL encompasses these sorts of solutions as well as document-oriented databases like MongoDB and CouchDB and stricter KV stores like Redis or Project Voldemort.

So is SQL dead?

As with so many things in our industry, NoSQL has caused a lot of hype and a lot of unnecessary angst. I was prompted to write this by a recent article on readwriteweb which links to a video someone has made. The video itself is somewhat amusing and makes some good points although could have done so much more succinctly and with less profanity in my opinion.

It would be great to see a little more sensible discussion about real-world use cases for new technology and far fewer turf wars. NoSQL is really interesting and, though it's tempting to assume new things are a silver bullet for all the current problems in a domain, we all know this is not the reality. As engineers we should take a great interest in new technology but ultimately we should pick the right tools for the job. For now SQL is probably the best overall tool for the majority of web applications.

NoSQL solutions can solve some interesting problems; however, these will probably be limited to big-data, big-traffic sites. 99.99% of web apps written are never going to get near to having those sorts of problems, and abandoning a mature, proven technology like SQL should not be taken lightly.

So I'm going to continue to enjoy learning about new ways to do things, use them where they actually help, and steer clear of pointless time-wasting arguments.

PHP Gotcha: are MD5 hashes numeric?

Paul Banks — Tue, 07 Sep 2010 00:00:00 GMT

A bizarre bug just came up at work: a query in a cron script failed last night for no apparent reason even though thousands of queries are run by the same bit of code every day. The reason: an MD5 hash being incorrectly identified as a number in exponential form.

Firstly I guess I should point out that yes MD5 hashes are numeric, however in PHP md5() returns a string containing the hex digest. For this reason MD5 hashes are generally considered and used as strings in PHP.

We have a Database API at work that provides automatic escaping of values based on their type. It uses PHP's is_numeric() to determine if the value should be left unquoted as an integer or float.

One thing that isn't likely to come up much (but typically just did) is that is_numeric() also recognises numbers in exponential form, such as 1234e34. We had an issue where we were inserting an MD5 hash (a string) into a varchar field, but got an error from MySQL:

Illegal double '937e3019763158166689073439699767' value found during parsing

I took a look at this for a bit and then realised that the value was unquoted and contained only digits and 'e'.

We've now put in a little more logic so that any string of exactly 32 chars containing only hex digits (hint: ctype_xdigit()) is treated as a string!
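The shape of that extra check can be sketched as follows. This is written in JavaScript rather than PHP, and every name in it is hypothetical: `looksNumeric` is only a rough regex stand-in for is_numeric() (not an exact port), and `shouldQuote` stands in for whatever the real Database API does internally.

```javascript
// Rough stand-in for PHP's is_numeric(): optional sign, digits, optional
// decimal part, optional exponent. An approximation, not an exact port.
function looksNumeric(value) {
    return /^[+-]?(\d+(\.\d*)?|\.\d+)(e[+-]?\d+)?$/i.test(value);
}

// Exactly 32 hex digits, i.e. the shape of an MD5 hex digest
function looksLikeMd5(value) {
    return /^[0-9a-f]{32}$/i.test(value);
}

// Hypothetical escaping decision: true means "quote it as a string"
function shouldQuote(value) {
    if (looksLikeMd5(value)) {
        return true; // treat anything MD5-shaped as a string, even if numeric-looking
    }
    return !looksNumeric(value);
}

var hash = '937e3019763158166689073439699767';
var hashLooksNumeric = looksNumeric(hash); // true: only digits and a single 'e'
var hashIsQuoted = shouldQuote(hash);      // true: the MD5 check wins
var exponentIsQuoted = shouldQuote('1234e34'); // false: still handled as a number
```

The MD5 check has to run before the numeric check, otherwise the hash above would never reach it.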

To PHP or not to PHP?

Paul Banks — Mon, 30 Aug 2010 00:00:00 GMT

I've recently run up against the limitations of PHP's OO features in many different projects. While there are some potential solutions, I'm in two minds about whether they are a good idea or not.

For example, languages like Ruby and JavaScript allow 'Monkey Patching' or modifying classes/object's methods at run time. While some complain that this can cause terrible code and bugs that are very hard to track down, it allows things like behaviors (i.e. mixins for multiple inheritance) which can be a very powerful way of keeping code modular.

Also, AOP is a powerful tool for reducing code coupling and increasing code reuse. In JS or Ruby you could implement this easily by altering methods at run-time, in Java you can do it by altering methods at compile-time. In PHP you're stuck unless you add an additional 'compile' step into your workflow, negating most of the benefits of using an interpreted language.
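To give a flavour of the run-time approach in JavaScript, here is a minimal sketch of 'before' advice added by replacing a method on a prototype. The helper `addBefore` and the `Account` class are hypothetical names made up for this example, not part of any library.

```javascript
// addBefore (hypothetical helper): replace obj[name] with a wrapper that
// runs `advice` first and then calls the original method.
function addBefore(obj, name, advice) {
    var original = obj[name];
    obj[name] = function () {
        advice.call(this, name, arguments);
        return original.apply(this, arguments); // `this` is still the instance
    };
}

var calls = [];

function Account(balance) {
    this.balance = balance;
}
Account.prototype.withdraw = function (amount) {
    this.balance -= amount;
    return this.balance;
};

// Patch the prototype at run time: every instance gains the logging,
// and the Account code itself knows nothing about it
addBefore(Account.prototype, 'withdraw', function (name, args) {
    calls.push(name + '(' + args[0] + ')');
});

var account = new Account(100);
var remaining = account.withdraw(30); // 70
// calls is now ['withdraw(30)']
```

Because the wrapper forwards `this` and the arguments untouched, the original method behaves exactly as before.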

In PHP, the closest we get (natively) is to use magic methods like __call() to intercept object method calls and do something else instead. There are three major problems with this:

You have to fudge the scope about - there is no non-hacky way to add a method to a class from outside it and be able to use $this and other object properties as expected.
__call() is very slow even compared to standard PHP function calls. This can be a real issue if you are using it extensively and may have thousands of calls in a single page load.
It's not native - you end up having to add code to all your objects, artificially alter the inheritance tree, or wrap all your objects in proxy objects or similar to get this to work.
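
To make the __call() approach concrete, here is a minimal sketch of a behavior-style mixin. The Model and Timestamps classes and the addBehavior() method are invented for illustration. Note how the target object has to be passed in explicitly, because $this inside the behavior refers to the behavior itself, and how every object that wants behaviors needs this plumbing:

```php
<?php
// Illustrative only: Model, Timestamps and addBehavior() are made-up names.

class Timestamps
{
    // The "mixed in" method receives the target object explicitly,
    // because $this here refers to the Timestamps instance instead.
    public function touch($target, $when)
    {
        $target->updatedAt = $when;
        return $target->updatedAt;
    }
}

class Model
{
    public $updatedAt = null;
    private $behaviors = array();

    public function addBehavior($behavior)
    {
        $this->behaviors[] = $behavior;
    }

    // PHP calls this whenever an undefined or inaccessible method is invoked.
    public function __call($name, $args)
    {
        foreach ($this->behaviors as $behavior) {
            if (method_exists($behavior, $name)) {
                array_unshift($args, $this); // fudge the scope: pass ourselves in
                return call_user_func_array(array($behavior, $name), $args);
            }
        }
        throw new BadMethodCallException("Unknown method: $name");
    }
}

$model = new Model();
$model->addBehavior(new Timestamps());
$model->touch('2010-08-30'); // routed through __call() to Timestamps::touch()
```

The behavior can only reach public properties of the model, which is exactly the scope problem described in the first point.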

(Enter Runkit)

Runkit is a PECL extension that adds a few interesting functions to PHP that allow methods to be added, removed or copied between classes dynamically at runtime. The solution to all the problems above? I'm not so sure.

The (new) problems:

It's non-standard. Goodbye to code portability. This will never be mainstream and so neither will all the work you put into classes/libraries.
It's experimental. It seems not enough interest has been shown in runkit and so, despite it being around for a while (at least 5 years!), it is still not recommended for production applications.
I can find no information about performance (and haven't had time to benchmark myself since it would mean recompiling PHP on my machine). I'd be very surprised if it didn't reduce the effectiveness of op-code caching substantially.

Pretty major downsides, but my question goes beyond this. I'm still really torn about whether it is even right to want this in PHP. If I really want Ruby-like syntax and mixins, why don't I just write in Ruby?

Every language has its strengths and weaknesses. I wonder if spending a lot of time and effort trying to emulate constructs possible in other languages is just bad PHP programming. Is it a flaw in PHP that it can't support neat mixins (without hacks or ugly code)? I'm not sure.

I've been interested in the decision to drop behavior support in Doctrine 2 because it was too much of a hack and caused nightmare bugs. It was one of the features I most sought to emulate in other similar projects, but I ran into many of the same problems as the Doctrine team.

On the one hand, I'd really like to be able to neatly and efficiently solve problems like multiple inheritance and providing 'magic' interfaces to ORM objects without restricting class inheritance etc. On the other hand, if it feels too much like a hack I end up feeling like I'm just using the wrong tools.

If you read this and have any thoughts, I'd love to hear what you think.

Managing iPhone SDK versions and targeting

Paul Banks — Fri, 20 Aug 2010 00:00:00 GMT

There are many subtleties when deploying apps with the iPhone SDK. One I have spent much time fiddling with in the past and actually got to the bottom of today is the difference between Base SDK and Deployment Target.

I spent an hour or so today jumping through the familiar hoops of getting some updates into our iPhone app at work. The latest updates involve dealing with background state transitions which requires accessing new API methods added in iOS 4.

The SDK docs do describe well how to ensure that API methods exist before calling version specific methods (using respondsToSelector).

The problem came when compiling for distribution. We ended up with an error saying:

undeclared UIApplicationWillEnterForegroundNotification

This was because, in an attempt to make the application as widely usable as possible I'd set the Base SDK setting to iPhone 3.

Clearly this means you can't use iOS 4 specific API calls. The trick is to set the Base SDK to the latest version and then set the iPhone OS Deployment Target option to be the lowest version you wish to support. This way you can (conditionally) use all the latest API calls.

In this configuration, newer APIs are weakly linked, which means that, provided you check APIs exist before accessing them as mentioned above, your app will still run on older OS versions with no compile or runtime errors.

With hindsight it seems obvious but it was one of those things that took a bit of effort to grasp.

The Era of Komodo Edit

Paul Banks — Sat, 14 Aug 2010 00:00:00 GMT

I'm reconfiguring a whole bunch of things on my machine. In the process, I couldn't bear to install Aptana 1.5 again. I've been using it for a year or so since ZDE went over to Eclipse and became even more horrible to use.

Aptana is also Eclipse based, which doesn't fill me with joy, but 1.5 did have excellent PHP support. My love affair with it was not long lived though, as the 2.0 release last year completely decimated everything that had made me move from ZDE by dropping Aptana's PHP plugin for PDT. I've also recently had a complete nightmare helping colleagues get Aptana 1.5 to work on Ubuntu.

I'd really love something lighter. TextMate is sleek, simple and beautiful but just doesn't cut it for me when it comes to code completion - I've come to rely on that for bigger projects and it's just too hard to go back! Netbeans is actually surprisingly good now although it is Java and getting close to being Eclipse-like.

In the end, I've been pretty impressed with Komodo Edit. ~~Still Java I guess but~~ [Not Java so] it feels a lot lighter than Eclipse, supports pretty much all the main features I use constantly in Aptana and generally seems good.

I'll stick it out for a bit and see how it holds up once I start doing serious work.

Getting Social

Paul Banks — Tue, 10 Aug 2010 00:00:00 GMT

I've enabled comments on the blog using DISQUS. Sure I could have written my own into my lovely little rails app but that would be a lot of effort once moderation, spam control, user signup, OpenID integration etc. are included.

DISQUS actually seems to be quite a neat solution for now and will hopefully make this content slightly more interesting once I start getting a bit of traffic through.

I still have a couple of things to add to the blog (including more content) before I make more of an effort to get spidered and promoted at large on the web.

JS on the server?

Paul Banks — Mon, 09 Aug 2010 00:00:00 GMT

Server-side JavaScript is something that I have been dimly aware of for a while now. At first, I thought it was yet another pointless attempt to use technology in ways it wasn't designed to be used, but recently it's been starting to make a little more sense to me.

I won't go into too much detail here because ReadWriteWeb have done it better. Also of interest, Felix Geisendörfer (a former CakePHP contributor) explains why he's more interested in Node.js now.

The most interesting point for me is this: JS is currently getting a lot of attention (read millions of dollars) from many of the biggest players in the industry: Google, Apple, Mozilla and Microsoft. That means it is being optimized and getting faster at a rate that other server side languages can only dream of.

I've heard (but not verified) that some bits of Node.js' C++ runtime have actually been re-written in JS because there was no performance difference!

Now I love JS as a language, it is getting seriously quick, and we are getting closer to having a stable and relatively mature stack to write applications with. It probably isn't going to power the next twitter but this is certainly worth looking into.

When not to use arrays in PHP

Paul Banks — Mon, 02 Aug 2010 00:00:00 GMT

Arrays in PHP are actually pretty inefficient at storing lots of small bits of data.

I think I read that a single integer in an array uses 58 bytes of memory. Now that is not worth thinking about in most PHP applications, but it can matter when your arrays get big.

At work, I've recently been working on a system for highly customisable targeting in email newsletters. Part of the challenge here is writing a newsletter sender that can efficiently send 400,000+ emails, each one potentially targeted at a specific user.

For speed, we ended up pre-calculating which bits of content got rendered for each user to avoid the query overhead in the sending loop. This required storing a two-dimensional array, indexed by user ID, with an array of content IDs applicable to each user. It looked something like this:

::php::
array(
    1 => array(1, 2, 3, 4),
    2 => array(2, 4, 3, 5),
    ...
);

The real array had around 400,000 elements, and I was amazed that it alone took up over 400MB of memory!

I did some benchmarks and found some quite surprising things.

If I changed the array so each user had a comma-separated string instead of an array, it shrank the memory requirements by around 75%.

I suspected though that I would pay for this saving in execution speed - surely it is much slower to have to explode each value, manipulate the array and implode it again than to just access it directly, right?

Actually, no! I can only guess as to why that might be - less RAM to read/write/seek perhaps? It is actually over 60% quicker to use strings in this case!

You probably don't quite believe me so here is a little script to illustrate the point. Feel free to run it yourself.

And the result:

Memory for 2D:        107.81MB
Memory strings:        23.89MB
Time for 2D:        6.47 seconds
Time strings:        2.32 seconds
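
For reference, a minimal sketch of this kind of comparison is below. It is not the script that produced the numbers above: the user count and content IDs are made up, it only measures memory, and the absolute figures will vary a lot between PHP versions.

```php
<?php
// Sketch only: sizes and IDs are invented, and results depend on the PHP version.
$users = 10000;

// Representation 1: an array of content IDs per user.
$before = memory_get_usage();
$nested = array();
for ($i = 1; $i <= $users; $i++) {
    $nested[$i] = array($i, $i + 1, $i + 2, $i + 3);
}
$nestedBytes = memory_get_usage() - $before;
unset($nested);

// Representation 2: a comma-separated string of content IDs per user.
$before = memory_get_usage();
$strings = array();
for ($i = 1; $i <= $users; $i++) {
    $strings[$i] = implode(',', array($i, $i + 1, $i + 2, $i + 3));
}
$stringBytes = memory_get_usage() - $before;

// Reading a user's content IDs back costs one explode() per access.
$ids = array_map('intval', explode(',', $strings[42]));

printf("Nested arrays: %.2fMB\n", $nestedBytes / 1048576);
printf("Strings:       %.2fMB\n", $stringBytes / 1048576);
```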

On the bandwagon

Paul Banks — Sun, 01 Aug 2010 00:00:00 GMT

I'm a PHP developer by profession but this blog is a bit of a departure for me - it's a Ruby on Rails app.

I could have used pretty much any open source blog package out there, but I wanted to have a go at producing something real that worked with rails. There are also a few features I didn't find elsewhere (mostly formatting related).

My verdict: Ruby is a fantastic language. Rails is a great framework, but there is a little too much magic and opinion in it for my taste. I like to understand exactly what is going on at all levels in my apps but rails makes that incredibly hard because of all the monkey patching and the sheer size of the stack.

That's not really much of a criticism though, and I think at least some of that will be much better in rails 3, which has taken on board the merb philosophy of decoupling core components.

I'll certainly keep working with Ruby and probably rails - there is something so elegant about the language that it makes going back to PHP a little disappointing. Who knows, one day I might be completely converted.

Expect me to post my web-related discoveries and opinions here in the future.
