Google's AdSense/DFP PII Privacy Gotcha

Google AdSense (and other advertising products) appear to have turned on a new detection system for violations of their PII policy. Here are a couple of easy steps to fall foul of it without meaning to.

What is a PII (Personally Identifiable Information) Violation?

Google's support document for their privacy policy describes the issue well.

Specifically the line:

In particular, please make sure that pages that show ads by Google do not contain your visitors' usernames, passwords, email addresses or other personally-identifiable information (PII) in their URLs.

Easy right? Just don't be dumb and pass the visitors info in URL.

Here are a couple of ways to fail at that automatically - you may well be doing them yourself right now.

Fail #1: Your Search Results Page

If like every other site on the Internet, you have a search feature, and like every other search engine (copying Google themselves) you present the search results at a URL with the user's query in a GET parameter e.g. search/?q=ponies, then you will probably violate AdSense policy eventually.

All but one of the handful of breach notifications I've come across from Google are due to people searching for an email address. That obviously results in a URL that looks like search/?q=user@exmaple.com and Google's apparently new automated detection of PII violations flags that.

As an aside, it's quite likely that they are not searching for their own email address and so it's not really PII, but who can know? Google will see an email address in URL and call it a breach.

But that's pretty standard behavior for a search box right? We'll see what we can do in a bit.

Fail #2: Users Saving Your Content For "Offline Use"

Less obviously, you may have users who like to "save" pages on your site for "offline" use. Of course if they really are offline when they view it they will not make any ad requests to Google and you'll be fine.

But in at least one case I've come across, a user saved a page on a site which uses Google AdSense to a folder on their machine like /Users/name@example.com/stuff/your-page.htm. They must have loaded this in their browser while still online and the resulting ad calls from the JS embedded in the page when they saved it fire off to Google with ?url=file:///Users/name@example.com/... in the URL, violating their policy.

Anti-Solution

Actually in both of these cases it's a little hard to find a good solution. For the first you could perhaps have a special encoding of email addresses in search JS but that goes against the spirit of the policy - the email info is still there if it is in fact PII (probably not if visitor is searching for it but there you go). Not to mention relying on JS instead of regular GET form submission.

It's a nasty hack and misses many other cases. The second example is one of the many possible reasons you would probably never consider which would not be covered by sticking plasters like that.

Solution

It would seem the most pragmatic solution is to adjust your ad serving logic to scan URL and referrer for email addresses and just opt to not show ads on that page if you find any. Hopefully it's rare enough not to lose you too much revenue and the alternatives are likely more expensive.

One thing to note though: checking for email addresses on server side is probably not enough. It wouldn't have caught that second case above and there are many other cases where it might not work.

So if you rely totally on Google's provided libraries to display your ads, you may need to write a small wrapper to handle this case on the client side. It seems like this should be something Google's JS does automatically or at least something you can opt-in to.

 

LMDB: The Leveldb Killer?

I've been quiet for a while on this blog, busy with many projects, but I just had to comment on my recent discovery of Lightning Memory-Mapped Database (LMDB). It's very impressive, but left me with some questions.

Disclaimer

Let me start out with this full acknowledgement that I have not yet had a chance to compile and test LMDB (although I certainly will). This post is based on my initial response to the literature and discussion I've read, and a quick read through the source code.

I'm also very keen to acknowledge the author Howard Chu as a software giant compared to my own humble experience. I've seen other, clearly inexperienced developers online criticising his code style and I do not mean to do the same here. I certainly hope my intent is clear and my respect for him and this project is understood throughout this discussion. I humbly submit these issues for discussion for the benefit of all.

Understanding the Trade-offs

First up, with my previous statement about humility in mind, the biggest issue I ran up against when reviewing LMDB is partly to do with presentation. The slides and documentation I've read do a good job of explaining the design, but not once in what I've read was there any more than a passing mention of anything resembling a trade-off in the design of LMDB.

My engineering experience tells me that no software, especially when attempting to claim "high performance" comes without some set of assumptions and some trade-offs. So far everything I have read about LMDB has been so positive I'm left with a slight (emphasis important) feel of the "silver bullet" marketing hype I'd expect from commercial database vendors and which I've come to ignore.

Please don't get me wrong, I don't think the material I've reviewed is bad, just seems to lack any real discussion of the downsides - the areas where LMDB might not be the best solution out there.

On a personal note, I've found the apparent attitude towards leveldb and Google engineers a little off-putting too. I respect the authors opinion that LSM tree is a bad design for this purpose but the lack of respect toward it and it's authors that comes across in some presentations seems detrimental to the discussion of the engineering.

So to sum up the slight gripe here: engineers don't buy silver-bullet presentations. A little more clarity on the trade-offs is important to convince us to take the extraordinary benchmark results seriously.

[edit] On reflection the previous statement goes too far - I do take the results seriously - my point was more that they may seem "to good to be true" without a little more clarity on the limitations. [/edit]

My Questions

I have a number of questions that I feel the literature about LMDB doesn't cover adequately. Many of these are things I can and will find out for myself through experimentation but I'd like to make them public so anyone with experience might weigh in on them and further the community understanding.

Most of these are not really phrased as questions, more thoughts I had that literature does not address. Assume I'm asking the author or anyone with insight their thoughts on the issues discussed.

To reiterate, I don't claim to be an expert. Some of my assumptions or understanding that lead the the issues below may be wrong - please correct me. Some of these issues may not be at all important in many use cases too. But I'm interested to understand these areas more so please let me know if you have thoughts or preferably experience with any of this.

Write Amplification

It seems somewhat skimmed over in LMDB literature that the COW B-tree design writes multiple whole pages to disk for every single row update. That means that if you store a counter in each entry then an increment operation (i.e, changing 1 or 2 bits) will result in some number of pages (each 4kb by default) of DB written to disk. I've not worked out the branching factor given page size for a certain average record size but I guess in realistic large DBs that could be in the order of 3-10 4k pages written for a single bit change in the data.

All that is said is that "it's sequential IO so it's fast". I understand that but I'd like to understand more of the qualifiers. For leveldb in synchronous mode you only need to wait for the WAL to have the single update record appended. Writing 10s of bytes vs 10s or 100s of kbytes for every update surely deserves a little more acknowledgement.

In fact if you just skimmed the benchmarks you might have missed it but in all write configurations (sync, async, random, sequential, batched) except for batched-sequential writes, leveldb performs better, occasionally significantly better.

Given that high update throughput is a strong selling point for leveldb and the fact that LMDB was designed initially for a high-read ratio use case I feel that despite the presence in stats all of the rest of the literature seems to ignore this trade-off as if it wasn't there at all.

File Fragmentation

The free-list design for reclaiming disk space without costly garbage collection or compaction is probably the most important advance here over other COW B-tree designs. But it seems to me that the resulting fragmentation of data is also overlooked in discussion.

It's primarily a problem for sequential reads (i.e. large range scans). In a large DB that has been heavily updated, presumably a sequential read will on average end up having to seek backwards and forwards for each 4k page as they will be fragmented on disk.

One of the big benefits of LSM Tree and other compacting designs is that over time the majority of the data ends up in higher level files which are large and sorted. Admittedly, with leveldb, range scans require a reasonable amount of non-sequential IO as you need to switch between the files in different levels as you scan.

I've not done any thorough reasoning about it but seems from my intuition that with leveldb the relative amount of non-sequential IO needed will at least remain somewhat linear as more and more data ends up in higher levels where it is actually sequential on disk. With LMDB it seems to me that large range scans are bound to perform increasingly poorly over the life of the DB even if the data doesn't grow at all, just updates regularly.

But also, beyond the somewhat specialist case of large range scans, it seems to be an issue for writes. The argument given above is that large writes are OK because they are sequential IO but surely once you start re-using pages from the free list this stops being the case. What if blocks 5, 21 and 45 are next free ones and you need to write 3 tree pages for your update? I'm aware there is some attention paid to trying to find contiguous free pages but this seems like it can only be a partial solution.

The micro benchmarks show writes are already slower than leveldb but I'd be very interested to see a long-running more realistic benchmark that shows the performance over a much longer time where fragmentation effects might become more significant.

Compression

The LMDB benchmarks simply state that "Compression support was disabled in the libraries that support it". I understand why but in my opinion it's a misleading step.

The author states "any compression library could easily be used to handle compression for any database using simple wrappers around their put and get APIs". But that is totally missing the point. Compressing each individual value is a totally different thing to compressing whole blocks on disk.

Consider a trivial example: each value might look like {"id": 1234567, "referers": ["http://example.com/foo", "https://othersite.org/bar"] }. On it's own gzipping that value is unlikely to give any real saving (the repetition of 'http' possibly but the gzip headers is more than the saving there). Whereas compressing a 4k block of such results is likely to give a significant reduction even if it is only in the JSON field names repeated each time.

This is a trivial example I won't pursue and better serialisation could fix that but in my real-world experience most data even with highly optimised binary serialisation often ends up with a lot of redundancy between records - even if it's just in the keys. Block compression is MUCH more effective for the vast majority of data types than the LMDB author implies with that comment.

Leveldb's file format is specially designed in such a way that compression is possible and effective and it seems Google's intent is to use it as a key part of the performance of the data structure. Their own benchmarks show performance gains of over 40% with compression enabled. And that is ignoring totally the size on-disk which for many will be a fairly crucial part of the equation especially if relatively expensive SSD space is required.

One argument might be that you could apply compression at block level to LMDB too but I don't think it would be easy at all. It seems like it relies on fixed block size for it's addressing and compressing contents and leaving blanks gives no disk space saving and probably no IO saving either since all 4k is likely still read from disk.

I'm pretty wary of the benchmarks where leveldb has compression off since I see it as a fairly fundamental feature of leveldb that it is very compression friendly. Any real implementation would surely have compression on since there are essentially no downsides due to the design. It's also baked in (provided you have the snappy lib) and on by default for leveldb so it's not like it's an advanced bit of tuning/modification from basic implementation to use compression for leveldb.

Maybe I'm wrong and it's trivial to add effective compression to LMDB but if so, and doing it would give ~40% performance increase why is it not already done and compared?

I'd like to see the benchmarks re-run with compression on for leveldb. Given writes are already quicker for leveldb this more realistic real-world comparison might well give a better insight into the tradeoffs of the two designs. If I get a chance I will try this myself.

Large Transactions Amplify Writes Even Further

LMDB makes a big play of being fully transactional. It's a great feature and implemented really well. My (theoretical) problem is to do with write performance - we've already seen how writes can be slower due to COW design but how about the case when you update many rows in one transaction.

Consider worst case that you modify 1 row in every leaf node, that means that the transaction commit will re-write every block in the database file. I realise currently that there is a limit on how many dirty pages can be accumulated by a single transaction but I've also read there are plans to remove this.

Leveldb by contrast can do an equivalent atomic batch write without anywhere near the same disk IO in the commit path. It would seem this is a key reason leveldb is so much better in random batch write mode. Again I'd love to see the test repeated with leveldb compression on too. [edit] On reflection, probably not such a big deal - writes to the WAL in leveldb won't be affected by compression. [/edit]

It may not be a problem for your workload but actually it might. Having certain writes use so much IO could cause you some real latency issues and given single writer lock, could give you similar IO-based stalls that leveldb is known for due to it's background compaction.

I'll repeat this is all theoretical but I'd want to understand a lot more detail like this before I used LMDB in a critical application.

Disk Reclamation

Deleting a large part of the DB does not free any disk space for other DBs or applications in LMDB. Indeed there is no built in feature or any tools I've seen that will help you re-optimise the DB after a major change, nor help migrate one DB to another to reclaim the space.

This may be a moot point for many but for practical systems, having to solve these issues in the application might add significant work for the developer and operations teams where others (leveldb) would eventually reclaim the space naturally with no effort.

Summary

I feel to counter the potentially negative tone I may have struck here, I should sum up by saying LMDB looks like a great project. I'm genuinely interested in the design and performance of all the options in this space.

I would suggest that a real understanding of the strengths and weaknesses of each option is an important factor in making real progress in the field. I'd humbly suggest that, if the author of LMDB was so inclined, including at least some discussion of some of these issues in the docs and benchmarks would benefit all.

I'll say it again if Howard or anyone else who's played with LMDB would like to comment on any of these issues, I'm looking forward to learning more.

So is LMDB a leveldb killer? I'd say it seems good, but more data required.

 

Meet Handlebars.js

In making this blog I ended up using Yehuda Katz' Handlebars.js for templating. It has some intersting features I'll introduce here, but arguably dilutes Mustache's basic philosophy somewhat.

I found Handlebars to be a powerful extension to Mustache but I want to note up-front that it quite possibly isn't the best option in every case. Certainly if you need implementations outside of Javascript it's not (yet) for you, however I'm also aware that the extra power added comes with a potential cost: you can certainly undo many of the benefits of separating logic and template.

With that note in place. I'll introduce the library.

Why Handlebars?

Yehuda has already outlined his rationale for creating Handlebars so I won't go into too much detail here. The important goals can be summed up as:

I encourage you to read his article for a lot more detail and explanation of those points but we'll crack on for now.

I won't cover all the features here. You can read them in the documentation. For now I want to highlight the power (and possible danger) of helpers.

In the case of my static site generation system, my main goal was to have a very thin layer of logic on top of simple content-with-meta-data files with some simple naming conventions. I wanted flexibility in the templating system so that I could generate menus or listings of content without writing extra code for each case.

With Mustache, this flexibility had to happen in the view layer and so became a little clumsy to express in a general and extensible way the data sets required for any page.

It turned out to be much neater and require a lot less "magic" code to be able to make the templates a little more expressive. Helpers were the key.

Helpers

Handlebars adds to Mustache the ability to register helpers that can accept contextual arguments. Helpers are simply callbacks that are used to render {{mustaches}} or {{#blocks}}{{/blocks}}. They can be registered globally or locally in a specific view. We'll use global registration here to keep examples clearer.

Here's a basic example of a block helper that could be used for rendering list markup.

<h1>{{title}}</h1>
{{#list links}}
    <a href="{{url}}">{{name}}</a>
{{/list}}

Here's the context used

{
    title: 'An example',
    links: [
        {url: 'http://example.com/one', name: 'First one'},
        {url: 'http://example.com/two', name: 'Second one'}
    ]
}

And here is the list helper definition:

Handlebars.registerHelper('list', function(links, options){
    var html = "<ul>\n";
    for (var i = 0; i < links.length; i++) {
        html += "\t<li>" + options.fn(links[i]) + "</li>\n";
    }
    return html + "</ul>\n";
});

When you compile this and render with the context data above, you would get the following output:

<h1>An example</h1>
<ul>
    <li><a href="http://example.com/one">First one</a></li>
    <li><a href="http://example.com/two">Second one</a></li>
</ul>

You can read similar examples in the documentation which have much more complete explanations of the details here but the basics should be clear:

With that very brief overview example, I want to move on to more interesting examples. If you want to read more about the specifics about what happened there then I'd encourage reading the block helpers documentation.

Helpers for Content Selection

Before I continue, I need to acknowledge that what follows breaks everything you know about MVC separation of concerns. I know. Bear with me for now.

My site generation system builds the site files based on filesystem naming conventions. For things like the blog home page I wanted to show the 5 most recent blog posts.

Internally the system reads the whole content file structure and builds an in-memory model of the content. Each directory has two indices: one for all articles with a date in the file name (most recent first) and an index of all other article files in alphabetical order. You can then get the object representing that directory and list the articles in either the date-based or name-based index.

For convenience, I developed an internal API that made this easy using "content URLs" for example Content.get('/p/?type=date&limit=5') which will return the most recent 5 dated articles in the /p/ directory.

From there it is pretty simple to be able to make a block helper that allows templates like this:

<ul>
{{#pages '/p/?type=date&limit=5'}}
    <li><a href="{{url}}">{{title}}</a></li>
{{/pages}} 
</ul>

Next and Previous

But listings aren't the only case this is useful. On the bottom of each blog article I have links to next/previous articles (if they exist) and these need the URL and title of the neighbouring items in the dated index.

I did this with another couple of block helpers. The blog template looks a bit like this:

<h1>{{title}}<h1>
{{{content}}}
<footer>
    {{#prev_page}}
        <a href="{{url}}">&laquo; {{title}}</a>
    {{/prev_page}}
    {{#next_page}}
        <a href="{{url}}">{{title}} &raquo;</a>
    {{/next_page}}
</footer>

The helper itself uses this which is the current context (in this case the main blog article being displayed). It then looks up in the content index the article's parent directory, and locates the previous or next item in the index relative to the current one. It then calls options.fn with the neighbouring article object as context.

Pushing the Boundaries

From here there is a lot of grey areas you could probe with this powerful construct. For example, let's assume you have different modules of your app rendering themselves and then being combined by some layout controller and rendered into a layout.

What if you wanted to have the module's external CSS or JS requirements actually defined in the template that really has the dependency. Right off the bat, I'll say I can't think of a real reason you'd want this and not have it taken care of outside of the templating layer, but…

You could have a helper for ensuring the correct CSS is loaded up-stream in the template like:

{{add_css 'widget.css'}}
<div class="widget">
    ...
</div>

And then have the helper defined such that it adds the arguments passed to the layout controller and returns nothing to be rendered.

Then the layout rendering might link those CSS assets in the head.

You're right. This is almost certainly a bad idea. I mention it because it was something that occurred to me for a second before I recognised that is was an example of probably dangerous usage. When you get the hang of a powerful concept like this it's easy to start seeing every problem that can be possibly solved with it as a good candidate.

As with all powerful programming concepts and libraries, there are many things you can do with Handlebars helpers that are really bad ideas. Hence my note of caution at the start.

Conclusion

I'm quite happy with the extra power Handlebars has given me in this context. But I'm certain that with the extra power comes the inevitable responsibility. It is certainly possible to write crazy and unmaintainable code if you get too creative with helpers without thought.

The examples here are probably not best practice for an MVC web-app context. But here in a site generation script with an already in-memory content model, it allowed me to extend the expressiveness of the system without hard-coding a lot of specific logic for different cases in the model layer.

Handlebars.js has many more features than I have touched on here. Check it out. It may just be what you are looking for if you really like Mustache's philosophy but have a need (and the discipline) to make more expressive helpers.

 

Fancy New Blog

Same poor content, new styling (and backend)

I made my blog a few years ago as a way to learn Rails and after a year in which I posted only two new articles of very low value, I got inspired to give it a revamp.

This article is a long-and-yet-brief overview of the changes..

Design

The style just suits my tastes better. When I made the last version, I was much more focussed on learning Rails and the aesthetics of the site became somewhat secondary. I was inspired by some beautiful and elegant sites I've seen recently and this design was the result (for now). I have dreams of adding beautiful imagery and other fancy things to some articles too, but we'll see.

Fonts are from Google's Webfonts rather than TypeKit's free plan because there is much more freedom without fees. I also get to download the fonts I use here for offline use.

Technology

I went back to basics for this. Since I made the last version of this blog, my tastes in tech have changed a bit. I've come to value simplicity and efficiency more and more. Having a full Rails stack, web servers, proxies, database, user authentication, SSL certificates etc. suddenly seems like a really ugly solution for what is essentially a simple, static site only updated by me.

So this site is a static site.

After I finished a lot of the work on this new system described below, I came across an article from a friend of mine about his CMS solution. It turns out he had a lot of the same ideas and he does a great job of expressing his rationale for moving away from Wordpress-like apps for CMS. I link to that now to save you from more clumsy words from me repeating many of the same things less eloquently.

Managing Static Content

So this site is just HTML files served by good old Apache. Nginx probably would be my first choice on a dedicated server but I'm enjoying my current stay on Webfaction and this is the most appropriate configuration here.

Managing a static site by hand is so 1990s, clearly we can do better than that.

There are actually a bunch of great static CMSs out there that would have been great, Jekyll (Github Pages), Statamic (Commercial) and Kirby being the main ones I came across. Typically, I ended up building my own for no terribly good reason other than it being a good excuse to learn something and end up with exactly the features I need.

The site generation is done by a Node.js app. The content is managed through the file system with a simple naming scheme allowing for articles to participate in ordered indices. For example, if the file name begins with a YYYY-MM-DD date format then it will be added to a newest-first by-date index for that directory. More on these indices later.

The content files themselves are then simply Markdown files with yaml-front-matter to add some meta-data to each. Meta-data typically includes the title (so it can be re-used for page title and in listings/RSS) and a template file to use to render that page.

Templating uses Handlebars which is Mustache with a little more flexibility. This extra flexibility becomes really useful in conjunction with the content indices I mentioned before. For example, all the posts on this blog are dated files in the /p/ directory. To generate the listing on the front page of the site I just need to make a static page called /index.md with meta-data assigning a template that does something like:

{{#pages '/p/?type=date&limit=5'}}
    <article>
        {{> blog_post_body.mu }}
    </article>
{{/pages}}

And my custom pages helper can go and find the date index for the /p/ directory and pull out most recent five articles.

As well as being defined per file in YAML front matter, meta-data defaults for a whole directory can be set in a defaults.yml file (e.g. all blog posts use the same template so it is declared once in /p/defaults.yml) and these defaults are inherited through the content directory hierarchy.

There is another special content file naming convention for specifying RSS feeds (e.g. /blog.rss.yml -> /blog.rss.xml) where the feed meta-data and an internal "content URL" like the one in the template example above are used to generate an XML RSS feed.

Finally, any files that are not .md or .yml in the content directory are copied directly (symlinked) into the final public document root, so that all static assets like images, JS and CSS can be kept versioned with the rest of the content and the entire document root is managed by the generator script.

Publishing

Content is edited through file system and kept version-controlled in a git repository along with the templates and (currently) the node app that generates the site.

I installed a post-update hook in the git repo on the web-server that automatically checks out HEAD and re-runs the generation script. So I can deploy changes by editing files locally, committing, and then running git push production.

I have toyed with the idea of building a web interface for editing. In fact I did have a working prototype using EpicEditor and a node.js REST API (using restify) for editing in an earlier version of the system. But, having settled on the simplicity of fully version-controlled content and no daemons or security to worry about on the server, I'm sticking with local edit and git-push deployment for now.

I am using Mou to write this right now with instant, correctly-styled preview. It works really well, especially when tied it into Sublime Text 2 which I am using to edit the rest of the templates and js files.

Conclusion

I like it. It's been fun to think about and build and has lots of potential for future tinkering.

I may even stick a skeleton version of the site with generation scripts etc. on github although I doubt anyone could have a real desire to use this over one of the more widely used and much better-tested options I listed above.

Now I just need to try to focus on producing some interesting content…

 

PHP Arrays (Again)

I have mentioned PHP array inefficiency a few times on this blog.

Discussing it at work today someone linked me to a much more thorough review of the topic that is interesting and readable.

I'm finding myself so much more interested in this level of stuff than the typical PHP programmer which I guess is why I spend my free time playing with other, generally statically typed, languages...