Intro

PRPL recently helped LEGO® provide nearly 6,000 comic books to attendees at San Diego Comic-Con (SDCC), the largest comic convention in the world. Simple, right?

Add in some complexities, like custom-made comic books, user-generated characters, dynamic stories, and on-demand printing in under 5 minutes, and things suddenly become far less simple.

To make these books, we had to choose a tech stack that could handle an insane amount of throughput without sacrificing quality. Additionally, whatever architecture we landed on needed to be fault-tolerant, with as few single points of failure as possible.

Tackling an Exciting Challenge

With so many variables and moving parts, our decisions were very calculated, and we intentionally pushed the limits of certain components of the architecture. For example, while we didn’t expect every single one of the 130,000 attendees to interact with the LEGO booth in only a few days, we designed for a throughput that could churn out that many comic books if needed. Dividing 130,000 attendees evenly across the hours the show floor was open works out to roughly one comic book per second. Through parallel processing and efficient code, we built a system that could handle that throughput with ease.

While keeping that efficiency in mind, our goals were also focused on reliability and exceptional quality despite hardware constraints. Limited processing power on the touchscreen hardware meant our touchscreen software needed to use as little memory and GPU as possible.

At the same time, the development timeline was only 8 weeks long. To account for this, we chose an architecture that would allow us to share code between different components of the experience.

How We Architected the “Impossible”

Monorepo

Rather than having individual repos for each component of the experience, we created a monorepo so we could share code between components. This kept the codebase DRY (don’t repeat yourself) and helped us hit the short development timeline.

Both the server and the client-side touchscreen software were written in JavaScript. Node.js powered the server-side API and post-processing setup, while React and ThreeJS drove the touchscreen software. The same ThreeJS code that rendered the Minifigure preview on the touchscreens also rendered the attendee’s Minifigure inside their unique comic book.
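
As a rough illustration of how that sharing was laid out (the directory and package names here are hypothetical, not our actual structure):

```
lego-sdcc/
├── packages/
│   ├── minifig-renderer/   # shared ThreeJS scene + Minifigure assembly code
│   ├── touchscreen-app/    # React app running on the booth touchscreens
│   └── comic-server/       # Node.js API, queue workers, and comic generator
└── assets/                 # manifests, textures, and geometry (see Code Generation)
```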

Chromium for Generating Comic Books

We were able to find libraries that could render ThreeJS scenes in Node.js and output the scene to an image file. However, we quickly discovered that the render quality of the virtual canvas in Node.js wasn’t up to the standard we were aiming to meet. Realizing we had already seen great results in the browser, we decided not to reinvent the wheel and simply used the power of the browser to give us high quality renders.

To render comic book panels, we wrote a system that would spin up a headless Chromium process and mount the application there. It used Chrome’s rendering engine to draw the scene, and then we output the HTML canvas as a base64-encoded image. The base64 images for each scene were then compiled together into a PDF and sent to the printer.
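
Here’s a minimal sketch of the idea, assuming Puppeteer drives headless Chromium (the panel URL and canvas selector are illustrative):

```js
const puppeteer = require('puppeteer');

// Renders one comic panel by loading the ThreeJS scene in headless Chromium
// and pulling the finished canvas out as a base64-encoded PNG.
async function renderPanel(panelUrl) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 2048, height: 2048 });
    await page.goto(panelUrl, { waitUntil: 'networkidle0' });

    // toDataURL() gives us the pixels exactly as Chromium drew them.
    // Assumes the ThreeJS renderer was created with preserveDrawingBuffer: true
    // so the WebGL buffer is still intact at capture time.
    const dataUrl = await page.evaluate(() => {
      const canvas = document.querySelector('#panel-canvas');
      return canvas.toDataURL('image/png');
    });

    return dataUrl.replace(/^data:image\/png;base64,/, '');
  } finally {
    await browser.close();
  }
}
```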

Queueing and Parallel Processing

Our initial load tests showed that comic books took up to 30 seconds to render. This wouldn’t meet the goal of being able to scale to 1 comic book per second. For that to happen, the obvious choice was parallel processing. We wrote a locking mechanism that would allow more than one process to run in parallel without conflicting with each other.

Let’s imagine a queue with 10 pending items in it and 2 processes running in parallel to work through that queue. Without proper locking, there’s a chance the 2 processes could attempt to render the same comic book at the same time. Applying this example to what we built, here’s how the events would occur when using locking:

  • Process A and Process B would be started
  • Both processes would attempt to lock the first free item in the queue that they could find
  • Whichever process locked it first would gain ownership over it (let’s say it’s Process A in this example)
  • Process A would begin rendering comic book #1
  • Process B, having failed to lock item 1, would move on to item 2 and attempt to lock it
  • Process B would begin rendering comic book #2 in parallel, while Process A was still working on comic book #1
  • Process A would finish rendering comic book #1 and send it to the printer to be printed
  • Process A would update the status of comic book #1 to reflect that it’s been sent for printing
  • Process A would find the next unlocked item in the queue (comic book #3) and attempt to lock it

This example shows how worker processes would make their way through the queue. When there were no items in the queue, all the worker processes would just wait until something was placed in the queue, then race to lock/claim the next one that came in.
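
The claim-and-lock step is the heart of this. Here’s a minimal sketch, assuming a SQLite-backed queue accessed with better-sqlite3 (the table and column names are hypothetical):

```js
const Database = require('better-sqlite3');

const db = new Database('queue.db');
db.pragma('journal_mode = WAL'); // friendlier to multiple worker processes

// Atomically claims the oldest pending comic book for this worker.
// The single UPDATE acts as the lock: only one worker's UPDATE can move
// a given row out of 'pending', so two workers can never claim the same book.
function claimNextComic(workerId) {
  const result = db.prepare(`
    UPDATE comic_queue
       SET status = 'rendering', locked_by = ?
     WHERE id = (
       SELECT id FROM comic_queue
        WHERE status = 'pending'
        ORDER BY created_at
        LIMIT 1
     )
       AND status = 'pending'
  `).run(workerId);

  if (result.changes === 0) return null; // queue empty, or we lost the race

  return db.prepare(`
    SELECT * FROM comic_queue
     WHERE locked_by = ? AND status = 'rendering'
     ORDER BY created_at LIMIT 1
  `).get(workerId);
}
```

Because the claim is a single statement, a worker either owns the row outright or gets nothing back and simply tries the next item.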

This system got us much closer to our goal of one comic book per second at peak load, but not all the way there. We moved on and planned to revisit performance later to close the gap.

Performance Optimization

Within both the touchscreen software and the Node.js comic generator, we were using high-fidelity assets provided by LEGO. These were the same 3D models used in their movies and video games. This meant they had a lot (and I mean a LOT) of polygons. 

Pretty early on in the development process, the touchscreen software started crashing because it was running out of memory. We optimized the 3D files to reduce the polygon count, and we set up a dynamic loading architecture. Originally, every asset (hats, faces, torsos, etc.) was loaded up front and kept ready to use. With dynamic loading, we instead generated a manifest that described each asset but didn’t load it until it was needed. For example, when an attendee tapped a different torso, we unassigned the current torso textures and deallocated the memory they had used, loaded the new textures into memory, assigned them to the torso geometry, and then triggered a redraw of the canvas.
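
Here’s a minimal sketch of that swap in ThreeJS (the manifest shape, mesh name, and helper signature are illustrative):

```js
import * as THREE from 'three';

const textureLoader = new THREE.TextureLoader();

// Swaps the torso texture when the attendee taps a new option.
// Disposing the old texture releases its GPU memory before the new one is
// loaded, which is what kept the app inside its memory budget.
function swapTorsoTexture(torsoMesh, manifestEntry, renderer, scene, camera) {
  const oldTexture = torsoMesh.material.map;
  if (oldTexture) {
    torsoMesh.material.map = null;
    oldTexture.dispose(); // free the GPU memory held by the previous texture
  }

  textureLoader.load(manifestEntry.textureUrl, (texture) => {
    torsoMesh.material.map = texture;
    torsoMesh.material.needsUpdate = true;

    // Trigger a redraw now that the new texture is in place.
    renderer.render(scene, camera);
  });
}
```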

The only downside to this approach was a slight frame-rate drop when a different asset was tapped, but the dip was too brief for the human eye to register. Overall, it was worth it because it solved the performance issues we had been experiencing.

The monorepo proved its worth here: because the comic generator shared the same rendering code, the touchscreen optimizations also dropped server-side render times to just 10 seconds per comic book. At 10 seconds per book, 10 parallel worker processes got us to our target of one comic book per second, and with almost 20 cores on the onsite server, 10 processes ran without issue.

Network Architecture

It’s easy to default to what you’re used to doing, and for most projects, relying on an internet connection is perfectly acceptable. But at SDCC, we needed to operate at full capacity without a consistent internet connection.

All components of the booth communicated over a private network, and calls to internet services were built in an asynchronous, non-blocking way.

The touchscreens communicated with the server via GET and POST requests against a Node.js API. The Node.js process would respond to API requests and update local database records (and the comic book queue) when necessary. Separate worker processes on the server would listen for changes in the queue and respond accordingly. When a comic book was ready for printing, the worker process would send it to the locally-networked printer using the Samba networking protocol. 
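
The article-level view of that flow, sketched out (Express, the enqueueComic helper, and the mounted printer share path are all assumptions, not our exact code):

```js
const express = require('express');
const fs = require('fs/promises');
const path = require('path');

const app = express();
app.use(express.json());

// Touchscreens POST the attendee's finished design here. The handler only
// writes a queue record, so the touchscreen never waits on rendering or printing.
app.post('/api/comics', (req, res) => {
  const id = enqueueComic(req.body); // hypothetical helper that inserts a 'pending' queue row
  res.status(202).json({ id });
});

// Called by a worker once a PDF is rendered: "sending to the printer" is just
// copying the file onto the Samba share the printer watches.
// PRINTER_SHARE is an assumed mount point for that share.
const PRINTER_SHARE = '/mnt/printer-queue';
async function sendToPrinter(pdfPath) {
  await fs.copyFile(pdfPath, path.join(PRINTER_SHARE, path.basename(pdfPath)));
}

app.listen(3000);
```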

Additionally, we used Twilio to send SMS notifications to attendees, and we emailed them the cover of their comic book. These integrations required an internet connection, but if the connection was lost, the notifications simply queued up and were sent later.
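
A sketch of that non-blocking pattern with Twilio’s Node client (the queue rows, retry policy, and markSent helper are assumptions):

```js
const twilio = require('twilio');

const client = twilio(process.env.TWILIO_SID, process.env.TWILIO_TOKEN);

// Notifications are fire-and-forget from the booth's point of view: a failed
// send (e.g. the uplink is down) just leaves the row queued for a later retry
// and never blocks the rendering or printing pipeline.
async function sendQueuedNotifications(pendingRows, markSent) {
  for (const row of pendingRows) {
    try {
      await client.messages.create({
        to: row.phone,
        from: process.env.TWILIO_FROM,
        body: `Your LEGO comic is ready! Cover: ${row.coverUrl}`,
      });
      markSent(row.id); // hypothetical helper that flips the row's status
    } catch (err) {
      // Leave the row queued; a later pass retries once connectivity returns.
    }
  }
}
```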

Many potential points of failure were mitigated with this locally networked configuration, and WAN speeds weren’t a factor in the throughput speed of the booth.

Code Generation

I’m admittedly very proud of everything described so far, but what’s possibly most impressive is our home-grown code-generation pipeline.

When software is this complex, it can be tough to maintain. Any change in assets can impact multiple parts of the codebase and can mean a lot of manual work by engineers and designers.

Let’s take faces, for example. Each face option had a preview in the touchscreen app. This preview was a PNG rendering of the face on the head geometry. There was also an optimized texture for the touchscreen app, as well as a high-fidelity texture for the server-side comic generator. Those are all pieces that would normally need to be manually generated. And that’s describing the simplest of all the options. Options with geometry, such as hats, would also need 2 versions of the 3D models: a high-fidelity version for the server-side comic generator, and a lower-poly version for the touchscreen app.

Changing these assets would have been far more time-consuming had we not built the code-generation pipeline. A new face meant updating asset manifests and re-rendering multiple previews and textures. Swapping out a headpiece meant rendering a 3D scene to generate the preview, extracting and compressing all the textures, and creating multiple versions of the geometry at various levels of detail.

We wrote the code-generation pipeline to run continuously on our development machines, listening for changes in certain project folders. Whenever a change occurred, the asset manifests were rewritten, the previews were regenerated, and every asset was optimized at each necessary level of detail. All we had to do to change an asset was drop new files into a folder. Work that would have taken hours happened in seconds.
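
A minimal sketch of that watcher, assuming chokidar for file watching (the folder path and the rebuild helpers are hypothetical stand-ins for our real pipeline steps):

```js
const chokidar = require('chokidar');

// Watches the asset drop folders; any added or changed file kicks off a
// rebuild of the manifests, previews, and optimized asset variants.
const watcher = chokidar.watch('assets/source', { ignoreInitial: true });

watcher.on('all', async (event, filePath) => {
  if (event !== 'add' && event !== 'change') return;

  console.log(`${event}: ${filePath} — regenerating assets`);
  await rebuildManifests();         // rewrite the asset manifests
  await renderPreviews(filePath);   // PNG previews for the touchscreen UI
  await optimizeAssets(filePath);   // high- and low-poly geometry, compressed textures
});
```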

Expecting the Unexpected

Temporary Network Outage

On media day, a line that was supplying internet connectivity was accidentally cut by another company running setup at the convention center. Not a big deal, except for the fact that it occurred minutes before the floor opened to all the members of the media.

Luckily, the architecture outlined above was built with risks like this in mind. LEGO visitors never experienced an interruption in designing and printing their comics because of the locally networked components. And thanks to our queue system, all visitors were emailed their custom comic’s cover once the internet connection was restored. 

Word Filtering

Users had countless design options to choose from when building their Minifigures, and part of that customization was the character’s name. Anticipating the playfully curious nature of attendees, we implemented a word filter. Not a single crude name made it through the word filter, and every innocent name made it through without issue.
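
At its simplest, a filter like this is a blocklist check against the entered name; a minimal sketch (the blocklist contents here are placeholders):

```js
// Minimal blocklist-style filter: reject a character name if any blocked term
// appears inside it, after normalizing case and stripping punctuation.
const blocklist = ['badword1', 'badword2']; // placeholder entries

function isNameAllowed(name) {
  const normalized = name.toLowerCase().replace(/[^a-z0-9 ]/g, '');
  return !blocklist.some((term) => normalized.includes(term));
}
```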

Looking Back

In every sense, the project was a major success. In retrospect, there’s not much I would have done differently. We were able to build an amazing experience with ingenuity and determination. Every engineer who had a part in this project brought a unique perspective to the table, and we built something together that surpassed the expectations of not only LEGO but every attendee who participated in the LEGO booth experience.