Document clustering at lightning speed

Written by Justin Graves | July 6, 2023

Behind the scenes building New Narratives

This is the first in a series of technical blog articles, giving you some insight into the endless, obsessive passion that builds our software here at Infegy. We have big ambitions, and our brilliant engineering team never ceases to make our dreams a reality, thanks to a deep understanding of modern systems and an ability to extract all they have to offer.

Not your standard document bucketing

At Infegy, we are continuously working on ways to better explain what's going on in data in order to more quickly get to the "why." One of our recent developments, Narratives, is a powerful tool that groups data within your query into "buckets" - each bucket containing documents that are syntactically similar to each other. Additionally, we link across buckets so you can see how different clusters relate to each other. This is similar in idea to a process known as latent semantic analysis.

This type of algorithm is considered to be fairly well-understood - in fact, we already had a previous version running in Infegy Atlas. For Narratives, however, our ambitions were set very high. To build these buckets, we need to do complex comparison operations, ideally on every possible pair of documents. For example, just 1,000 documents contain 499,500 distinct pairs (1,000 choose 2). The number of pairs grows quadratically with the input size, so these comparisons quickly become very computationally expensive.
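
As a rough illustration of that growth (not the Atlas code itself), the pair count can be sketched in a few lines of C++. The function name pair_count is hypothetical and chosen only for this example.

```cpp
#include <cstdint>

// Number of unordered document pairs among n documents:
// n choose 2 = n * (n - 1) / 2.
// Hypothetical helper for illustration; safe from overflow for any
// realistic document count (n up to roughly 4 billion).
constexpr std::uint64_t pair_count(std::uint64_t n) {
    return n < 2 ? 0 : n * (n - 1) / 2;
}

// Checked at compile time: the 1,000-document example from the text.
static_assert(pair_count(1000) == 499500ULL,
              "1,000 choose 2 should be 499,500");
```

The same formula gives 4,999,950,000 pairs for 100,000 documents, which is why comparing every pair at that scale demands so much throughput.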

Figure 1: Example of thousands of documents clustered by conversation; Infegy Atlas data.

Building speed and computational efficiency

Additionally, each comparison analyzes the similarity between the content of two documents, which could be hundreds of words long. You can see how this quickly adds up and would be challenging to execute in time to maintain Atlas's famous immediacy. Yet that is exactly what we did! This system runs in-depth, robust comparisons across what can exceed 100,000 documents for your query and groups them together. It also determines things like aggregate sentiment, gender, median age, trends, and more, and returns this wealth of information to you in less than a second!

So how on Earth did we do that? Surely we took some shortcuts... Well, no. As with all of Atlas, this algorithm is written in C++, using code optimized to the hardware we run on to extract the maximum possible performance. We lay data out so that it is densely packed and aligned for vectorization (SIMD), then use those instructions to do the needed work at the highest possible throughput. The document comparison algorithm, for example, performs 2.2 trillion document comparisons per second from a single server! Comparisons are done on dense bitfields: we combine two bitfields with efficient bitwise operations, then count the overlap using a set of population count instructions. CPUs have dedicated instructions for bitwise operations. These instructions execute in just one or two CPU cycles, so they're incredibly fast and scale really well.
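
To give a feel for the bitfield approach, here is a minimal, portable C++ sketch. It rests on several assumptions of ours: the Bitfield layout, the function names, and the use of Jaccard similarity as the final score are all hypothetical stand-ins, and the production code relies on hand-tuned SIMD rather than a plain loop like this one.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed representation (for illustration only): each document is a
// dense bitfield packed into 64-bit words, one bit per feature.
using Bitfield = std::vector<std::uint64_t>;

// Count the bits set in a single 64-bit word. Compilers can lower
// std::bitset::count() to a hardware population count instruction
// when the target CPU supports one.
inline std::size_t popcount64(std::uint64_t word) {
    return std::bitset<64>(word).count();
}

// Total number of set bits in a document's bitfield.
std::size_t set_bits(const Bitfield& doc) {
    std::size_t total = 0;
    for (std::uint64_t word : doc) total += popcount64(word);
    return total;
}

// Overlap between two documents: AND the words together, then count
// the surviving bits. Words missing from the shorter bitfield are
// treated as zero.
std::size_t overlap(const Bitfield& a, const Bitfield& b) {
    const std::size_t words = a.size() < b.size() ? a.size() : b.size();
    std::size_t shared = 0;
    for (std::size_t i = 0; i < words; ++i) {
        shared += popcount64(a[i] & b[i]);
    }
    return shared;
}

// Jaccard similarity (shared bits / bits set in either document).
// An assumed scoring function, not necessarily the one Atlas uses.
double jaccard(const Bitfield& a, const Bitfield& b) {
    const std::size_t shared = overlap(a, b);
    const std::size_t either = set_bits(a) + set_bits(b) - shared;
    return either == 0 ? 0.0
                       : static_cast<double>(shared) / static_cast<double>(either);
}
```

For example, comparing Bitfield{0xF, 0x0} with Bitfield{0x6, 0x1} yields an overlap of 2 shared bits out of 5 set in either document. Because the inner loop is nothing but an AND and a bit count per word, it is the kind of work that vectorizes well.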

The result of this incredible throughput is the most powerful system of this kind available. And on top of that, our API will output clustering information for up to 100,000 documents at once! This can generate some beautifully dense graphics, giving an almost artistic look at a conversation. Because you can get the data directly from the API, you can also build your own visualizations within and around clusters.

Figure 2: The same data as Figure 1, but viewed with sentiment; Infegy Atlas data.

Takeaways

Infegy Atlas has revolutionized document clustering with its lightning-fast speed and computational efficiency, showcasing Infegy's commitment to providing insightful, efficient, and scalable data analysis solutions. We leverage optimized, hardware-specific C++ code to perform 2.2 trillion document comparisons per second, resulting in the most powerful system of its kind. With the ability to cluster up to 100,000 documents at once and an API that allows for custom visualizations, we empower users to gain deep insights and create visually stunning representations of data.

Interested in learning more? Schedule your demo today.
