Document clustering at lightning speed
by Justin Graves on July 6, 2023
Behind the scenes building New Narratives
This is the first in a series of technical blog articles, giving you some insight into the endless, obsessive passion that builds our software here at Infegy. We have big ambitions, and our brilliant engineering team never ceases to make our dreams a reality, thanks to a deep understanding of modern systems and an ability to extract all they have to offer.
Not your standard document bucketing
At Infegy, we are continuously working on ways to better explain what's going on in data in order to get to the "why" more quickly. One of our recent developments, Narratives, is a powerful tool that groups data within your query into "buckets" - each bucket containing documents that are syntactically similar to each other. Additionally, we link across buckets so you can see how different clusters relate to each other. The idea is similar to a process known as latent semantic analysis.
This type of algorithm is considered fairly well understood - in fact, we already had a previous version running in Infegy Atlas. For Narratives, however, we set our ambitions very high. To build these buckets, we need to do complex comparison operations, ideally on every possible pair of documents. For example, out of just 1,000 documents, there are 499,500 possible pairs (1,000 choose 2). The number of pairs grows roughly with the square of the input size, so these comparisons get computationally expensive very quickly.
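To make the growth concrete, here is a minimal sketch of the pair-count arithmetic. The function name `pair_count` is ours, chosen for illustration; it is not taken from the Atlas code.

```cpp
#include <cstdint>

// Number of unordered pairs among n documents: n choose 2 = n * (n - 1) / 2.
// Illustrative helper only; the name is an assumption, not Atlas code.
constexpr std::uint64_t pair_count(std::uint64_t n) {
    return n * (n - 1) / 2;
}

// 1,000 documents give the 499,500 pairs quoted above.
static_assert(pair_count(1000) == 499500, "1,000 choose 2");

// 100x more documents means roughly 10,000x more pairs: about five billion.
static_assert(pair_count(100000) == 4999950000ULL, "100,000 choose 2");
```

Going from 1,000 to 100,000 documents multiplies the input by 100 but the number of pairs by roughly 10,000, which is why a naive all-pairs approach becomes a bottleneck so quickly.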
Figure 1: Example of thousands of documents clustered by conversation; Infegy Atlas data.
Building speed and computational efficiency
Additionally, these comparisons analyze the similarity between the content of the documents, which could be hundreds of words long. You can see how this quickly adds up and would be challenging to execute in time to maintain Atlas's famous immediacy. Yet we did! This system runs in-depth, robust comparisons across what can exceed 100,000 documents for your query, groups them together, and determines things like aggregate sentiment, gender, median age, trends, and more. It then returns this wealth of information to you in less than a second!
So how on Earth did we do that? Surely we took some shortcuts... Well, no. As with all of Atlas, this algorithm is written in C++, using code optimized to the hardware we run on to extract the maximum possible performance. We lay data out so that it is densely packed and aligned for vectorization (SIMD), and we use SIMD instructions to do the needed work at the highest possible throughput. The document comparison algorithm, for example, performs 2.2 trillion document comparisons per second on a single server! Comparisons are done on dense bitfields: we combine them with efficient bitwise operations and then count the overlap using population count instructions. CPUs have dedicated instructions for these operations, which typically execute in just a few CPU cycles, so they’re incredibly fast and scale really well.
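As a rough illustration of the bitfield idea, the sketch below ANDs two packed bitfields word by word and counts the bits they share. It is a scalar simplification under stated assumptions, not the Atlas implementation: the `Bitfield` type, the `overlap` function, and the meaning of each bit are hypothetical, and the hand-tuned SIMD layout described above is not shown.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation: each document is a fixed-length bitfield,
// packed 64 bits per word. What each bit encodes is an assumption here.
using Bitfield = std::vector<std::uint64_t>;

// Count the bits set in both documents: AND each pair of words,
// then count the set bits (population count) of the result.
std::size_t overlap(const Bitfield& a, const Bitfield& b) {
    assert(a.size() == b.size());  // both documents must use the same layout
    std::size_t shared = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // std::bitset<64>::count() is portable standard C++; compilers can
        // typically lower it to a hardware popcount when the target has one.
        shared += std::bitset<64>(a[i] & b[i]).count();
    }
    return shared;
}
```

Each loop iteration compares 64 bits with one AND and one population count, and the loop has no branches that depend on the data, which is the kind of shape that vectorizes well.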
The result of this incredible throughput is the most powerful system of this kind available. And on top of that, our API will output clustering information for up to 100,000 documents at once! This can generate some beautifully dense graphics, giving an almost artistic look at a conversation. Because you can get the data directly from the API, you can also build your own visualizations within and around clusters.
Figure 2: The same data as Figure 1, but viewed with sentiment; Infegy Atlas data.
Takeaways
Infegy Atlas has revolutionized document clustering with its lightning-fast speed and computational efficiency, showcasing Infegy's commitment to providing insightful, efficient, and scalable data analysis solutions. We leverage optimized, hardware-specific C++ code to perform 2.2 trillion document comparisons per second, resulting in the most powerful system of its kind. With the ability to cluster up to 100,000 documents at once and an API that allows for custom visualizations, we empower users to gain deep insights and create visually stunning representations of data.
Interested in learning more? Schedule your demo today.