How Azure Data Explorer Accelerated Namogoo’s Classification Processes by 170X

Namogoo Team
  • March 13, 2019

“The sea, once it casts its spell, holds one in its net of wonder forever”, so said Jacques Cousteau, the world-famous sea explorer. Inspired by his forays into the oceans, Microsoft codenamed Azure Data Explorer “Kusto”: a data analytics service that facilitates efficient exploration of the sea of Big Data.

So, what is Azure Data Explorer about?

Introduced in the fall of 2018, the Azure Data Explorer (ADX) cloud service ingests, stores, and quickly runs interactive, complex analytics queries over large volumes of structured, semi-structured, and unstructured real-time data, all while keeping the latency between ingestion and query low.

Advertised as a “fast and highly scalable data exploration service”, Azure Data Explorer offers, among other features:

  • A variety of data ingestion techniques
  • A powerful and simple query language with support for machine learning analysis, as well as support for querying through SQL
  • Data analysis capabilities using Python and Jupyter Notebook support
  • Added value with plugins for visualizing data through dashboards (Grafana)
  • API support for customizing searches and analysis

And, what does Namogoo’s architecture require?

Over time, Namogoo has stored hundreds of millions of JavaScript code snippets and scripts collected from the sessions of online users. Around 1M new scripts are gathered every day and added to this repository. By searching these stored scripts for specific terms and links, we find new malicious code snippets. The analyses and results of these searches are used to identify unauthorized elements in subsequent visitor sessions through data inference, whose values are machine-learned and fed back to make the process smarter.
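To make the idea of searching scripts for terms and links concrete, here is a minimal Python sketch. It is not Namogoo's actual pipeline: the search terms, the link pattern, the `ScriptHit` type, and the sample scripts are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical indicators; the real search terms are not given in this post.
SEARCH_TERMS = ["eval(", "document.write("]
# Simple illustrative pattern for links embedded in a script body.
LINK_PATTERN = re.compile(r"https?://[^\s'\"<>)]+")


@dataclass
class ScriptHit:
    script_id: str
    terms: List[str]
    links: List[str]


def scan_script(script_id: str, body: str) -> Optional[ScriptHit]:
    """Return the matched terms and embedded links, or None if nothing matched."""
    terms = [t for t in SEARCH_TERMS if t in body]
    links = LINK_PATTERN.findall(body)
    if not terms and not links:
        return None
    return ScriptHit(script_id, terms, links)


# Two made-up scripts: one benign, one that writes a third-party image tag.
scripts = {
    "s1": "var x = 1; console.log(x);",
    "s2": "document.write('<img src=\"http://ads.example.test/p.gif\">');",
}

hits = []
for sid, body in scripts.items():
    hit = scan_script(sid, body)
    if hit is not None:
        hits.append(hit)

for hit in hits:
    print(hit.script_id, hit.terms, hit.links)
```

Running hundreds of such searches over hundreds of millions of scripts is where a scan like this stops being practical, which is why an indexed, scalable analytics engine matters.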

For this entire procedure to be practical and efficient, it is vital that the data analytics tool used for ingestion, querying, and analysis be quick, scalable, easy to use, and highly efficient.

Life before Azure Data Explorer

Using AWS cloud services, the procedure detailed above has multiple steps: each script is downloaded into S3 and then loaded into Redshift and RDS. More than 250 different searches are run on this data to get viable results. The entire process, from ingestion to query and analysis, takes over 7 hours, which is unacceptable at this point and obviously impedes growth. It should be noted that Namogoo uses Redshift as one of its core databases to store raw data for research and business analytics. While we use it to analyze large volumes of data for a variety of use cases, we believe it is important to find the tool best suited to the challenge in this specific scenario. Since this analytical challenge involves interactive analytics over large volumes of full-text data, we found Azure Data Explorer to be the best-fit solution for this particular use case.

Life after Azure Data Explorer

With ADX, the process is more streamlined. Using ADX’s full-text indexing and retrieval, regular expression evaluation, and text parsing, a run of over 150 searches returns results in 2.5 minutes, a remarkable improvement over our existing AWS-based infrastructure. Measured against the 7 hours above, that is roughly 170 times faster (420 minutes / 2.5 minutes = 168). Below you can see the difference between the two architectures:

Figure: Azure Data Explorer vs AWS Architecture

ADX uses a traditional relational data model. Data is organized in tables, and each table has a strongly-typed schema: an ordered list of columns, each with a name and a scalar data type. Because a column can hold structured, semi-structured, or free-text data, searches can be made more efficient. Full scripts can be ingested into ADX together with their metadata, so the metadata can serve as filters during searches, which further increases accuracy and improves data collection. Also, as mentioned above, ADX offers API support, which enables customized searches for specific strings or regular expressions.
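As a rough illustration of combining metadata filters with string and regex searches, the Python sketch below assembles a query in ADX's query language (KQL). The table name `Scripts` and the columns `Body`, `CollectedAt`, `ScriptId`, and `Domain` are assumptions made for this example, not Namogoo's real schema; `has`, `matches regex`, `ago()`, and `project` are standard KQL.

```python
from typing import Iterable


def kql_string(value: str) -> str:
    """Quote a Python string as a KQL string literal (escapes backslash and quote)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def kql_verbatim(value: str) -> str:
    """Quote a regex as a KQL verbatim literal (@"..."), doubling any quotes."""
    return '@"' + value.replace('"', '""') + '"'


def build_search_query(
    table: str,
    terms: Iterable[str],
    patterns: Iterable[str],
    lookback: str = "1d",
) -> str:
    """Build a KQL query: metadata filter first, then the text predicates."""
    predicates = [f"Body has {kql_string(t)}" for t in terms]
    predicates += [f"Body matches regex {kql_verbatim(p)}" for p in patterns]
    if not predicates:
        raise ValueError("at least one term or pattern is required")
    lines = [
        table,
        # Hypothetical metadata column used to narrow the search window.
        f"| where CollectedAt > ago({lookback})",
        "| where " + " or ".join(predicates),
        "| project ScriptId, Domain, CollectedAt",
    ]
    return "\n".join(lines)


query = build_search_query(
    "Scripts",  # hypothetical table name
    terms=["eval"],
    patterns=[r"https?://\S+\.example\.test"],
)
print(query)
```

Putting the metadata filter before the text predicates keeps the expensive regex evaluation on a smaller set of rows. A query string like this could then be submitted through one of the ADX client libraries, for example `KustoClient.execute(database, query)` in the `azure-kusto-data` Python package.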

All in all…

 “Some of the best theorizing comes after collecting data because then you become aware of another reality.” – Robert J. Shiller, Winner of the Nobel Prize in Economics

And if Azure Data Explorer is the guiding light while spring cleaning your data, let the theorizing begin!

You can find more information on Azure Data Explorer here.