How Azure Data Explorer Accelerated Namogoo’s Classification Processes 170x
March 13, 2019
by Dor Baz and Ohad Greenshpan
“The sea, once it casts its spell, holds one in its net of wonder forever,” said Jacques Cousteau, the world-famous sea explorer. Inspired by his forays into the oceans, Microsoft gave Azure Data Explorer, its data analytics service, the codename “Kusto”; the service efficiently facilitates explorations into the sea of Big Data.
So, what is Azure Data Explorer about?
Introduced in the fall of 2018, the Azure Data Explorer (ADX) cloud service can ingest, store, and quickly complete interactive and complex analytics queries on large volumes of structured, semi-structured, and unstructured real-time data, all while maintaining low latency between ingestion and query.
Advertised as a “fast and highly scalable data exploration service”, Azure Data Explorer offers, among other features:
- A variety of data ingestion techniques
- A powerful and simple query language with support for machine learning analysis, as well as support for querying through SQL
- Data analysis capabilities using Python and Jupyter Notebook support
- Added value with plugins for visualizing data through dashboards (Grafana)
- API support for customizing searches and analysis
And what does Namogoo’s architecture require?
For this entire procedure to be practical and efficient, the data analytics tool used for ingestion, querying, and analysis must be fast, scalable, and easy to use.
Life before Azure Data Explorer
Using AWS cloud services, the procedure detailed above is multi-step: each script is downloaded into S3 and then moved over to Redshift and RDS. More than 250 different searches are run on this data to obtain viable results. The entire process (from ingestion to query and analysis) takes over 7 hours – time that is unacceptable at this stage and clearly impedes growth. It should be noted that Namogoo uses Redshift as one of its core databases, storing raw data for research and business analytics. While we use it to analyze large volumes of data across a variety of use cases, we believe it is important to find the tool best suited to the challenge faced in this specific scenario. Since this challenge involves interactive analytics over large volumes of full-text data, we found Azure Data Explorer to be the best-fit solution for this particular use case.
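Conceptually, this search step is a large battery of text searches applied to every ingested script record. As a minimal stdlib-Python sketch of the idea (the record contents, pattern names, and patterns below are hypothetical illustrations, not Namogoo’s actual searches, and the real battery holds hundreds of searches running inside the data platform):

```python
import re

# Hypothetical ingested script records; in production these are
# full client-side scripts collected from customer pages.
records = [
    {"id": 1, "script": "document.cookie = 'session=abc'; fetch('/track')"},
    {"id": 2, "script": "console.log('hello world')"},
    {"id": 3, "script": "new Image().src = '/pixel?c=' + document.cookie"},
]

# A tiny battery of named searches (the real one has 250+).
searches = {
    "reads_cookie": re.compile(r"document\.cookie"),
    "network_call": re.compile(r"fetch\(|new Image\(\)"),
}

def run_searches(records, searches):
    """Run every compiled pattern against every record's script text,
    collecting the ids of the records that matched each named search."""
    results = {name: [] for name in searches}
    for record in records:
        for name, pattern in searches.items():
            if pattern.search(record["script"]):
                results[name].append(record["id"])
    return results

print(run_searches(records, searches))
# → {'reads_cookie': [1, 3], 'network_call': [1, 3]}
```

The cost of this step grows with (records × patterns), which is why pushing it into an engine with full-text indexing pays off so dramatically.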
Life after Azure Data Explorer
With ADX, the process is far more streamlined. Using ADX’s full-text indexing and retrieval, regular expression evaluation, and text parsing, running over 150 searches returns results in 2.5 minutes – a roughly 170x improvement over our existing AWS-based infrastructure. Below you can see the difference between the two architectures:
ADX uses a traditional relational data model to organize its data. Data is organized in tables with a strongly typed schema: records are ordered in columns, each with a name and a scalar data type. Because a column’s data can be structured, semi-structured, or free text, searches can be made more efficient. Full scripts and their metadata can be ingested into ADX, allowing the metadata to be used as filters during searches, thereby further increasing accuracy and improving data collection. Also, as mentioned above, ADX offers API support, enabling customized searches for specific strings or regular expressions.
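To illustrate how metadata filters narrow a full-text search, here is a small Python sketch (the table layout, column names, and pattern are hypothetical; in ADX the equivalent is expressed in the Kusto query language):

```python
import re

# Hypothetical rows in an ADX-style table: typed metadata columns
# ("domain", "page_type") alongside a free-text "script" column.
rows = [
    {"domain": "shop.example", "page_type": "checkout", "script": "fetch('/api/pay')"},
    {"domain": "shop.example", "page_type": "home", "script": "console.log('hi')"},
    {"domain": "blog.example", "page_type": "checkout", "script": "fetch('/api/cart')"},
]

def filtered_search(rows, domain, pattern):
    """Filter on a cheap metadata column first, then run the more
    expensive regex only over the surviving rows' free text.

    A roughly equivalent Kusto query (table/column names assumed):
        Scripts
        | where Domain == "shop.example"
        | where Script matches regex @"fetch\('/api/"
    """
    regex = re.compile(pattern)
    candidates = (r for r in rows if r["domain"] == domain)
    return [r["page_type"] for r in candidates if regex.search(r["script"])]

print(filtered_search(rows, "shop.example", r"fetch\('/api/"))
# → ['checkout']
```

Filtering on typed metadata first means the regex never touches rows that cannot match, which is the same reason metadata filters improve both speed and accuracy in ADX.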
All in all…
“Some of the best theorizing comes after collecting data because then you become aware of another reality.” – Robert J. Shiller, Winner of the Nobel Prize in Economics
And if Azure Data Explorer is the guiding light while spring cleaning your data, let the theorizing begin!
You can find more information on Azure Data Explorer here.