Big Data Processing & MapReduce in a .NET NoSQL Database

Note: Now you can write .NET code for MapReduce in NosDB. MapReduce allows you to perform complex computations and distribute your load among the database shards.

Big data analytics is being woven into how businesses work, which means organizations are continuously collecting and processing terabytes of data generated from varying sources. The sheer volume and variety of this data make computation expensive. A single SQL query against a huge data set on one server has become a computationally expensive task and, depending on the data size and hardware, can sometimes take days to produce a result.

Because analytics and intelligence tasks access a wide variety of data types from multiple sources, the data store should also handle every type of schema easily.

  • So, is there a data store that improves big data performance and reduces processing time?
  • And what other storage options are available besides RDBMS?

This is where a distributed NoSQL database steps in, as it serves as a distributed computational platform as well as a distributed data store.

NoSQL is a Great Answer

NoSQL document databases specialize in storing unstructured data without pre-defined schemas. NoSQL databases also distribute processing across multiple machines that form a cluster built on commodity hardware. This means that instead of processing everything on a single node, a NoSQL database processes data on all the nodes of a cluster in parallel, which shortens processing time.

Each node locally processes its assigned “portions” of the distributed data, accumulates the results and generates an answer. Hence distributed computing is a technological advancement which enables big data analytics, and several algorithms have been introduced to implement it.

MapReduce – A Distributed Computing Algorithm

One such algorithm for distributed computing is MapReduce. MapReduce is a server-side framework. The framework includes two separate functions – Map and Reduce.

  • The Mapper processes data locally on the server nodes, in parallel, and sends the processed data to the Reducer.
  • The Reducer runs on a single node to aggregate the data and return the result to the client.
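
The two steps above can be sketched in a few lines. This is a minimal, language-neutral illustration in plain Python, not NosDB code: the `SHARDS` data, the function names, and the use of a thread pool to stand in for server nodes are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each inner list stands in for the records held by one server node.
SHARDS = [
    ["on schedule", "delayed", "on schedule"],
    ["cancelled", "on schedule"],
    ["delayed", "delayed", "on schedule"],
]

def map_shard(records):
    """Map step: runs against one node's local records and emits (key, 1) pairs."""
    return [(status, 1) for status in records]

def reduce_pairs(all_pairs):
    """Reduce step: runs once, summing the values emitted for each key."""
    totals = Counter()
    for key, value in all_pairs:
        totals[key] += value
    return dict(totals)

def run_map_reduce(shards):
    # The thread pool simulates the nodes working in parallel.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_node_pairs = list(pool.map(map_shard, shards))
    # Flatten the per-node outputs and hand them to the single reducer.
    flat = [pair for pairs in per_node_pairs for pair in pairs]
    return reduce_pairs(flat)

if __name__ == "__main__":
    print(run_map_reduce(SHARDS))
```

In a real cluster the map step would run where the data lives and only the emitted pairs would travel to the reducing node; here everything runs in one process purely to show the shape of the flow.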

This approach makes extreme data volume and computational complexity far more manageable. Because of this, MapReduce is now widely recognized as a standard algorithm for distributed computing, and it is being integrated into many NoSQL databases to handle complex computations. MapReduce has also been adapted to various technologies and platforms.

MapReduce in .NET NoSQL Databases

NosDB is a 100% native .NET NoSQL database, and as such it hosts a distributed computing environment. Naturally NosDB supports MapReduce. Because of this, data intensive industries such as e-commerce, finance, government and health can get real-time, big data analytics results with native .NET tools.

NosDB’s implementation of MapReduce provides two .NET interfaces as shown below:

  1. IMapper runs on every node in parallel, and transforms values from the local data into meaningful information in key-value pairs.
  2. IReducer runs on a single node and accumulates the results from the Mapper to process the final result. This result is then provided to the client.

With this introduction to how MapReduce works in a distributed environment, you can discover the various ways MapReduce applies to practical business applications. To see how MapReduce helps, let’s take a simple case study.

Airline Company Data Analytics – A Case Study

Let us consider a scenario of an airline that has constant data input of all incoming and outgoing flights. A few data attributes could include:

Flight Details   Flight Times        Delays              Status
FlightNumber     DepTime             ArrivalDelay        Cancelled
Origin           ArrivalTime         DepDelay            Diverted
Destination      ActualElapsedTime   CarrierDelay        On Schedule
Distance         AirTime             LateAircraftDelay   Delayed

This opens a whole stream of questions that can be asked to benefit the business, for example:

  • Busiest Cities
  • Top 20 airports by volume of flights
  • Carrier popularity
  • Best time to travel by location
  • Cascading failures, where delays at one airport create delays at others
  • And more

Analyzing the data to answer these questions helps the business make better decisions that enhance customer experience and improve revenues. Considering the very large amount of flight data stored, it is useful to run such heavy analyses on the server side and send only the result over the network, which reduces network latency and network costs. Hence, the distributed nature of MapReduce makes it a good fit for carrying out computations on a distributed NoSQL database.

Let’s have a look at how NosDB MapReduce answers the following question:

What are the top 3 destinations for flights originating in Seattle?

We begin by optimizing the MapReduce task: we query for only those documents whose origin is “Seattle”, i.e. the result set includes only documents with the field “Origin: Seattle”. This reduces the data burden on the nodes, and MapReduce can then be executed on the relevant data only.
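
The pre-filtering step can be pictured as follows. This is a hedged sketch in plain Python rather than a NosDB query: the sample documents, the flight numbers, and the `select_by_origin` helper are hypothetical, and only the field names come from the attribute list earlier in this post.

```python
# Hypothetical flight documents; field names follow the attributes listed earlier.
FLIGHTS = [
    {"FlightNumber": "NS101", "Origin": "Seattle", "Destination": "Chicago"},
    {"FlightNumber": "NS202", "Origin": "Denver", "Destination": "Seattle"},
    {"FlightNumber": "NS303", "Origin": "Seattle", "Destination": "Boston"},
]

def select_by_origin(documents, origin):
    """Keep only the documents whose Origin field matches the requested city."""
    return [doc for doc in documents if doc.get("Origin") == origin]

# Only these documents would be handed to the map step.
seattle_flights = select_by_origin(FLIGHTS, "Seattle")
print([doc["FlightNumber"] for doc in seattle_flights])
```

Filtering first means the map step never touches flights from other origins, which is the saving the paragraph above describes.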

The following diagram provides a picture of the computational sequence:

[Diagram: MapReduce]

Mapper

With NosDB, the Mapper code is written in .NET and runs inside the same process as the database. This eliminates the serialization and deserialization cost incurred from accessing the data were it to be kept in a different process. Also, the Mapper is always executed in parallel on each node of the NoSQL database. Each node in the database cluster sifts through its local collections and outputs key-value pairs as shown in the ‘Mapper’ column in the diagram above.
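
As a rough illustration of what one node's map step produces, here is a minimal Python sketch. It does not use NosDB's IMapper interface, whose exact signature is not shown in this post; the `map_destinations` function and the sample documents for the node are assumptions.

```python
from typing import Iterable, Iterator, Tuple

def map_destinations(local_documents: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Emit one (destination, 1) pair per flight document stored on this node."""
    for doc in local_documents:
        destination = doc.get("Destination")
        if destination:  # skip documents that lack a destination
            yield destination, 1

# Hypothetical contents of one node after filtering on Origin = Seattle.
NODE2_DOCUMENTS = [
    {"Origin": "Seattle", "Destination": "Chicago"},
    {"Origin": "Seattle", "Destination": "Boston"},
    {"Origin": "Seattle", "Destination": "Chicago"},
]

node2_pairs = list(map_destinations(NODE2_DOCUMENTS))
print(node2_pairs)
```

Every node would run the same function over its own documents, so the work scales with the number of nodes rather than with the size of the whole data set.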

Reducer

The Reducer combines the key-value pairs generated by all Mappers into a sorted dictionary and extracts the result, i.e. the top 3 cities. As shown in the diagram, Chicago occurs once in Node1, twice in Node2 and twice in Node3. The Reducer combines these values on one node as “Chicago: 5”. After calculating the occurrences for all cities, the Reducer sorts the data in descending order to reveal the top 3 cities. Only this reduced result is sent over the network to the .NET client.
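
The reduce step can be sketched in the same spirit. The Chicago counts per node (one, two and two) come from the example above; the other cities, the `NODE_OUTPUTS` structure, and the `reduce_top_destinations` function are hypothetical, and the code is plain Python rather than NosDB's IReducer interface.

```python
from collections import Counter

# Pairs emitted by each node's map step. Chicago counts follow the example
# in the text; the remaining cities are made up for illustration.
NODE_OUTPUTS = {
    "Node1": [("Chicago", 1), ("Denver", 1), ("Denver", 1)],
    "Node2": [("Chicago", 1), ("Boston", 1), ("Chicago", 1)],
    "Node3": [("Chicago", 1), ("Chicago", 1), ("Denver", 1),
              ("Boston", 1), ("Portland", 1)],
}

def reduce_top_destinations(node_outputs, top_n=3):
    """Sum the counts per destination across all nodes and return the top_n."""
    totals = Counter()
    for pairs in node_outputs.values():
        for destination, count in pairs:
            totals[destination] += count
    # most_common sorts by count in descending order.
    return totals.most_common(top_n)

if __name__ == "__main__":
    print(reduce_top_destinations(NODE_OUTPUTS))
```

Only the short list returned by the reduce step needs to reach the client, regardless of how many flight documents were scanned on the nodes.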

While this is an abridged example of the MapReduce process, NosDB Documentation provides an in-depth explanation of MapReduce interfaces and execution in the NosDB NoSQL database.

Conclusion

Technology has advanced so that you can use a .NET NoSQL database for advanced analytics and real-time operational results, even when processing terabytes of live production data. Using NosDB as a MapReduce engine on a .NET distributed database shortens the learning curve, because it is integrated into the .NET stack. You can get up and running with tools you already know while moving your organization into the world of big data analytics.
