Computer scientists at the Database Systems Lab, Indian Institute of Science, Bangalore, have developed an innovative tool called “CODD” for testing “big data” applications. This novel tool can help database programmers test and validate software that works on big data. CODD stands for “Constructing Dataless Databases”.
Looking into the future, big data applications will have to deal with data sets as large as yottabytes (a trillion trillion bytes), and will need massive supercomputers to process this humongous data. But CODD can test important aspects of such applications on a simple personal computer. The creators of this magic wand are a team of students led by Prof. Jayant Haritsa of the Database Systems Lab (DSL).
“Big Data” is now a hot term used in many contexts. It refers to data sets so massive and complex that traditional applications are inadequate to process them. The size of such data is of the order of petabytes, zettabytes or more, and it includes text, video, images, social media streams and other forms. Such data therefore poses challenges in analysis, capture, search, sharing, storage, transfer, visualization and privacy. Big data is commonly characterized by three “V”s: volume, velocity and variety. A large corporation can derive interesting business insights from big data. The world is faced with a deluge of it.
Any software that is built needs to work as expected. Software testing is the process in the software industry that ensures that software meets certain specifications and fulfills its intended purpose. “Testing is a vital and grossly under-appreciated aspect of the software industry. Since Big Data is ubiquitous, developing effective tools for testing is of critical importance. CODD is an important step in this direction. It is inexpensive and efficient to test futuristic scenarios. It helps make database engines future-proof,” explains Prof. Haritsa.
In reality, testing big data applications is a challenge: one needs to simulate the massive data that the software would operate upon. Providing simulated test data of the size of a yottabyte is a herculean task indeed! CODD makes this process seem simple. It can simulate the “metadata” corresponding to arbitrary futuristic “what-if” scenarios. Metadata summarizes basic information about the data. These futuristic scenarios can be input through a simple and elegant graphical interface: users need only provide the desired metadata of the data they need.
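To see why metadata makes the problem tractable, consider a minimal sketch (the class and field names here are illustrative, not CODD's actual interface): a handful of summary numbers can describe a table far larger than any disk could hold, without materializing a single row.

```java
// Illustrative sketch of data described purely by metadata -- a few
// longs stand in for a table that would never fit on a real machine.
public class TableMetadata {
    final String tableName;
    final long rowCount;     // simulated "what-if" row count
    final long distinctKeys; // cardinality of the key column
    final int avgRowBytes;   // average row width in bytes

    TableMetadata(String tableName, long rowCount, long distinctKeys, int avgRowBytes) {
        this.tableName = tableName;
        this.rowCount = rowCount;
        this.distinctKeys = distinctKeys;
        this.avgRowBytes = avgRowBytes;
    }

    // Size of the data this summary describes, in bytes. The summary
    // itself occupies only a few dozen bytes, whatever this returns.
    double describedBytes() {
        return (double) rowCount * avgRowBytes;
    }

    public static void main(String[] args) {
        // A hypothetical table with a quadrillion rows of 200 bytes each.
        TableMetadata orders = new TableMetadata("orders", 1_000_000_000_000_000L, 500_000_000L, 200);
        System.out.printf("%s describes %.1e bytes of data%n", orders.tableName, orders.describedBytes());
    }
}
```

The point of the sketch: the described data scales without bound while the description stays constant-sized, which is what lets a personal computer stand in for a supercomputer during testing.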
Apart from generating the metadata, CODD can also validate it for correctness and consistency, making sure that the data it describes make sense. The tool is written in the Java programming language and runs on almost all database platforms. CODD is also a potent tool for highlighting design “bugs” in the software. A “bug” is an error, a flaw, a failure, or a fault in a program which produces an incorrect or unexpected result. Such bugs can get introduced in the planning components of a database engine; they are hard to detect and can surface at the Big Data scale, causing havoc.
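The kind of consistency checking involved can be sketched with a couple of illustrative rules (these are textbook sanity checks on column statistics, not CODD's actual rule set): a column cannot have more distinct values than the table has rows, and a declared minimum cannot exceed the declared maximum.

```java
// Minimal sketch of metadata consistency validation (hypothetical rules).
public class MetadataValidator {
    /**
     * Returns true only if the supplied column statistics are
     * internally consistent with one another.
     */
    static boolean isConsistent(long rowCount, long distinctValues, long minValue, long maxValue) {
        if (rowCount < 0 || distinctValues < 0) return false; // negative counts are nonsense
        if (distinctValues > rowCount) return false;          // more distinct values than rows
        if (minValue > maxValue) return false;                // inverted value range
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(1_000_000L, 50_000L, 1, 99)); // plausible statistics
        System.out.println(isConsistent(1_000L, 5_000L, 1, 99));     // impossible: distinct > rows
    }
}
```

Catching such contradictions up front matters because an optimizer fed impossible statistics would silently produce meaningless test results.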
CODD can simulate futuristic scenarios and assess a query optimizer’s behaviour in response to them. The query optimizer is a core component of a database engine: it determines the most efficient method to access the requested data. At the big data scale, identifying the method that accesses a piece of data in minimal time is critical.
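How an optimizer decides from metadata alone can be sketched with a textbook-style cost comparison (the constants and cost formulas below are simplified illustrations, not the cost model of CODD or of any particular engine): given only row counts, page counts and column cardinality, it weighs a full table scan against an index lookup.

```java
// Illustrative sketch of metadata-driven access-path selection.
public class AccessPathChooser {
    /**
     * Chooses between a full scan and an index lookup for an
     * equality predicate, using only table statistics:
     * a scan reads every page, while an index pays a small fixed
     * descent cost plus one fetch per estimated matching row.
     */
    static String choosePath(long rowCount, long tablePages, long distinctValues) {
        double scanCost = tablePages;                              // read the whole table
        double matchingRows = (double) rowCount / distinctValues;  // uniform-distribution assumption
        double indexCost = 3 + matchingRows;                       // B-tree descent + row fetches
        return indexCost < scanCost ? "index" : "scan";
    }

    public static void main(String[] args) {
        // Highly selective predicate on a billion-row table: index wins.
        System.out.println(choosePath(1_000_000_000L, 10_000_000L, 100_000_000L));
        // Only two distinct values, so half the rows match: scan wins.
        System.out.println(choosePath(1_000_000_000L, 10_000_000L, 2L));
    }
}
```

Because the decision depends only on the statistics, feeding the optimizer simulated metadata for a yottabyte-scale table exercises exactly the code paths that would run against the real data.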
Equipped with these strengths, CODD has received many accolades. Its ease of use and the value it adds have garnered interest in both industry and academia, and many leading software multinationals and research universities are using it. Hewlett Packard (HP), in particular, has recognized and appreciated the work of the DSL.
Prof. Jayant R. Haritsa
Supercomputer Education and Research Centre (SERC) and Department of Computer Science and Automation (CSA)
Indian Institute of Science, Bangalore, India
Prof. Jayant R. Haritsa is a Senior Professor at the Computer Science and Automation Department, IISc, Bangalore. A paper on CODD by Ashoke S, a former master’s student at the same department, and Prof. Haritsa was recently published in the Proceedings of the 41st International Conference on Very Large Data Bases (VLDB), 2015.