VU BreakThru

Update on “A Trans-Institutional Big Data Infrastructure at Vanderbilt”

Posted on Thursday, January 19, 2017 in News, TIPs 2015.

Written by Will French, ACCRE’s Manager of Research Computing Operations

The world is drowning in data. This is the reality that has become increasingly apparent across virtually all areas of industry, medicine, business, higher education, and research, among other sectors, over the last several years. In 2012, IBM stated that 2.5 exabytes (i.e. 2.5 billion gigabytes) of data were being generated worldwide per day. The challenge of dealing with “big data” – that is, effectively managing the flood of data being generated and extracting useful information from it – is undoubtedly one of the major challenges of the 21st century.

A number of researchers at Vanderbilt are actively exploring methods for tackling big data problems. While the term “big data” can take on many different meanings for distinct research domains and areas of inquiry, the inherent challenge is often the same: how does one increase the scale of data without running a computer out of memory, or waiting days or weeks for an analysis process to complete? Many methods for data storage, management, and analysis simply do not work as efficiently or effectively once the size of data crosses some critical threshold. Web companies such as Google and Yahoo began encountering these challenges in the early 2000s and developed new technologies that are only now making their way into academic research.

With support from the university, the Advanced Computing Center for Research and Education (ACCRE) is building a data storage and computing environment designed and optimized for big data. The project launched in 2015 thanks in large part to a Vanderbilt Trans-Institutional Program (TIPs) award. The new environment is centered on the Hadoop ecosystem, which is widely used at large web companies and within the data science industry today. More than 100 Vanderbilt and VUMC research groups currently use ACCRE resources for a variety of demanding storage, backup, and computing applications. However, the current ACCRE environment was designed with assumptions about how data are stored and analyzed that may limit its ability to efficiently tackle big data problems. The new environment is being designed specifically with the challenges of big data in mind.
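For readers unfamiliar with Hadoop, the programming model it popularized, MapReduce, can be sketched in a few lines of Python. This is only an illustration of the idea on a single machine, not ACCRE's configuration or code from any course; the function names and the sample text are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(chunks):
    """Count words chunk by chunk, then merge the partial results."""
    partials = (map_chunk(chunk) for chunk in chunks)
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample data: each inner list stands in for one chunk of a large file.
chunks = [
    ["big data at Vanderbilt", "data storage"],
    ["big data analysis"],
]
totals = word_count(chunks)
print(totals.most_common(2))
```

In this sketch only one chunk is processed at a time and only the small partial counts are kept. On a Hadoop cluster the chunks would be spread across many machines and the map step would run where each chunk is stored, so no single computer has to hold the entire data set.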

ACCRE is currently managing a test Hadoop system that is accessed by a number of researchers and students across campus, allowing them to test the benefits of the new system. The test system also allows ACCRE staff to gain the expertise needed to run and manage the system, while also providing invaluable insight into the specific challenges within each researcher’s applications. A new production-scale system will be purchased and deployed to the campus over the next few years. Once available, the production system will open up research opportunities for problems that were previously inaccessible. Additionally, researchers already making use of the Hadoop ecosystem can continue their analysis but at a significantly larger scale.

One of the largest users of the current environment is a group of over 50 students in Professor Daniel Fabbri’s course in big data (CS 4266/5266). This is the third year in a row that ACCRE has managed the environment for Professor Fabbri’s course. Students access the system remotely, take part in detailed training sessions led by Professor Fabbri and ACCRE staff, and complete homework assignments on big data problems, such as mining data on Wikipedia to search for interesting trends in topics or content. These experiences are invaluable for Vanderbilt students, said Fabbri: “Course projects previously were limited to small data sets. Now, with the cluster, students are analyzing much larger data sets and are able to study real-world problems.” This real-world experience is a huge benefit of using the Hadoop environment managed by ACCRE. “Students can now experience the power and pain of working with big data sets with the cluster. These experiences will make Vanderbilt students extremely competitive in the job market,” notes Fabbri.

In addition to building a new big data system, a significant portion of the ACCRE TIPs project is devoted to creating immersive educational experiences for undergraduate students through a ten-week ACCRE Scholars summer research program, in which students use ACCRE resources to complete a research project with a Vanderbilt faculty member. This past summer, six Vanderbilt undergraduate students participated in the inaugural program, working on topics ranging from electronic structure calculations to single nucleotide polymorphisms and computational fluid dynamics. Two of the summer students also had the opportunity to travel to the annual Supercomputing conference held in Austin, Texas, to present their summer research projects and interact with other students and researchers in the community. This summer, ACCRE plans to host a similar number of students as it continues to build and develop this exciting new big data environment.

Be sure to visit the blog page often for updates on our TIPs project. We also encourage you to leave comments or ask questions in the space provided below.

