Utilizing the Big Data Infrastructure at Vanderbilt

Posted on Monday, October 2, 2017 in News, TIPs 2015.

Written by Josh Arnold, Application Developer at ACCRE

The Advanced Computing Center for Research and Education (ACCRE) is building a production-scale big data computing environment, thanks in large part to a 2015 Vanderbilt Trans-Institutional Programs (TIPs) award. In the meantime, Vanderbilt researchers have spent the past several months using ACCRE’s proof-of-concept big data test environment. ACCRE currently manages a test Hadoop environment that gives Vanderbilt researchers and ACCRE administrators valuable experience with these relatively new technologies.

One of the great advantages of the big data world is the rich ecosystem of tools built around the core libraries. Two research groups are using Hadoop extensions to their normal software stacks to tackle high-profile questions in medicine. Professor Andries Zijlstra has long used KNIME to analyze images of human tissue cells to aid in the diagnosis of certain types of cancer, and is now able to leverage the Big Data extensions to increase throughput. Professor John Graves in the Department of Health Policy and the Department of Medicine is exploring how to integrate existing code written in the popular statistical software R with the SparkR library, which presents Spark’s resilient distributed dataset (the core data structure for Spark computations) as the familiar R data frame. Prof. Graves aims to run highly parallelized stochastic simulations to predict patient outcomes based on drug effectiveness.

The Hadoop ecosystem is rich with native tools as well, allowing researchers to write clean and concise code in Python, Scala, and Java. Two Vanderbilt research groups in particular are using the test Hadoop cluster for large-scale analytics with these native tools.

Professor Gene LeBoeuf’s water resources group is currently modeling two reservoirs on the Cumberland River using the high-fidelity, two-dimensional hydrodynamic and water quality model, CE-QUAL-W2, written in Fortran. To identify uncertainties in the model, they are performing a generalized sensitivity analysis, randomly sampling the input parameters over thousands of simulations and quantifying the model’s response. Although the CE-QUAL-W2 model runs on specialized hardware, Prof. LeBoeuf’s group uses PySpark on Hadoop to aggregate the model outputs and to analyze their dependence on the input parameters.
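
To make the idea of sampling inputs and quantifying the response concrete, here is a minimal sketch in plain Python. It is not the group's workflow: the toy_model function, the three parameter names, and their ranges are hypothetical stand-ins for CE-QUAL-W2, and a simple Pearson correlation stands in for a full generalized sensitivity analysis. On a cluster, the aggregation step would be distributed with PySpark rather than run in a single loop.

```python
import random
import statistics


def toy_model(flow, temperature, decay):
    """Hypothetical stand-in for one simulation run (NOT CE-QUAL-W2)."""
    return 2.0 * flow + 0.5 * temperature - 3.0 * decay


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def sensitivity(n_runs=2000, seed=0):
    """Sample inputs at random, run the model, and correlate inputs with output."""
    rng = random.Random(seed)
    # Assumed sampling ranges; a real study would use physically meaningful bounds.
    ranges = {"flow": (0.0, 1.0), "temperature": (0.0, 1.0), "decay": (0.0, 1.0)}
    samples = {name: [] for name in ranges}
    outputs = []
    for _ in range(n_runs):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        for name, value in params.items():
            samples[name].append(value)
        outputs.append(toy_model(**params))
    # A larger absolute correlation suggests the output depends more on that input.
    return {name: pearson(values, outputs) for name, values in samples.items()}


if __name__ == "__main__":
    for name, score in sensitivity().items():
        print(f"{name}: {score:+.2f}")
```

In this toy setup the output responds most strongly to decay and least to temperature, which is the kind of ranking a sensitivity analysis is meant to surface.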

Professor Catherine F. Lee of the Owen Graduate School of Management has teamed up with Clifford Anderson of the Heard Library to understand the relationship between the language of earnings calls and financial analysts’ quarterly earnings projections using Spark’s native Scala API. Drawing on more than 200,000 transcripts of earnings calls spanning over two decades, their analysis combines conventional dictionary-based sentiment-scoring techniques with state-of-the-art natural language processing methods to map human conversations to quantifiable features that may indicate how an analyst will project company performance.
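
For readers unfamiliar with dictionary-based sentiment scoring, the sketch below shows the basic idea in plain Python. It is an illustration only: the project described above uses Spark’s Scala API, and the two word lists and the sentiment_score function here are hypothetical, far smaller and simpler than any dictionary used in real financial text analysis.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
POSITIVE = {"growth", "strong", "improved", "record"}
NEGATIVE = {"decline", "weak", "loss", "uncertain"}


def sentiment_score(text):
    """Return (positive - negative) / matched words, or 0.0 if nothing matches."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    matched = positive + negative
    return (positive - negative) / matched if matched else 0.0


if __name__ == "__main__":
    print(sentiment_score("Strong growth and record revenue this quarter"))
    print(sentiment_score("Weak demand led to a loss despite improved margins"))
```

The score ranges from -1.0 (only negative words matched) to 1.0 (only positive words matched). Applied to each transcript, a score like this becomes one numeric feature among many.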

As the big data computing paradigm continues to gain momentum, ACCRE doesn’t anticipate the demand for big data infrastructure slowing any time soon. In particular, genomics, scalable machine learning, and natural language processing are three areas where we anticipate big data adding great value to Vanderbilt research. If you’d like to learn more about how you can use the test and production big data clusters, leave a question or comment below and check out the ACCRE website.

