
Theoretical and Empirical Comparison of Big Data Image Processing with Apache Hadoop and Sun Grid Engine

Posted on Tuesday, November 1, 2016 in Big Data, Cloud Computing.

Shunxing Bao, Frederick D. Weitendorf, Andrew J. Plassard, Yuankai Huo, Aniruddha Gokhale, Bennett A. Landman. “Theoretical and Empirical Comparison of Big Data Image Processing with Apache Hadoop and Sun Grid Engine”. Orlando, Florida, February 2017. Oral presentation.


Abstract

Traditional large-scale processing uses a cluster computer that combines a group of workstation nodes into a functional unit controlled by a job scheduler. Data transfer from storage to processing nodes can saturate the network when data is frequently uploaded to or retrieved from the NFS. An alternative approach using Hadoop and HBase is presented for medical imaging to enable co-location of data storage and computation while minimizing data transfer. Theoretical models for wall-clock time and resource time for both approaches are introduced and empirically validated. A comparative analysis is presented for when the Hadoop framework will be relevant for medical imaging.
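To give a feel for the tradeoff such models capture, consider a back-of-envelope sketch (the symbols below are illustrative only, not the paper's actual notation): suppose each job moves \(D\) bytes over a shared NFS link of effective bandwidth \(B\) while \(N\) jobs run concurrently. Then, roughly,

\[
T_{\mathrm{SGE}} \approx T_{\mathrm{compute}} + \frac{N \cdot D}{B},
\qquad
T_{\mathrm{Hadoop}} \approx T_{\mathrm{compute}} + T_{\mathrm{local\,I/O}},
\]

so the transfer term for the NFS-backed cluster grows with the number of concurrent jobs, while data-local execution keeps it near the local disk rate. The crossover between the two regimes is what the comparative analysis pins down.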


Hadoop and SGE data retrieval, processing, and storage workflow based on Multi-atlas CRUISE (MaCRUISE) segmentation [14, 15]. The data in an HBase table is approximately balanced across the RegionServers. Each RegionServer is collocated with a Hadoop DataNode to fully exploit data collocation and locality [7]. We design our proposed computation models using only the map phase of Hadoop's MapReduce [13]. In this phase, data is retrieved locally; if results were passed on to a reduce phase, additional data movement would occur, because the reduce phase does not guarantee that a process operates on local data. Within the map phase, all necessary data is retrieved to a local directory and processed by locally installed command-line executables, and the results are then uploaded back to HBase. For SGE, the user submits a batch of jobs to a submit host, which dispatches them to execution hosts. Each execution host retrieves its data from a shared NFS and stores the results back to the NFS.
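The map-only pattern the caption describes can be made concrete with a short sketch using the standard HBase MapReduce utilities (assumptions: a table named "images" with a "data" column family, and a placeholder where the external executable would be invoked; this is illustrative, not the authors' actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlySegmentationJob {

  // Runs entirely in the map phase: reads a row from the (ideally local)
  // RegionServer, hands the bytes to an external tool, and emits a Put
  // so the result lands back in HBase.
  public static class SegmentMapper
      extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws java.io.IOException, InterruptedException {
      byte[] image = value.getValue(Bytes.toBytes("data"), Bytes.toBytes("nifti"));
      // Hypothetical step: write `image` to local scratch, invoke the
      // locally installed command-line pipeline (e.g. via ProcessBuilder),
      // and read the segmentation result back.
      byte[] segmented = image; // placeholder for the external call
      Put put = new Put(row.get());
      put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("result"), segmented);
      ctx.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "map-only MaCRUISE-style segmentation");
    job.setJarByClass(MapOnlySegmentationJob.class);
    // Feed mappers directly from the HBase table so tasks can be scheduled
    // next to the RegionServer that holds their rows.
    TableMapReduceUtil.initTableMapperJob(
        "images", new Scan(), SegmentMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    // Write mapper output straight back to the table; no reducer runs.
    TableMapReduceUtil.initTableReducerJob("images", null, job);
    job.setNumReduceTasks(0); // skip the reduce phase entirely
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Setting the reduce task count to zero is what avoids the extra shuffle traffic the caption warns about. On the SGE side, the rough equivalent would be a qsub array job whose tasks read their inputs from and write their outputs to the shared NFS mount.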