Building scalable indexes that can be efficiently queried

Date/Time
Date(s) - 03/27/2023
3:00 pm - 4:00 pm

Location
Communicore, C1-17

Christina Boucher, Ph.D., Associate Professor, Department of Computer and Information Science and Engineering, University of Florida

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. We later showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index alone cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in time and space linear in the size of the prefix-free parse. Our implementation can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared to existing methods, ours used 2 to 11 times less memory and was 2 to 32 times faster for index construction. Moreover, for large collections of human chromosomes, our index was less than one thousandth the size of competing indexes.
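To make the MEM query concrete, here is a minimal Python sketch of what a maximal exact match is. It is a naive, quadratic illustration for short strings only, not the r-index or threshold-based method described in the abstract; the function names `longest_match_lengths` and `find_mems` and the `min_len` parameter are hypothetical and chosen for this example.

```python
def longest_match_lengths(pattern, text):
    """For each start i, length of the longest prefix of pattern[i:] that occurs in text."""
    lengths = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        lengths.append(length)
    return lengths


def find_mems(pattern, text, min_len=1):
    """Return (start, length) pairs of maximal exact matches of pattern in text.

    Naive illustration only: a real index answers this without scanning text.
    """
    lengths = longest_match_lengths(pattern, text)
    mems = []
    for i, length in enumerate(lengths):
        if length < min_len:
            continue
        # Each match is right-maximal by construction. It is also left-maximal
        # when the match starting one position earlier does not cover it.
        if i == 0 or lengths[i - 1] <= length:
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    read = "TTAGAT"
    reference = "GATTACA"
    for start, length in find_mems(read, reference):
        print(start, read[start:start + length])
```

Each reported match cannot be extended by one character to the left or right and still occur in the reference, which is the property that makes MEMs useful as seeds for aligning reads.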

Bio:

Dr. Boucher is an Associate Professor in the Department of Computer and Information Science and Engineering at the University of Florida. She has over 125 publications in bioinformatics, several dozen of them in succinct data structures and/or alignment. She has given keynote addresses at the 2022 WABI Pangenomics workshop, HICOMB 2022, IGGSY 2022, SPIRE 2021, RECOMB-SEQ 2016, and the ECCB 2016 Workshop on Pan-Genomics. She is a recipient of an ESA 2016 Best Paper Award. She oversees the development and maintenance of several software tools, including MEGARes and AMRPlusPlus, METAMarc, Kohdista, Vari, VariMerge, and, most recently, Moni. In addition, she has built a team of collaborators in various biomedical sciences, including microbiology, veterinary medicine, epidemiology, public health, and clinical sciences. Her lab receives funding from NIH, NSF, and USDA.

She also actively works on increasing diversity in bioinformatics education. Her efforts include being a member of the University of Florida’s Implicit Bias committee, being a panellist for the NSF-funded ACM BCB 2015 Women in Bioinformatics meeting, serving as a faculty advisor for an ACM-W chapter, and being an active member of the Diversity Committee for over three years. She also received a fellowship from The Institute for Learning and Teaching (TILT) for her course redevelopment and served on the advisory committee for an NSF Research Traineeships Program.

She was the PC chair for several conferences, including WABI 2022, SPIRE 2020, RECOMB-SEQ 2019, and ACM-BCB 2018. Most recently, she was nominated to serve on the NIH BDMA Study Section as a Standing Member and as a member of the Executive Board of ACM SIG BIO.