Opening Books and the National Corpus of Graduate Research

Virginia Tech University Libraries, in collaboration with Virginia Tech Computer Science and Old Dominion University Computer Science, will bring computational access to book-length documents, through a research and piloting effort employing Electronic Theses and Dissertations (ETDs). The library and archives fields lack research on extracting and analyzing segments of long documents (chapters, reference lists, tables, figures), as well as methods for summarizing individual chapters of longer texts to enable findability. The project brings cutting-edge computer science and machine learning technologies to advance discovery, use, and potential for reuse of the knowledge hidden in the text of books and book-length documents. By focusing on libraries' ETD collections, the research will enhance libraries' ETD programs, devising effective and efficient methods for opening the knowledge currently hidden in the rich body of graduate research and scholarship.

ETD_Project_Overview — Stakeholders and ETD services

The following are the research questions that we aim to answer:

RQ1: How can we effectively identify and extract key parts (chapters, sections, tables, figures, citations),
in both born digital and page image formats?
RQ2: How can we develop effective automatic classification as well as chapter summarization techniques?
RQ3: How can our ETD digital library most effectively serve stakeholders?

Our methods can potentially be applied to a vast amount of other long documents. The services can benefit many types of stakeholders (as indicated in the figure).

Grant Details

Members