PDF Extraction WebApp
Background
When I started my secondment, the scientist in charge here at Mendeley and I agreed on two topics that I will work on in the coming months:
- Extraction of additional metadata of the authors of a paper, i.e. affiliation and e-mail addresses
- Extraction of keywords in the full-text of academic publications
Approach
At the last workshop in February in Graz I presented my progress in the area of text extraction from PDF documents. The main approach has been:
- Build a stack of unsupervised machine learning algorithms (clustering) to detect text blocks
In the meantime I have been building another processing layer on top of the Ildefonso-Clustering text block algorithm to annotate the text from the PDF with the appropriate metadata type. My approach has been:
- Build a stack of supervised machine learning algorithms (classification) to detect metadata within the text blocks
In the first processing step each text block is assigned one of these classes: title, subtitle, author, email, affiliation, journal, abstract, other. Then the blocks from the author-related classes (author, email, affiliation) are fed to a second, token-based classification. Each token is assigned one of these labels: givenname, middlename, surname, index, separator, affiliation, email, other.
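The routing between the two classification stages can be sketched as follows. This is only an illustration of the control flow described above, not the actual implementation: the names `AnnotatedBlock`, `annotate`, `classify_block` and `classify_tokens` are hypothetical, the classifiers are passed in as plain callables, and splitting on whitespace is an assumed stand-in for the real tokenization.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Class and label sets as listed in the text.
BLOCK_CLASSES = {"title", "subtitle", "author", "email",
                 "affiliation", "journal", "abstract", "other"}
AUTHOR_RELATED = {"author", "email", "affiliation"}
TOKEN_LABELS = {"givenname", "middlename", "surname", "index",
                "separator", "affiliation", "email", "other"}


@dataclass
class AnnotatedBlock:
    """A text block with its class and, if author-related, token labels."""
    text: str
    block_class: str
    token_labels: List[Tuple[str, str]] = field(default_factory=list)


def annotate(
    blocks: Sequence[str],
    classify_block: Callable[[str], str],               # hypothetical stage 1
    classify_tokens: Callable[[List[str]], List[str]],  # hypothetical stage 2
) -> List[AnnotatedBlock]:
    """Run stage 1 on every block, stage 2 only on author-related blocks."""
    annotated = []
    for text in blocks:
        block_class = classify_block(text)
        if block_class not in BLOCK_CLASSES:
            raise ValueError(f"unknown block class: {block_class}")
        result = AnnotatedBlock(text, block_class)
        if block_class in AUTHOR_RELATED:
            tokens = text.split()  # assumption: whitespace tokenization
            labels = classify_tokens(tokens)
            if len(labels) != len(tokens) or not set(labels) <= TOKEN_LABELS:
                raise ValueError("token classifier returned invalid labels")
            result.token_labels = list(zip(tokens, labels))
        annotated.append(result)
    return annotated
```

Blocks outside the author-related classes keep an empty `token_labels` list, so the second classifier never sees titles, abstracts or journal lines.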
For keyword extraction and summarization I employed the enhanced TextRank algorithm. An evaluation of this approach can be found in TEAM deliverable 1.4.
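For readers unfamiliar with TextRank, here is a minimal sketch of the plain algorithm for keyword extraction. It is not the enhanced variant used here, whose details are in the deliverable. The function name, the tiny stop-word list (standing in for the part-of-speech filter of the original algorithm), the window size and the iteration count are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import List

# Assumption: a small stop-word list replaces a proper part-of-speech filter.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are",
             "was", "were", "has", "have", "been", "can", "not", "but"}


def textrank_keywords(text: str, window: int = 2, damping: float = 0.85,
                      iterations: int = 50, top_k: int = 5) -> List[str]:
    """Rank words by PageRank on an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) > 2 and w not in STOPWORDS]

    # Connect each word to the words following it inside the window.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Fixed number of PageRank updates on the unweighted graph.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Words that co-occur with many different neighbours collect the highest scores. Summarization works along the same lines with sentences as graph nodes and a sentence similarity as edge weight.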
Demonstration
Finally I have implemented a small web-application to demonstrate the approach, which can be found here:
It should be straightforward to use this little web application, but please do try switching between the different models (combo box in the upper right corner of the PDF display screen). If you find PDFs that work particularly well or badly, or you have any suggestions for improvement (algorithm, web app, …), please let me know.
Evaluation
Finally, here is a short overview of the performance of the classifier stack based on a subset of PubMed. From this subset, 1000 publications were randomly selected for training and another 1000 documents for testing.
Text Block Classification
- abstract: 1993 count, 0.92 precision, 0.82 recall
- journal: 316 count, 0.82 precision, 0.80 recall
- author: 949 count, 0.95 precision, 0.89 recall
- title: 915 count, 0.94 precision, 0.92 recall
- other: 11349 count, 0.94 precision, 0.96 recall
- email: 574 count, 0.95 precision, 0.91 recall
- affiliation: 783 count, 0.87 precision, 0.84 recall
These are just the pure classification results. Since not all PDFs contain the relevant information on the first page, the real-world numbers for titles differ. Allowing a maximum edit distance of max(1, floor(len(title)/10)), they are:
979 found, 0.941 precision, 0.921 recall, 0.945 max-recall
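The title tolerance above can be sketched like this. The helper names are hypothetical, and the sketch assumes that the edit distance is the Levenshtein distance and that len(title) refers to the ground-truth title; neither is spelled out above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def title_matches(extracted: str, truth: str) -> bool:
    """Accept an extracted title within max(1, floor(len(title)/10)) edits."""
    allowed = max(1, len(truth) // 10)  # assumption: length of the true title
    return levenshtein(extracted, truth) <= allowed
```

With this rule a 22-character title may differ in up to two characters, while titles shorter than 20 characters still get one edit of slack.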
Author Metadata Classification
- index: 4543 count, 0.93 precision, 0.97 recall
- other: 28308 count, 0.94 precision, 0.80 recall
- email: 1425 count, 0.95 precision, 0.99 recall
- middle-name: 2066 count, 0.94 precision, 0.84 recall
- surname: 5638 count, 0.91 precision, 0.91 recall
- affiliation: 23702 count, 0.83 precision, 0.95 recall
- given-name: 5578 count, 0.93 precision, 0.96 recall
- separator: 635 count, 0.86 precision, 0.94 recall
Again, the real-world results deviate from these numbers, because the errors from the text extraction and the text block classification now add up. The performance for author names is as follows (here not a single wrong character is allowed, only the given name may be abbreviated):
max-recall = 0.876
- surname: 4638 count, 0.88 precision, 0.84 recall
- givenNames: 4638 count, 0.86 precision, 0.82 recall
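The strict name comparison could look roughly like the sketch below. The function names are hypothetical, and what counts as an abbreviation is an assumption: here a single initial, with or without a trailing dot, on either side of the comparison.

```python
from typing import Tuple


def _is_initial_of(short: str, full: str) -> bool:
    """True if `short` is a one-letter initial ("J" or "J.") of `full`."""
    core = short.rstrip(".")
    return len(core) == 1 and full.startswith(core)


def given_name_matches(extracted: str, truth: str) -> bool:
    """Exact match, or one side is the initial of the other (assumption)."""
    return (extracted == truth
            or _is_initial_of(extracted, truth)
            or _is_initial_of(truth, extracted))


def author_matches(extracted: Tuple[str, str], truth: Tuple[str, str]) -> bool:
    """Compare (given name, surname) pairs; the surname must be identical."""
    return (extracted[1] == truth[1]
            and given_name_matches(extracted[0], truth[0]))
```

Under these assumptions ("J.", "Doe") matches ("Jane", "Doe"), while ("Jan", "Doe") does not, because a partial given name is neither identical nor an initial.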
Needless to say, the ground truth does not always contain the correct information either (misspellings in the author names, …).