PDF Extraction WebApp

Posted on Mar 13, 2012 at 12:30 pm
 

Background

When I started my secondment, the scientist in charge here at Mendeley and I agreed on two topics that I would work on in the coming months:

  • Extraction of additional metadata about the authors of a paper, i.e. affiliations and e-mail addresses
  • Extraction of keywords from the full text of academic publications

Approach

At the last workshop in Graz in February, I presented my progress in the area of text extraction from PDF documents. The main approach has been:

  • Build a stack of unsupervised machine learning algorithms (clustering) to detect text blocks

In the meantime I have been busy building another processing layer on top of the Ildefonso-Clustering text block algorithm to annotate the text from the PDF with the appropriate metadata type. My approach has been:

  • Build a stack of supervised machine learning algorithms (classification) to detect metadata within the text blocks

In the first processing step, each text block is assigned one of these classes: title, subtitle, author, email, affiliation, journal, abstract, other. The blocks from the author-related classes (author, email, affiliation) are then fed to a second, token-based classification. Each token is assigned one of these labels: givenname, middlename, surname, index, separator, affiliation, email, other.
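The two-stage flow described above can be sketched roughly as follows. This is only an illustration of how blocks are routed, not the actual implementation: the function names, the whitespace tokenization, and the toy stand-in classifiers are all assumptions, and the real classifiers are trained models.

```python
from typing import Callable, Dict, List, Sequence

# Block classes whose text is passed on to the token-level stage.
AUTHOR_RELATED = {"author", "email", "affiliation"}


def annotate_blocks(
    blocks: Sequence[str],
    classify_block: Callable[[str], str],
    classify_tokens: Callable[[List[str]], List[str]],
) -> List[Dict]:
    """Stage 1 labels every block; stage 2 labels the tokens of author-related blocks."""
    results = []
    for text in blocks:
        block_class = classify_block(text)
        token_labels = []
        if block_class in AUTHOR_RELATED:
            tokens = text.split()  # assumption: simple whitespace tokenization
            token_labels = list(zip(tokens, classify_tokens(tokens)))
        results.append({"text": text, "class": block_class, "tokens": token_labels})
    return results


# Hypothetical stand-ins for the trained classifiers, for demonstration only.
def toy_block_classifier(text: str) -> str:
    return "email" if "@" in text else "other"


def toy_token_classifier(tokens: List[str]) -> List[str]:
    return ["email" if "@" in token else "other" for token in tokens]


annotated = annotate_blocks(
    ["Contact: jane@example.org", "Some body text"],
    toy_block_classifier,
    toy_token_classifier,
)
```

Passing the classifiers in as plain callables keeps the routing logic independent of whichever models are plugged in.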

For keyword extraction and summarization I employed the enhanced TextRank algorithm. An evaluation of this approach can be found in TEAM deliverable 1.4.
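For readers unfamiliar with TextRank, here is a minimal sketch of the basic idea: build a word co-occurrence graph and rank the words with a PageRank-style iteration. It shows plain TextRank in simplified form, not the enhanced variant used here; the function name, the regex tokenization, the stopword handling, and all parameter defaults are assumptions made for illustration.

```python
import re
from collections import defaultdict


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=5,
                      stopwords=frozenset()):
    """Rank words by a PageRank-style score on an undirected co-occurrence graph."""
    # Assumption: lowercase alphabetic tokens, stopwords dropped before windowing.
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]

    # Connect each word to the next `window` words that follow it.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Synchronous power iteration for a fixed number of rounds.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        updated = {}
        for word in neighbours:
            rank = sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            updated[word] = (1 - damping) + damping * rank
        scores = updated

    return sorted(scores, key=scores.get, reverse=True)[:top_k]


top = textrank_keywords(
    "graph ranking graph clustering graph extraction", window=1, top_k=1
)
```

In this toy input every other word is adjacent only to "graph", so "graph" ends up as the highest-ranked word.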

Demonstration

Finally I have implemented a small web-application to demonstrate the approach, which can be found here:

The web application should be straightforward to use, but please try switching between the different models (combo box in the upper right corner of the PDF display screen). If you find PDFs that work particularly well or badly, or if you have any suggestions for improvement (algorithm, web app, …), please let me know.

Evaluation

Here is a short overview of the performance of the classifier stack, based on a subset of PubMed. From this subset, 1000 publications were randomly selected for training and another 1000 documents for testing.

Text Block Classification

class          count    precision    recall
abstract       1993     0.92         0.82
journal        316      0.82         0.80
author         949      0.95         0.89
title          915      0.94         0.92
other          11349    0.94         0.96
email          574      0.95         0.91
affiliation    783      0.87         0.84

These are just the pure classification results. Because not all PDFs contain the relevant information on the first page, the real-world numbers differ. For titles, allowing a maximum edit distance of max(1, floor(len(title)/10)), they are:

979 found, 0.941 precision, 0.921 recall, 0.945 max-recall
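The tolerance rule for titles can be illustrated with a short sketch. It assumes that "edit distance" means the Levenshtein distance and that len(title) refers to the ground-truth title; neither is stated explicitly above, and the function names are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def title_matches(extracted: str, reference: str) -> bool:
    """Accept a title if it is within max(1, floor(len(title)/10)) edits."""
    # Assumption: the tolerance is derived from the reference (ground-truth) title.
    allowed = max(1, len(reference) // 10)
    return levenshtein(extracted, reference) <= allowed


ok = title_matches("PDF Extraction WebApp", "PDF Extraction Web App")
```

With this rule, very short titles still tolerate one edit, while a 50-character title tolerates up to five.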

Author Metadata Classification

label          count    precision    recall
index          4543     0.93         0.97
other          28308    0.94         0.80
email          1425     0.95         0.99
middle-name    2066     0.94         0.84
surname        5638     0.91         0.91
affiliation    23702    0.83         0.95
given-name     5578     0.93         0.96
separator      635      0.86         0.94

Again, the real-world results deviate from these numbers, since the errors from the text extraction and the text block classification now stack up. The performance for author names is as follows (not a single wrong character is allowed here; only the given name may be abbreviated):

max-recall = 0.876

field          count    precision    recall
surname        4638     0.88         0.84
givenNames     4638     0.86         0.82
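The strict matching rule for author names could look roughly like the sketch below. The exact abbreviation handling is not described above, so this assumes that an abbreviated given name is a single initial with an optional period, and that either side of the comparison may be the abbreviated one. All names are hypothetical.

```python
def is_abbreviation_of(short: str, full: str) -> bool:
    """True if `short` is a single initial (optionally followed by '.') of `full`."""
    initial = short.strip().rstrip(".")
    return len(initial) == 1 and full.strip().startswith(initial)


def given_name_matches(extracted: str, reference: str) -> bool:
    """Exact match, or one side is an initial of the other (assumed rule)."""
    if extracted.strip() == reference.strip():
        return True
    return (is_abbreviation_of(extracted, reference)
            or is_abbreviation_of(reference, extracted))


def author_matches(extracted: tuple, reference: tuple) -> bool:
    """Compare (given name, surname) pairs; the surname must match exactly."""
    return (extracted[1] == reference[1]
            and given_name_matches(extracted[0], reference[0]))


abbreviated_ok = author_matches(("J.", "Smith"), ("John", "Smith"))
```

Under this rule a single wrong character in the surname, or in a given name that is spelled out in full, counts as a miss.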

Needless to say, the ground truth does not always contain the correct information either (misspellings in the author names, …).
