Proposal for "PDF Reader"

» Metadata
  • Category: File Formats
  • Proposer: John Stokes
  • License: BSD Style
» Status
  • Status: Proposed
» Description
PDF Reader extracts raw text from a PDF file and returns it as an array of strings.

I've seen many solutions for generating PDF files, but few for reading PDF files as a data source. To my knowledge, no PHP-native solution reads PDF files newer than version 1.4.

PDF Reader supports PDF versions up to 1.7, including AcroForms (aka FDF), and is written as a PHP 5 class tree. It returns raw text as an array of strings or form fields as an associative array of key/value pairs.

I have no plans to extract images or layout metadata at this time, nor do I plan to support signed or encrypted PDFs unless there's demand.
» Dependencies
  • Linux (May work on Windows, but untested)
  • PHP >= 5.1
» Timeline
  • First Draft: 2010-07-22
  • Proposal: 2010-07-29
» Changelog
  • John Stokes
    [2010-08-03 23:55 UTC]

    - Fixed a bug in which a string without an ET operator would have zero length
    - Added support for hexadecimal strings embedded in normal strings
    - Fixed a bug in which some line breaks were ignored
    - Standardized regular expressions for primitive data types as constants
    - Shortened lines and adjusted switch/case indents for PEAR compliance
    - Moved extractText routines from PDFobject class to PDFpage class in order to assemble multiple content streams
    - Refactored to use a single PDFdecoder instance

    - Initial proposal
    - Included basic support for text and form field extraction
    - Some known bugs with character mapping for non-standard fonts
  • John Stokes
    [2010-08-05 21:51 UTC]

    - Fixed known bug with character mapping for non-standard fonts
    - Added limited support for text matrices
  • John Stokes
    [2010-08-27 19:27 UTC]

    - Fixed bug in which Marked Content operators would sometimes appear in form fields
    - Removed "exit" and "die" statements for PEAR compliance
    - Added error trap for absence of `gzip`
    - Implemented package-specific Exceptions
    - Name changed to File_PDFreader for PEAR compliance
  • John Stokes
    [2010-12-29 21:21 UTC]

    - Added support for cascaded filters
    - Fixed a bug in which stream object dictionaries may be parsed incorrectly
    - Restructured directories for PEAR compliance
    - Added package.xml for PEAR installer
  • John Stokes
    [2011-01-05 23:00 UTC]

    - Added support for StandardEncoding
    - Added support for PDFDocEncoding
    - Added support for MacExpertEncoding
    - Added support for ASCIIHexDecode filter
    - Added support for ASCII85Decode filter
  • John Stokes
    [2011-02-03 01:47 UTC]

    - Fixed a bug in which parent fields without field types threw an exception
    - Fixed a bug in which PDF arrays in form labels were not parsed correctly
    - Fixed a bug in parsing binary Field Flags
    - Brought into E_STRICT compliance
    - All output is now UTF-8 encoded (Thanks to Christoph Runkel for this tip)
  • John Stokes
    [2011-04-21 19:59 UTC]

    - Added a feature to handle nested Page arrays (Thanks again to Christoph Runkel)
    - Added a feature to decode hex-encoded form field keys
    - Added two new methods to the API: readTextByPage($pageNum), which lets users read one page at a time for large PDFs, and getPages(), which returns the total number of pages in the document
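
The page-by-page methods in the last changelog entry suggest a simple loop for large documents. The sketch below is only an illustration. `getPages()` and `readTextByPage($pageNum)` are the method names given in the changelog, but the `PageReaderSketch` interface, the `ArrayPageReaderSketch` stand-in, the `countLinesPerPage()` helper, and the assumption that pages are numbered from 1 are all hypothetical and not taken from the package.

```php
<?php
/**
 * Hypothetical interface: only the two method names come from the changelog.
 */
interface PageReaderSketch
{
    public function getPages();
    public function readTextByPage($pageNum);
}

/**
 * Stand-in reader backed by a plain array, so the loop can run without a PDF.
 * Each page is an array of strings, matching the "array of strings" output
 * described in the proposal.
 */
class ArrayPageReaderSketch implements PageReaderSketch
{
    private $pages;

    public function __construct(array $pages)
    {
        $this->pages = array_values($pages);
    }

    public function getPages()
    {
        return count($this->pages);
    }

    public function readTextByPage($pageNum)
    {
        // Assumption: page numbers start at 1 (the proposal does not say).
        if ($pageNum < 1 || $pageNum > count($this->pages)) {
            throw new OutOfRangeException("No such page: $pageNum");
        }
        return $this->pages[$pageNum - 1];
    }
}

/**
 * Visits one page at a time so only a single page's text is held at once.
 * Returns an array mapping page number to the number of strings on that page.
 */
function countLinesPerPage(PageReaderSketch $reader)
{
    $counts = array();
    $total  = $reader->getPages();
    for ($page = 1; $page <= $total; $page++) {
        $lines         = $reader->readTextByPage($page);
        $counts[$page] = count($lines);
        unset($lines); // drop this page's text before reading the next one
    }
    return $counts;
}
```

With a real reader in place of the stand-in, the same loop could write each page to disk or a database as it goes instead of counting lines.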
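
Several changelog entries above concern stream filters: cascaded filters, the ASCIIHex filter, and the error trap for missing gzip support. As a rough illustration of the idea, and not the package's actual code, the sketch below decodes a stream by applying each filter in the order listed, which is how the PDF specification defines a filter array. All function and class names are hypothetical, only two filters are handled, and FlateDecode is assumed to map to PHP's `gzuncompress()`.

```php
<?php
/** Hypothetical package-specific exception for this sketch. */
class FilterSketchException extends Exception
{
}

/** Decodes ASCIIHexDecode data: hex digit pairs, whitespace ignored, '>' ends the data. */
function asciiHexDecodeSketch($data)
{
    $end = strpos($data, '>');
    if ($end !== false) {
        $data = substr($data, 0, $end); // discard the end-of-data marker and anything after it
    }
    // Strip the PDF whitespace characters (NUL, tab, LF, FF, CR, space).
    $hex = preg_replace('/[\x00\t\n\f\r ]+/', '', $data);
    if (preg_match('/[^0-9A-Fa-f]/', $hex)) {
        throw new FilterSketchException('Non-hex character in ASCIIHexDecode data');
    }
    if (strlen($hex) % 2 === 1) {
        $hex .= '0'; // an odd final digit is treated as if followed by 0
    }
    return pack('H*', $hex);
}

/** Inflates FlateDecode data, trapping a missing zlib extension. */
function flateDecodeSketch($data)
{
    if (!function_exists('gzuncompress')) {
        throw new FilterSketchException('zlib extension is not available');
    }
    $out = @gzuncompress($data);
    if ($out === false) {
        throw new FilterSketchException('FlateDecode data could not be inflated');
    }
    return $out;
}

/** Applies the named filters in order, feeding each one's output to the next. */
function applyFiltersSketch($data, array $filters)
{
    $decoders = array(
        'ASCIIHexDecode' => 'asciiHexDecodeSketch',
        'FlateDecode'    => 'flateDecodeSketch',
    );
    foreach ($filters as $name) {
        if (!isset($decoders[$name])) {
            throw new FilterSketchException("Unsupported filter: $name");
        }
        $data = call_user_func($decoders[$name], $data);
    }
    return $data;
}
```

A stream whose dictionary lists ASCIIHexDecode followed by FlateDecode would be decoded with `applyFiltersSketch($raw, array('ASCIIHexDecode', 'FlateDecode'))`. Throwing an exception rather than calling `exit` or `die` keeps the caller in control, in line with the PEAR compliance changes noted in the changelog.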