org.apache.pdfbox.pdfparser
Class NonSequentialPDFParser

java.lang.Object
  extended by org.apache.pdfbox.pdfparser.BaseParser
      extended by org.apache.pdfbox.pdfparser.PDFParser
          extended by org.apache.pdfbox.pdfparser.NonSequentialPDFParser

public class NonSequentialPDFParser
extends PDFParser

A PDFParser that first reads startxref and the xref tables to learn which objects are valid, and then parses only those objects. This brings it closer to a conforming parser than the sequential reading done by PDFParser. This class can be used as a PDFParser replacement. parse() must be called before page objects can be retrieved, e.g. via getPDDocument(). This class is a much enhanced version of the QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.


Field Summary
protected static int DEFAULT_TRAIL_BYTECOUNT
           
protected static char[] EOF_MARKER
          EOF-marker.
protected static char[] OBJ_MARKER
          obj-marker.
protected  SecurityHandler securityHandler
          The security handler.
protected static char[] STARTXREF_MARKER
          StartXRef-marker.
static String SYSPROP_EOFLOOKUPRANGE
           
static String SYSPROP_PARSEMINIMAL
           
static String TMP_FILE_PREFIX
           
 
Fields inherited from class org.apache.pdfbox.pdfparser.PDFParser
isFDFDocment, xrefTrailerResolver
 
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, forceParsing, pdfSource, PROP_PUSHBACK_SIZE
 
Constructor Summary
NonSequentialPDFParser(File file, RandomAccess raBuf)
          Constructs parser for given file using given buffer for temporary storage.
NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword)
          Constructs parser for given file using given buffer for temporary storage.
NonSequentialPDFParser(InputStream input)
          Constructs parser for given input stream.
NonSequentialPDFParser(InputStream input, RandomAccess raBuf, String decryptionPassword)
          Constructs parser for given input stream using given buffer for temporary storage.
NonSequentialPDFParser(String filename)
          Constructs parser for given file using memory buffer.
 
Method Summary
protected  void decrypt(COSBase pb, int objNr, int objGenNr)
          Decrypts given object.
protected  void decryptDictionary(COSDictionary dict, long objNr, long objGenNr)
          Decrypts given COSDictionary.
protected  void decryptString(COSString str, long objNr, long objGenNr)
          Decrypts given COSString.
protected  void deleteTempFile()
          Remove the temporary file.
 PDPage getPage(int pageNr)
          Returns the page requested with all the objects loaded into it.
 int getPageNumber()
          Returns the number of pages in a document.
 PDDocument getPDDocument()
          This will get the PD document that was parsed.
protected  File getPdfFile()
          Return the pdf file.
 SecurityHandler getSecurityHandler()
          Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
protected  long getStartxrefOffset()
          Looks for and parses startxref.
protected  void initialParse()
          The initial parse will first parse only the trailer, the startxref and all xref tables to obtain a pointer (offset) to each of the pdf's objects.
 boolean isLenient()
          Return true if parser is lenient.
protected  int lastIndexOf(char[] pattern, byte[] buf, int endOff)
          Searches last appearance of pattern within buffer.
 void parse()
          This will parse the stream and populate the COSDocument object.
protected  COSStream parseCOSStream(COSDictionary dic, RandomAccess file)
          This will read a COSStream from the input stream using length attribute within dictionary.
protected  COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj)
          This will parse the next object from the stream and add it to the local state.
protected  COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj)
          This will parse the next object from the stream and add it to the local state.
protected  void readPattern(char[] pattern)
          Reads given pattern from BaseParser.pdfSource.
protected  void releasePdfSourceInputStream()
          Enable handling of alternative pdfSource implementation.
 void setEOFLookupRange(int byteCount)
          Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker.
 void setLenient(boolean lenient)
          Change the parser leniency flag.
protected  void setPdfSource(long fileOffset)
          Sets BaseParser.pdfSource to start next parsing at given file offset.
 
Methods inherited from class org.apache.pdfbox.pdfparser.PDFParser
clearResources, getDocument, getFDFDocument, isContinueOnError, parseHeader, parseStartXref, parseTrailer, parseXrefStream, parseXrefStream, parseXrefTable, readVersionInTrailer, setTempDirectory
 
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseCOSString, parseDirObject, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, readUntilEndStream, setDocument, skipSpaces
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SYSPROP_PARSEMINIMAL

public static final String SYSPROP_PARSEMINIMAL
See Also:
Constant Field Values

SYSPROP_EOFLOOKUPRANGE

public static final String SYSPROP_EOFLOOKUPRANGE
See Also:
Constant Field Values

DEFAULT_TRAIL_BYTECOUNT

protected static final int DEFAULT_TRAIL_BYTECOUNT
See Also:
Constant Field Values

EOF_MARKER

protected static final char[] EOF_MARKER
EOF-marker.


STARTXREF_MARKER

protected static final char[] STARTXREF_MARKER
StartXRef-marker.


OBJ_MARKER

protected static final char[] OBJ_MARKER
obj-marker.


securityHandler

protected SecurityHandler securityHandler
The security handler.


TMP_FILE_PREFIX

public static final String TMP_FILE_PREFIX
See Also:
Constant Field Values
Constructor Detail

NonSequentialPDFParser

public NonSequentialPDFParser(String filename)
                       throws IOException
Constructs parser for given file using memory buffer.

Parameters:
filename - the filename of the pdf to be parsed
Throws:
IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(File file,
                              RandomAccess raBuf)
                       throws IOException
Constructs parser for given file using given buffer for temporary storage.

Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
Throws:
IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(File file,
                              RandomAccess raBuf,
                              String decryptionPassword)
                       throws IOException
Constructs parser for given file using given buffer for temporary storage.

Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
Throws:
IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(InputStream input)
                       throws IOException
Constructs parser for given input stream. A temporary file is created when this constructor is used.

Parameters:
input - input stream representing the pdf.
Throws:
IOException - If something went wrong.

NonSequentialPDFParser

public NonSequentialPDFParser(InputStream input,
                              RandomAccess raBuf,
                              String decryptionPassword)
                       throws IOException
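Constructs parser for given input stream using given buffer for temporary storage. A temporary file is created when this constructor is used.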

Parameters:
input - input stream representing the pdf.
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption.
Throws:
IOException - If something went wrong.
Method Detail

setEOFLookupRange

public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker. If not set we use default value DEFAULT_TRAIL_BYTECOUNT.

If the system property named by SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overridden later by calling this method.

Parameters:
byteCount - number of trailing bytes
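
The precedence described here (built-in default, then system property on initialization, then an explicit call to the setter) can be sketched as follows. This is a minimal illustration, not the library's code: the class name EofLookupRangeSketch, the property key "sketch.eofLookupRange" and the default of 2048 are assumptions, since this page does not state the actual values of SYSPROP_EOFLOOKUPRANGE or DEFAULT_TRAIL_BYTECOUNT.

```java
final class EofLookupRangeSketch {
    // Hypothetical stand-ins: the real constant values are not given in this document.
    static final String SYSPROP_EOFLOOKUPRANGE = "sketch.eofLookupRange";
    static final int DEFAULT_TRAIL_BYTECOUNT = 2048;

    // Starts out with the default value.
    private int readTrailBytes = DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // On initialization, a defined system property replaces the default.
        String configured = System.getProperty(SYSPROP_EOFLOOKUPRANGE);
        if (configured != null) {
            try {
                readTrailBytes = Integer.parseInt(configured.trim());
            } catch (NumberFormatException e) {
                // Assumption: an unparsable value leaves the default in place.
            }
        }
    }

    // A later explicit call overrides whatever was set on initialization.
    void setEOFLookupRange(int byteCount) {
        readTrailBytes = byteCount;
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```

With a real parser the same effect would presumably be achieved by passing the property with -D on the JVM command line or by calling setEOFLookupRange(int) before parse().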

initialParse

protected void initialParse()
                     throws IOException
The initial parse will first parse only the trailer, the startxref and all xref tables to obtain a pointer (offset) to each of the pdf's objects. It can handle linearized pdfs, which have an xref at the end pointing to an xref at the beginning of the file. Finally, the root object is parsed.

Throws:
IOException - If something went wrong.
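
The idea of one xref section pointing back to another can be illustrated with a small sketch. It is not taken from the parser: the class XrefChainSketch and the map prevByOffset are hypothetical, and the loop guard is an assumption about how a robust implementation might protect itself. In the PDF file format, a trailer may carry a Prev entry holding the offset of an earlier xref section; the map below stands in for that information.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class XrefChainSketch {
    /**
     * Walks from the xref section at startOffset to each section it points back to.
     * prevByOffset maps an xref section's offset to the offset named by its trailer,
     * and has no entry for a section that points nowhere.
     */
    static List<Long> collectXrefOffsets(long startOffset, Map<Long, Long> prevByOffset) {
        List<Long> order = new ArrayList<Long>();
        Set<Long> seen = new HashSet<Long>();
        Long current = startOffset;
        // seen.add returns false for an offset visited before, which ends a cyclic chain.
        while (current != null && seen.add(current)) {
            order.add(current);
            current = prevByOffset.get(current);
        }
        return order;
    }
}
```

For a linearized file the chain would typically have two entries: the section found via startxref at the end of the file, followed by the section near the beginning.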

setPdfSource

protected final void setPdfSource(long fileOffset)
                           throws IOException
Sets BaseParser.pdfSource to start next parsing at given file offset.

Parameters:
fileOffset - file offset
Throws:
IOException - If something went wrong.

releasePdfSourceInputStream

protected final void releasePdfSourceInputStream()
                                          throws IOException
Enable handling of alternative pdfSource implementation.

Throws:
IOException - If something went wrong.

getStartxrefOffset

protected final long getStartxrefOffset()
                                 throws IOException
Looks for and parses startxref. We first look for the last '%%EOF' marker within the last DEFAULT_TRAIL_BYTECOUNT bytes (or within the range set via setEOFLookupRange(int)) and then search backwards from there to find startxref.

Returns:
the offset of StartXref
Throws:
IOException - If something went wrong.
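
The two-step lookup (last '%%EOF' in the trailing bytes, then the last 'startxref' before it) can be sketched with plain string searches. This is an illustration under stated assumptions, not the method's implementation: the class and method names are hypothetical, the whole file is assumed to be available as a byte array, and the sketch returns the file offset of the 'startxref' keyword, which may or may not be the value the real method returns.

```java
import java.nio.charset.StandardCharsets;

final class StartxrefLookupSketch {
    /**
     * Returns the file offset of the last 'startxref' keyword that precedes the
     * last '%%EOF' marker, looking only at the final trailByteCount bytes.
     * Returns -1 if either marker is missing from that range.
     */
    static long findStartxrefKeyword(byte[] file, int trailByteCount) {
        int tailStart = Math.max(0, file.length - Math.max(0, trailByteCount));
        // ISO-8859-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        String tail = new String(file, tailStart, file.length - tailStart,
                StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        // Search backwards, starting just before the EOF marker.
        int startxref = tail.lastIndexOf("startxref", eof - 1);
        if (startxref < 0) {
            return -1;
        }
        return (long) tailStart + startxref;
    }
}
```

The second test case one might try is a lookup range that is too small to contain the keyword: the sketch then reports -1, which is the situation setEOFLookupRange(int) exists to address.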

lastIndexOf

protected int lastIndexOf(char[] pattern,
                          byte[] buf,
                          int endOff)
Searches for the last appearance of pattern within buffer. The lookup starts before endOff and goes back until 0.

Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
Returns:
start offset of pattern within buffer or -1 if pattern could not be found
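
A backward search with an exclusive end offset can be written as below. This is a sketch of the technique only; the class name LastIndexOfSketch is hypothetical, and clamping endOff to the buffer length is an added safety assumption rather than documented behaviour.

```java
final class LastIndexOfSketch {
    /**
     * Returns the start offset of the last occurrence of pattern that lies
     * entirely before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        int end = Math.min(endOff, buf.length);
        // Highest start position at which the whole pattern still ends before 'end'.
        for (int start = end - pattern.length; start >= 0; start--) {
            int i = 0;
            // Bytes are compared with chars directly, which is adequate for ASCII patterns.
            while (i < pattern.length && buf[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }
}
```

For the buffer "abcXYZabcXYZ" and the pattern "abc", the sketch yields 6 with endOff 12 and 0 with endOff 8, because the occurrence at offset 6 would extend past offset 8.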

readPattern

protected final void readPattern(char[] pattern)
                          throws IOException
Reads given pattern from BaseParser.pdfSource, skipping whitespace at start and end.

Parameters:
pattern - pattern to be skipped
Throws:
IOException - if pattern could not be read
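
The behaviour described (skip leading whitespace, match the pattern exactly, skip trailing whitespace, fail with an IOException otherwise) can be sketched on top of java.io.PushbackInputStream. The class ReadPatternSketch and its helpers are hypothetical and only treat the six PDF whitespace characters; the stream type actually behind BaseParser.pdfSource is not stated in this page.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

final class ReadPatternSketch {
    // PDF whitespace characters: NUL, TAB, LF, FF, CR and SPACE.
    private static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    // Consumes whitespace and pushes back the first non-whitespace byte, if any.
    private static void skipSpaces(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c != -1 && isWhitespace(c)) {
            c = in.read();
        }
        if (c != -1) {
            in.unread(c);
        }
    }

    static void readPattern(PushbackInputStream in, char[] pattern) throws IOException {
        skipSpaces(in);
        for (char expected : pattern) {
            int actual = in.read();
            if (actual != expected) {
                String got = actual == -1 ? "end of stream" : "'" + (char) actual + "'";
                throw new IOException("Expected '" + expected + "' but got " + got);
            }
        }
        skipSpaces(in);
    }
}
```

After a successful call the stream is positioned at the first non-whitespace byte following the pattern, so the next token can be read without further skipping.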

parse

public void parse()
           throws IOException
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.

Overrides:
parse in class PDFParser
Throws:
IOException - If there is an error reading from the stream or corrupt data is found.

getPdfFile

protected File getPdfFile()
Return the pdf file.

Returns:
the pdf file

isLenient

public boolean isLenient()
Returns true if the parser is lenient, meaning that the auto-healing capabilities of the parser are used.

Returns:
true if parser is lenient

setLenient

public void setLenient(boolean lenient)
                throws IllegalArgumentException
Change the parser leniency flag. This method can only be called before the parsing of the file.

Parameters:
lenient - true to make the parser lenient, false otherwise
Throws:
IllegalArgumentException - if the method is called after parsing.
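
The "only before parsing" contract amounts to a simple state guard, sketched below. The class LenientFlagSketch is hypothetical, and the initial value of the flag is an assumption; this page does not state the parser's default leniency.

```java
final class LenientFlagSketch {
    // Assumption: the flag starts out as true; the actual default is not documented here.
    private boolean lenient = true;
    private boolean parsed = false;

    boolean isLenient() {
        return lenient;
    }

    void setLenient(boolean lenient) {
        // Changing the flag once parsing has happened is rejected.
        if (parsed) {
            throw new IllegalArgumentException("Cannot change leniency after parsing");
        }
        this.lenient = lenient;
    }

    void parse() {
        // Stand-in for the real parsing work; it only records that parsing took place.
        parsed = true;
    }
}
```

In practice this means the flag should be configured right after construction, before parse() is called.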

deleteTempFile

protected void deleteTempFile()
Removes the temporary file. A temporary file is created if this class is instantiated with an InputStream.


getSecurityHandler

public SecurityHandler getSecurityHandler()
Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.

Returns:
the security handler.

getPDDocument

public PDDocument getPDDocument()
                         throws IOException
This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources. Overriding the super method was necessary in order to set the security handler.

Overrides:
getPDDocument in class PDFParser
Returns:
The document at the PD layer.
Throws:
IOException - If there is an error getting the document.

getPageNumber

public int getPageNumber()
                  throws IOException
Returns the number of pages in a document.

Returns:
the number of pages.
Throws:
IOException - if PAGES or other needed object is missing

getPage

public PDPage getPage(int pageNr)
               throws IOException
Returns the page requested with all the objects loaded into it.

Parameters:
pageNr - zero-based page index, from 0 to the number of pages minus 1.
Returns:
the page with the given pagenumber.
Throws:
IOException - If something went wrong.

parseObjectDynamically

protected final COSBase parseObjectDynamically(COSObject obj,
                                               boolean requireExistingNotCompressedObj)
                                        throws IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.

Parameters:
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
Returns:
the parsed object (which is also added to document object)
Throws:
IOException - If an IO error occurs.

parseObjectDynamically

protected COSBase parseObjectDynamically(int objNr,
                                         int objGenNr,
                                         boolean requireExistingNotCompressedObj)
                                  throws IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.

Parameters:
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
Returns:
the parsed object (which is also added to document object)
Throws:
IOException - If an IO error occurs.

decryptDictionary

protected final void decryptDictionary(COSDictionary dict,
                                       long objNr,
                                       long objGenNr)
                                throws IOException
Decrypts given COSDictionary.

Parameters:
dict - the dictionary to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong

decryptString

protected final void decryptString(COSString str,
                                   long objNr,
                                   long objGenNr)
                            throws IOException
Decrypts given COSString.

Parameters:
str - the string to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong

decrypt

protected final void decrypt(COSBase pb,
                             int objNr,
                             int objGenNr)
                      throws IOException
Decrypts given object.

Parameters:
pb - the object to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong

parseCOSStream

protected COSStream parseCOSStream(COSDictionary dic,
                                   RandomAccess file)
                            throws IOException
This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj', so it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.

Overrides:
parseCOSStream in class BaseParser
Parameters:
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
Returns:
parsed pdf stream.
Throws:
IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
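
Length-based reading can be illustrated with a small sketch: copy exactly the declared number of bytes without inspecting them, then insist that 'endstream' follows. The class LengthBasedStreamSketch is hypothetical; it assumes the length has already been resolved to an int, that the input is positioned at the first byte of stream data, and that only CR, LF or space may separate the data from the keyword.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class LengthBasedStreamSketch {
    static byte[] readStreamBody(InputStream in, int length) throws IOException {
        byte[] data = new byte[length];
        int off = 0;
        // Copy exactly 'length' bytes; their content is never examined.
        while (off < length) {
            int n = in.read(data, off, length - off);
            if (n < 0) {
                throw new EOFException("Stream too short: expected " + length
                        + " bytes, got " + off);
            }
            off += n;
        }

        // Skip the optional end-of-line (or space) between the data and the keyword.
        int c = in.read();
        while (c == '\r' || c == '\n' || c == ' ') {
            c = in.read();
        }
        // The keyword must follow immediately, otherwise the length was wrong.
        byte[] keyword = "endstream".getBytes(StandardCharsets.US_ASCII);
        for (int i = 0; i < keyword.length; i++) {
            if (c != keyword[i]) {
                throw new IOException("'endstream' expected after stream data");
            }
            if (i < keyword.length - 1) {
                c = in.read();
            }
        }
        return data;
    }
}
```

Because the data is copied by count, a body such as "x endstream y" is returned intact when the declared length is 13, while a wrong length is detected by the keyword check or by running out of input.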


Copyright © 2002-2014 The Apache Software Foundation. All Rights Reserved.