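public class NonSequentialPDFParser extends PDFParser

A PDFParser which first reads startxref and the xref tables in order to know the valid objects, and parses only these objects. It is thus closer to a conforming parser than the sequential reading of PDFParser.

This class can be used as a PDFParser replacement. parse() must be called first, before page objects can be retrieved, e.g. via getPDDocument().

This class is a much enhanced version of QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.

| Modifier and Type | Field and Description |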
|---|---|
| protected static int | DEFAULT_TRAIL_BYTECOUNT |
| protected static char[] | EOF_MARKER |
| protected static char[] | OBJ_MARKER |
| protected SecurityHandler | securityHandler |
| protected static char[] | STARTXREF_MARKER |
| static String | SYSPROP_EOFLOOKUPRANGE |
| static String | SYSPROP_PARSEMINIMAL |
| static String | TMP_FILE_PREFIX |

Fields inherited from class PDFParser: xrefTrailerResolver

Fields inherited from class BaseParser: DEF, document, ENDOBJ, ENDSTREAM, FORCE_PARSING, forceParsing, pdfSource, PROP_PUSHBACK_SIZE

| Constructor and Description |
|---|
| NonSequentialPDFParser(File file, RandomAccess raBuf) - Constructs parser for given file using given buffer for temporary storage. |
| NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword) - Constructs parser for given file using given buffer for temporary storage. |
| NonSequentialPDFParser(InputStream input) |
| NonSequentialPDFParser(String filename) - Constructs parser for given file using memory buffer. |


| Modifier and Type | Method and Description |
|---|---|
| protected void | decrypt(COSString str, long objNr, long objGenNr) - Decrypts given COSString. |
| protected void | deleteTempFile() - Remove the temporary file. |
| PDPage | getPage(int pageNr) - Returns the page requested with all the objects loaded into it. |
| int | getPageNumber() - Returns the number of pages in a document. |
| PDDocument | getPDDocument() - This will get the PD document that was parsed. |
| protected File | getPdfFile() |
| SecurityHandler | getSecurityHandler() - Returns security handler of the document or null if document is not encrypted or parse() wasn't called before. |
| protected long | getStartxrefOffset() - Looks for and parses startxref. |
| protected void | initialParse() - The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects. |
| protected int | lastIndexOf(char[] pattern, byte[] buf, int endOff) - Searches last appearance of pattern within buffer. |
| void | parse() - This will parse the stream and populate the COSDocument object. |
| protected COSStream | parseCOSStream(COSDictionary dic, RandomAccess file) - This will read a COSStream from the input stream using length attribute within dictionary. |
| protected COSBase | parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state. |
| protected COSBase | parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state. |
| protected void | readPattern(char[] pattern) - Reads given pattern from BaseParser.pdfSource. |
| protected void | releasePdfSourceInputStream() - Enable handling of alternative pdfSource implementation. |
| void | setEOFLookupRange(int byteCount) - Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker. |
| protected void | setPdfSource(long fileOffset) - Sets BaseParser.pdfSource to start next parsing at given file offset. |

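The backward search that lastIndexOf(char[] pattern, byte[] buf, int endOff) is described as performing can be illustrated with a small stand-alone sketch. This is not the PDFBox source: the class name LastIndexOfSketch is hypothetical, and the handling of an empty pattern and of an endOff beyond the buffer length are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone sketch; not taken from PDFBox. */
final class LastIndexOfSketch {

    /**
     * Returns the start offset of the last occurrence of pattern that lies
     * entirely before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        if (pattern.length == 0) {
            return -1; // assumption: an empty pattern is never "found"
        }
        // Assumption: an endOff past the buffer end is clamped to the buffer length.
        int limit = Math.min(endOff, buf.length);
        // Try every candidate start position, moving from the end towards the front.
        for (int start = limit - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] buf = "startxref\n116\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        char[] eofMarker = {'%', '%', 'E', 'O', 'F'};
        System.out.println(lastIndexOf(eofMarker, buf, buf.length)); // prints 14
    }
}
```

Because endOff is exclusive, passing the offset of a previous match as endOff continues the search strictly before that match.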
Methods inherited from class PDFParser: getDocument, getFDFDocument, isContinueOnError, parseStartXref, parseTrailer, parseXrefStream, parseXrefTable, setTempDirectory

Methods inherited from class BaseParser: isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseDirObject, readExpectedString, readInt, readLine, readString, readString, setDocument, skipSpaces

public static final String SYSPROP_PARSEMINIMAL
public static final String SYSPROP_EOFLOOKUPRANGE
protected static final int DEFAULT_TRAIL_BYTECOUNT
protected static final char[] EOF_MARKER
protected static final char[] STARTXREF_MARKER
protected static final char[] OBJ_MARKER
protected SecurityHandler securityHandler
public static final String TMP_FILE_PREFIX
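SYSPROP_EOFLOOKUPRANGE names a system property that can preset the EOF lookup range, with DEFAULT_TRAIL_BYTECOUNT as the fallback. The following is a minimal sketch of how such a property might be read. The property key and the default value below are hypothetical placeholders, since this page does not state the actual constant values, and ignoring non-positive or non-numeric values is likewise an assumption.

```java
/** Hypothetical sketch of reading a lookup range from a system property. */
final class EofLookupRangeSketch {

    // Placeholder key and default; the real constant values are not given on this page.
    static final String SYSPROP_KEY = "example.pdfparser.eofLookupRange";
    static final int DEFAULT_BYTECOUNT = 2048;

    /** Returns the property value if it is a positive integer, else the default. */
    static int initialLookupRange() {
        String raw = System.getProperty(SYSPROP_KEY);
        if (raw != null) {
            try {
                int value = Integer.parseInt(raw.trim());
                if (value > 0) {
                    return value;
                }
            } catch (NumberFormatException e) {
                // Not a number: fall through to the default.
            }
        }
        return DEFAULT_BYTECOUNT;
    }

    public static void main(String[] args) {
        System.out.println(initialLookupRange());
    }
}
```

With this sketch, starting the JVM with -Dexample.pdfparser.eofLookupRange=4096 would yield 4096 instead of the placeholder default; a later call such as setEOFLookupRange(int) could still override the value.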
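
public NonSequentialPDFParser(String filename) throws IOException
Constructs parser for given file using memory buffer.
Parameters:
filename - the filename of the pdf to be parsed
Throws:
IOException - If something went wrong.

public NonSequentialPDFParser(File file, RandomAccess raBuf) throws IOException
Constructs parser for given file using given buffer for temporary storage.
Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
Throws:
IOException - If something went wrong.

public NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword) throws IOException
Constructs parser for given file using given buffer for temporary storage.
Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
Throws:
IOException - If something went wrong.

public NonSequentialPDFParser(InputStream input) throws IOException
Throws:
IOException

public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker. If not set, the default value is DEFAULT_TRAIL_BYTECOUNT. In case system property SYSPROP_EOFLOOKUPRANGE is defined, this value will be set on initialization but can be overwritten later.
Parameters:
byteCount - number of trailing bytes

protected void initialParse() throws IOException
The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects.
Throws:
IOException

protected final void setPdfSource(long fileOffset) throws IOException
Sets BaseParser.pdfSource to start next parsing at given file offset.
Throws:
IOException

protected final void releasePdfSourceInputStream() throws IOException
Enable handling of alternative pdfSource implementation.
Throws:
IOException

protected final long getStartxrefOffset() throws IOException
Looks for and parses startxref. The EOF marker is searched for within the last DEFAULT_TRAIL_BYTECOUNT bytes (or the range set via setEOFLookupRange(int)); the search then goes back from there to find startxref.
Throws:
IOException

protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
Searches last appearance of pattern within buffer.
Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
Returns:
offset of the pattern within the buffer, or -1 if pattern could not be found

protected final void readPattern(char[] pattern) throws IOException
Reads given pattern from BaseParser.pdfSource, skipping whitespace at start and end.
Throws:
IOException - if pattern could not be read

public void parse() throws IOException
This will parse the stream and populate the COSDocument object.
Overrides:
parse in class PDFParser
Throws:
IOException - If there is an error reading from the stream or corrupt data is found.

protected File getPdfFile()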
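The lookup described for getStartxrefOffset() (find the EOF marker in the trailing bytes, then go back to startxref and read the offset after it) can be sketched without PDFBox. This is only an illustration under stated assumptions: the class and method names are hypothetical, the marker strings "%%EOF" and "startxref" are assumed from the PDF file format rather than taken from this page, and String.lastIndexOf is swapped in for the byte-level lastIndexOf of the parser.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a trailing-bytes startxref lookup; not the PDFBox source. */
final class StartxrefSketch {

    /** Reads at most trailByteCount bytes from the end of the given file. */
    static byte[] readTail(String path, int trailByteCount) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            int count = (int) Math.min(length, (long) trailByteCount);
            byte[] tail = new byte[count];
            raf.seek(length - count);
            raf.readFully(tail);
            return tail;
        }
    }

    /** Returns the number that follows the last 'startxref' before the last '%%EOF'. */
    static long startxrefOffset(byte[] tail) throws IOException {
        // ISO-8859-1 maps every byte to exactly one char, so offsets are preserved.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IOException("Missing end of file marker '%%EOF'");
        }
        // Search backwards, starting at the EOF marker position.
        int marker = text.lastIndexOf("startxref", eof);
        if (marker < 0) {
            throw new IOException("Missing 'startxref' marker");
        }
        String number = text.substring(marker + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(number);
        } catch (NumberFormatException e) {
            throw new IOException("Invalid startxref offset: " + number);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] tail = "trailer\n<< /Size 8 >>\nstartxref\n116\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(startxrefOffset(tail)); // prints 116
    }
}
```

If the markers lie further from the end of the file than the bytes read, the lookup fails, which is why the range is adjustable through setEOFLookupRange(int).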

protected void deleteTempFile()
Remove the temporary file.

public SecurityHandler getSecurityHandler()
Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.

public PDDocument getPDDocument() throws IOException
This will get the PD document that was parsed.
Overrides:
getPDDocument in class PDFParser
Throws:
IOException - If there is an error getting the document.

public int getPageNumber() throws IOException
Returns the number of pages in a document.
Throws:
IOException - if PAGES or other needed object is missing

public PDPage getPage(int pageNr) throws IOException
Returns the page requested with all the objects loaded into it.
Parameters:
pageNr - page number, starting from 0 up to the number of pages.
Throws:
IOException - If something went wrong.

protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws IOException
This will parse the next object from the stream and add it to the local state. The logic is adapted from PDFParser and reduced to parsing an indirect object.
Parameters:
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
Throws:
IOException - If an IO error occurs.

protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws IOException
This will parse the next object from the stream and add it to the local state. The logic is adapted from PDFParser and reduced to parsing an indirect object.
Parameters:
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
Throws:
IOException - If an IO error occurs.

protected final void decrypt(COSString str, long objNr, long objGenNr) throws IOException
Decrypts given COSString.
Throws:
IOException

protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file) throws IOException
This will read a COSStream from the input stream using length attribute within dictionary.
Overrides:
parseCOSStream in class BaseParser
Parameters:
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
Throws:
IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.

Copyright © 2002-2013 The Apache Software Foundation. All Rights Reserved.