java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
org.apache.pdfbox.pdfparser.PDFParser
org.apache.pdfbox.pdfparser.NonSequentialPDFParser
public class NonSequentialPDFParser
A PDFParser that first reads startxref and the xref tables to learn which
objects are valid, and then parses only those objects. This makes it closer
to a conforming parser than the sequential reading done by PDFParser.
This class can be used as a PDFParser replacement. parse() must be called
before page objects can be retrieved, e.g. via getPDDocument().
This class is a much enhanced version of QuickParser presented
in PDFBOX-1104 by Jeremy Villalobos.
| Field Summary | |
|---|---|
| protected static int | DEFAULT_TRAIL_BYTECOUNT |
| protected static char[] | EOF_MARKER - EOF-marker. |
| protected static char[] | OBJ_MARKER - obj-marker. |
| protected SecurityHandler | securityHandler - The security handler. |
| protected static char[] | STARTXREF_MARKER - StartXRef-marker. |
| static String | SYSPROP_EOFLOOKUPRANGE |
| static String | SYSPROP_PARSEMINIMAL |
| static String | TMP_FILE_PREFIX |
| Fields inherited from class org.apache.pdfbox.pdfparser.PDFParser |
|---|
| isFDFDocment, xrefTrailerResolver |
| Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser |
|---|
| DEF, document, ENDOBJ, ENDSTREAM, forceParsing, pdfSource, PROP_PUSHBACK_SIZE |
| Constructor Summary | |
|---|---|
| NonSequentialPDFParser(File file, RandomAccess raBuf) | Constructs parser for given file using given buffer for temporary storage. |
| NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword) | Constructs parser for given file using given buffer for temporary storage. |
| NonSequentialPDFParser(InputStream input) | Constructor. |
| NonSequentialPDFParser(InputStream input, RandomAccess raBuf, String decryptionPassword) | Constructor. |
| NonSequentialPDFParser(String filename) | Constructs parser for given file using memory buffer. |
| Method Summary | |
|---|---|
| protected void | decrypt(COSBase pb, int objNr, int objGenNr) - Decrypts given object. |
| protected void | decryptDictionary(COSDictionary dict, long objNr, long objGenNr) |
| protected void | decryptString(COSString str, long objNr, long objGenNr) - Decrypts given COSString. |
| protected void | deleteTempFile() - Remove the temporary file. |
| PDPage | getPage(int pageNr) - Returns the page requested with all the objects loaded into it. |
| int | getPageNumber() - Returns the number of pages in a document. |
| PDDocument | getPDDocument() - This will get the PD document that was parsed. |
| protected File | getPdfFile() - Return the pdf file. |
| SecurityHandler | getSecurityHandler() - Returns security handler of the document or null if document is not encrypted or parse() wasn't called before. |
| protected long | getStartxrefOffset() - Looks for and parses startxref. |
| protected void | initialParse() - The initial parse will first parse only the trailer, the startxref and all xref tables to have a pointer (offset) to all the pdf's objects. |
| boolean | isLenient() - Return true if parser is lenient. |
| protected int | lastIndexOf(char[] pattern, byte[] buf, int endOff) - Searches last appearance of pattern within buffer. |
| void | parse() - This will parse the stream and populate the COSDocument object. |
| protected COSStream | parseCOSStream(COSDictionary dic, RandomAccess file) - This will read a COSStream from the input stream using length attribute within dictionary. |
| protected COSBase | parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state. |
| protected COSBase | parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state. |
| protected void | readPattern(char[] pattern) - Reads given pattern from BaseParser.pdfSource. |
| protected void | releasePdfSourceInputStream() - Enable handling of alternative pdfSource implementation. |
| void | setEOFLookupRange(int byteCount) - Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker. |
| void | setLenient(boolean lenient) - Change the parser leniency flag. |
| protected void | setPdfSource(long fileOffset) - Sets BaseParser.pdfSource to start next parsing at given file offset. |
| Methods inherited from class org.apache.pdfbox.pdfparser.PDFParser |
|---|
| clearResources, getDocument, getFDFDocument, isContinueOnError, parseHeader, parseStartXref, parseTrailer, parseXrefStream, parseXrefStream, parseXrefTable, readVersionInTrailer, setTempDirectory |
| Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser |
|---|
| isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseCOSString, parseDirObject, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, readUntilEndStream, setDocument, skipSpaces |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final String SYSPROP_PARSEMINIMAL
public static final String SYSPROP_EOFLOOKUPRANGE
protected static final int DEFAULT_TRAIL_BYTECOUNT
protected static final char[] EOF_MARKER
protected static final char[] STARTXREF_MARKER
protected static final char[] OBJ_MARKER
protected SecurityHandler securityHandler
public static final String TMP_FILE_PREFIX
| Constructor Detail |
|---|
public NonSequentialPDFParser(String filename)
throws IOException
filename - the filename of the pdf to be parsed
IOException - If something went wrong.
public NonSequentialPDFParser(File file,
RandomAccess raBuf)
throws IOException
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
IOException - If something went wrong.
public NonSequentialPDFParser(File file,
RandomAccess raBuf,
String decryptionPassword)
throws IOException
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
IOException - If something went wrong.
public NonSequentialPDFParser(InputStream input)
throws IOException
input - input stream representing the pdf.
IOException - If something went wrong.
public NonSequentialPDFParser(InputStream input,
RandomAccess raBuf,
String decryptionPassword)
throws IOException
input - input stream representing the pdf.
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption.
IOException - If something went wrong.
| Method Detail |
|---|
public void setEOFLookupRange(int byteCount)
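Sets how many trailing bytes of the PDF file are searched for the EOF marker
and the 'startxref' marker. The default is DEFAULT_TRAIL_BYTECOUNT.
If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is
applied on initialization, but it can be overwritten later by calling
this method.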
byteCount - number of trailing bytes
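The order of precedence for the lookup range (built-in default, then system property at initialization, then an explicit setter call) can be sketched roughly as follows. This is only an illustration: the property name `example.eofLookupRange`, the default of 2048, the rule that non-positive values are ignored, and the class `EofLookupRangeSketch` are all assumptions, since this page does not give the actual values of SYSPROP_EOFLOOKUPRANGE and DEFAULT_TRAIL_BYTECOUNT.

```java
final class EofLookupRangeSketch {
    // Hypothetical property name; the real value of SYSPROP_EOFLOOKUPRANGE is not shown on this page.
    static final String SYSPROP_EOFLOOKUPRANGE = "example.eofLookupRange";
    // Assumed default; stands in for DEFAULT_TRAIL_BYTECOUNT.
    static final int DEFAULT_TRAIL_BYTECOUNT = 2048;

    private int readTrailBytes = DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // Integer.getInteger returns null if the property is unset or not a valid integer.
        Integer fromProperty = Integer.getInteger(SYSPROP_EOFLOOKUPRANGE);
        if (fromProperty != null) {
            setEOFLookupRange(fromProperty);
        }
    }

    void setEOFLookupRange(int byteCount) {
        // Assumption for this sketch: non-positive values are ignored.
        if (byteCount > 0) {
            readTrailBytes = byteCount;
        }
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }

    public static void main(String[] args) {
        System.setProperty(SYSPROP_EOFLOOKUPRANGE, "4096");
        EofLookupRangeSketch parser = new EofLookupRangeSketch();
        System.out.println(parser.getEOFLookupRange()); // value taken from the system property
        parser.setEOFLookupRange(8192);
        System.out.println(parser.getEOFLookupRange()); // value overwritten by the setter
    }
}
```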
protected void initialParse()
throws IOException
IOException - If something went wrong.
protected final void setPdfSource(long fileOffset)
throws IOException
Sets BaseParser.pdfSource to start next parsing at given file offset.
fileOffset - file offset
IOException - If something went wrong.
protected final void releasePdfSourceInputStream()
throws IOException
IOException - If something went wrong.
protected final long getStartxrefOffset()
throws IOException
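Looks for and parses startxref. The EOF marker is searched for within the
last DEFAULT_TRAIL_BYTECOUNT bytes (or the range set via
setEOFLookupRange(int)); from there the search goes back to find
startxref.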
IOException - If something went wrong.
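The lookup strategy can be illustrated with a small, library-free sketch that works on an in-memory byte array. It is not the code used by this class: the class `StartxrefSketch`, the method `findStartxrefOffset` and the error messages are hypothetical, and the sketch assumes a non-negative `trailByteCount` and that the offset is written as plain decimal digits between `startxref` and `%%EOF`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class StartxrefSketch {
    /** Returns the number written after the last 'startxref' that precedes the last '%%EOF'. */
    static long findStartxrefOffset(byte[] pdf, int trailByteCount) throws IOException {
        // Only the trailing bytes of the file are inspected.
        int from = Math.max(0, pdf.length - trailByteCount);
        // ISO-8859-1 maps each byte to exactly one char, so string indexes match byte offsets.
        String tail = new String(pdf, from, pdf.length - from, StandardCharsets.ISO_8859_1);

        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IOException("Missing end of file marker '%%EOF'");
        }
        // Search backwards from the EOF marker for the startxref keyword.
        int marker = tail.lastIndexOf("startxref", eof);
        if (marker < 0) {
            throw new IOException("Missing 'startxref' marker");
        }
        String digits = tail.substring(marker + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(digits);
        } catch (NumberFormatException e) {
            throw new IOException("Invalid startxref offset: " + digits, e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4\ntrailer\n<<>>\nstartxref\n30\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStartxrefOffset(pdf, 2048));
    }
}
```

If the lookup range is too small to contain both markers, the sketch fails with an IOException, which is the situation a larger value passed to setEOFLookupRange(int) is meant to address.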
protected int lastIndexOf(char[] pattern,
byte[] buf,
int endOff)
Searches last appearance of pattern within buffer.
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
Returns -1 if pattern could not be found
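A backward search with an exclusive end offset can be sketched as below. This is a naive illustration, not the implementation used by this class; the class name `LastIndexOfSketch` is hypothetical, and the sketch assumes `endOff` is not larger than `buf.length` and that a match returns the offset of its first byte.

```java
import java.nio.charset.StandardCharsets;

final class LastIndexOfSketch {
    /**
     * Returns the start offset of the last occurrence of pattern that lies
     * completely before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        // Try every candidate start position, moving from the end towards the beginning.
        for (int start = endOff - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] buf = "startxref\n116\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        int eof = lastIndexOf("%%EOF".toCharArray(), buf, buf.length);
        // Continue the search before the EOF marker to locate startxref.
        int startxref = lastIndexOf("startxref".toCharArray(), buf, eof);
        System.out.println(eof + " " + startxref);
    }
}
```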
protected final void readPattern(char[] pattern)
throws IOException
Reads given pattern from BaseParser.pdfSource, skipping whitespace at
start and end.
pattern - pattern to be skipped
IOException - if pattern could not be read
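The "skip whitespace, match the pattern, skip whitespace" behaviour can be sketched with a plain java.io.PushbackInputStream. The class `ReadPatternSketch` and its helpers are hypothetical stand-ins, not the BaseParser methods, and the whitespace set used here (NUL, tab, LF, form feed, CR, space) is an assumption based on the PDF whitespace characters.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class ReadPatternSketch {
    // Assumed whitespace set: NUL, tab, LF, form feed, CR, space.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    static void skipSpaces(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c != -1 && isWhitespace(c)) {
            c = in.read();
        }
        if (c != -1) {
            in.unread(c); // put back the first non-whitespace byte
        }
    }

    /** Skips leading whitespace, consumes the pattern, then skips trailing whitespace. */
    static void readPattern(PushbackInputStream in, char[] pattern) throws IOException {
        skipSpaces(in);
        for (char expected : pattern) {
            int actual = in.read();
            if (actual != expected) {
                throw new IOException("Expected '" + expected + "' but read " + actual);
            }
        }
        skipSpaces(in);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "  obj\n<< >>".getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
        readPattern(in, "obj".toCharArray());
        System.out.println((char) in.read()); // first byte after the pattern and whitespace
    }
}
```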
public void parse()
throws IOException
Overrides: parse in class PDFParser
IOException - If there is an error reading from the stream or corrupt data
is found.
protected File getPdfFile()
public boolean isLenient()
public void setLenient(boolean lenient)
throws IllegalArgumentException
lenient - true if the parser should be lenient
IllegalArgumentException - if the method is called after parsing.
protected void deleteTempFile()
public SecurityHandler getSecurityHandler()
Returns security handler of the document or null if document
is not encrypted or parse() wasn't called before.
public PDDocument getPDDocument()
throws IOException
Overrides: getPDDocument in class PDFParser
IOException - If there is an error getting the document.
public int getPageNumber()
throws IOException
IOException - if PAGES or other needed object is missing
public PDPage getPage(int pageNr)
throws IOException
pageNr - zero-based page index (the first page is 0)
IOException - If something went wrong.
protected final COSBase parseObjectDynamically(COSObject obj,
boolean requireExistingNotCompressedObj)
throws IOException
This will parse the next object from the stream and add it to the local
state. Adapted from PDFParser and reduced to parsing an indirect object.
obj - object to be parsed (we only take object number and generation
number for lookup start offset)
requireExistingNotCompressedObj - if true object to be
parsed must not be contained within compressed stream
IOException - If an IO error occurs.
protected COSBase parseObjectDynamically(int objNr,
int objGenNr,
boolean requireExistingNotCompressedObj)
throws IOException
This will parse the next object from the stream and add it to the local
state. Adapted from PDFParser and reduced to parsing an indirect object.
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true the object to
be parsed must be defined in xref (note: null objects may
be missing from xref) and it must not be a compressed object
within an object stream (this is used to avoid getting stuck
in a loop in a malicious PDF)
IOException - If an IO error occurs.
protected final void decryptDictionary(COSDictionary dict,
long objNr,
long objGenNr)
throws IOException
dict - the dictionary to be decrypted
objNr - the object number
objGenNr - the object generation number
IOException - if something went wrong
protected final void decryptString(COSString str,
long objNr,
long objGenNr)
throws IOException
str - the string to be decrypted
objNr - the object number
objGenNr - the object generation number
IOException - if something went wrong
protected final void decrypt(COSBase pb,
int objNr,
int objGenNr)
throws IOException
pb - the object to be decrypted
objNr - the object number
objGenNr - the object generation number
IOException - if something went wrong
protected COSStream parseCOSStream(COSDictionary dic,
RandomAccess file)
throws IOException
Overrides: parseCOSStream in class BaseParser
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
IOException - if an error occurred reading the stream, like
problems with reading length attribute, stream does not end
with 'endstream' after data read, stream too short etc.
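The failure cases listed for parseCOSStream (stream too short, data not followed by 'endstream') can be illustrated with a simplified, library-free sketch. It operates on a byte array instead of a COSDictionary and RandomAccess; the class `StreamLengthSketch`, the method `readStreamData` and its parameters are hypothetical, and the sketch assumes only CR or LF may sit between the data and the `endstream` keyword.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StreamLengthSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** Reads exactly length bytes starting at dataStart and verifies that 'endstream' follows. */
    static byte[] readStreamData(byte[] src, int dataStart, int length) throws IOException {
        if (length < 0 || dataStart < 0 || (long) dataStart + length > src.length) {
            throw new IOException("Stream too short for declared length " + length);
        }
        int pos = dataStart + length;
        // Assumption: only end-of-line bytes may separate the data from the keyword.
        while (pos < src.length && (src[pos] == '\r' || src[pos] == '\n')) {
            pos++;
        }
        int end = pos + ENDSTREAM.length;
        if (end > src.length || !Arrays.equals(Arrays.copyOfRange(src, pos, end), ENDSTREAM)) {
            throw new IOException("Stream does not end with 'endstream' after " + length + " bytes");
        }
        return Arrays.copyOfRange(src, dataStart, dataStart + length);
    }

    public static void main(String[] args) throws IOException {
        byte[] src = "stream\nHello\nendstream\nendobj".getBytes(StandardCharsets.ISO_8859_1);
        // The stream data starts right after "stream\n", i.e. at offset 7.
        byte[] data = readStreamData(src, 7, 5);
        System.out.println(new String(data, StandardCharsets.ISO_8859_1));
    }
}
```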