parser module

class pdf4py.parser.Parser(source, password=None)

Parse a PDF document to retrieve PDF objects composing it.

The constructor takes a source argument, the sequence of bytes that encodes the PDF document. It can be a bytes object, a bytearray, or a file pointer opened for reading in binary mode. The optional second argument is the password to provide if the document is protected by encryption (of type str if the document is encrypted with AESV3, bytes otherwise). For example,

>>> from pdf4py.parser import Parser
>>> with open('path/to/file.pdf', 'rb') as fp:
...     parser = Parser(fp)

This creates a new instance of Parser. The constructor reads the Cross Reference Table of the PDF document to retrieve the list of PDF objects that are present and parsable in the document. The Cross Reference Table is then available as an attribute of the newly created Parser instance. For more information about the cross reference table, see the XRefTable documentation.

After instantiation, parser holds an XRefTable instance in its xreftable attribute. To retrieve PDF objects, pass entries of the table to the Parser.parse_reference method.
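The workflow described above can be sketched as follows. This is a minimal sketch, not an official recipe: the helper name parse_all_objects is hypothetical, and it only uses the calls documented in this section (the Parser constructor, the xreftable attribute, and parse_reference). Since objects are parsed lazily, the sketch assumes the file must stay open while parse_reference is called, so all the work happens inside the with block.

```python
def parse_all_objects(path, password=None):
    """Return every in-use or compressed object of the PDF at `path`.

    Hypothetical helper: it is not part of pdf4py.
    """
    # Imported here so that the sketch can be loaded without pdf4py installed.
    from pdf4py.parser import Parser

    objects = []
    with open(path, 'rb') as fp:
        parser = Parser(fp, password)
        # Iterating over the table yields in-use and compressed entries only.
        for entry in parser.xreftable:
            # Parsing happens here, on demand, not in the constructor.
            objects.append(parser.parse_reference(entry))
    return objects
```

Calling parse_all_objects('path/to/file.pdf') requires pdf4py to be installed and the path to point at a real PDF document.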

parse_reference(reference)

Parse and retrieve the PDF object that reference points to.

Notes

PDF objects are not parsed when an instance of Parser is created. Instead, parsing occurs when this method is called. To avoid parsing the same object repeatedly, an LRU cache keeps the last 256 parsed objects in memory.
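The lazy parsing and caching behaviour described in this note can be illustrated with a toy model. This is a sketch under stated assumptions, not pdf4py's implementation: the class LazyParser, its parse_count attribute, and the use of functools.lru_cache are all choices made for illustration, and "parsing" is replaced by a trivial decode.

```python
from functools import lru_cache


class LazyParser:
    """Toy stand-in for a parser that parses on demand and caches results."""

    def __init__(self, raw_objects):
        # Maps (seq, gen) tuples to raw bytes; nothing is parsed yet.
        self._raw = raw_objects
        self.parse_count = 0
        # One cache per instance, holding at most the 256 most recent results.
        self.parse_reference = lru_cache(maxsize=256)(self._parse_uncached)

    def _parse_uncached(self, key):
        # Stand-in for the real parsing work.
        self.parse_count += 1
        return self._raw[key].decode('ascii').upper()


parser = LazyParser({(1, 0): b'hello'})
first = parser.parse_reference((1, 0))   # parsed now
second = parser.parse_reference((1, 0))  # served from the cache
```

After the two calls, parse_count is 1, because the second lookup never reaches the parsing code.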

Parameters: reference (XrefInUseEntry or XrefCompressedEntry or PDFReference) – An entry in the XRefTable or a PDFReference object pointing to the PDF object within the file that has to be parsed.
Returns: obj – The parsed PDF object.
Return type: one of the types used to represent a PDF object.
Raises: ValueError if the type of reference is not one of the accepted types.
class pdf4py.parser.SequentialParser(source, **kwargs)

Implements a parser that is able to parse PDF objects by scanning the input byte sequence.

In other words, objects are extracted in the order they appear in the stream. For this reason it is used to parse Content Streams.

Note that this class is not able to parse a complete PDF file since the process requires random access in the file to retrieve information when required (for example to resolve a reference pointing at the Integer holding the length of a stream). However, this class is used in defining the more powerful Parser.
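To make the idea of sequential parsing concrete, here is a deliberately simplified scanner that yields objects in the order they appear in a byte sequence. It is only a sketch: the function scan_objects and its token pattern are hypothetical, they handle only names, numbers, and bare keywords, and they do not reflect how SequentialParser is implemented.

```python
import re
from typing import Iterator, Union

# Simplified token pattern: names (/F1), numbers (12, -0.5), bare keywords (Tf).
_TOKEN = re.compile(
    rb"/[^\s/\[\]()<>]+"                  # name
    rb"|[+-]?(?:\d+\.\d*|\.\d+|\d+)"      # real or integer number
    rb"|[A-Za-z'\"*]+"                    # keyword or operator
)


def scan_objects(data: bytes) -> Iterator[Union[int, float, str]]:
    """Yield tokens from `data` strictly in the order they appear."""
    for match in _TOKEN.finditer(data):
        text = match.group().decode('ascii')
        if text[0] in '+-.0123456789':
            # Numbers become Python numbers; a dot marks a real.
            yield float(text) if '.' in text else int(text)
        else:
            # Names and keywords are kept as plain strings.
            yield text


tokens = list(scan_objects(b'BT /F1 12 Tf 72.5 712 Td ET'))
# tokens == ['BT', '/F1', 12, 'Tf', 72.5, 712, 'Td', 'ET']
```

Nothing here needs random access: each token is produced from the bytes at the current position, which is why this style suits content streams but not a whole file with indirect references.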

The constructor that users must call takes one positional argument, source, the source byte stream. It can be a bytes object, a bytearray, or a file pointer opened in binary mode. Other keyword arguments are used internally in pdf4py, specifically by the Parser class.

parse_object(obj_num: Optional[tuple] = None)

Parse the next PDF object from the token stream.

Parameters: obj_num (tuple) – Tuple (seq, gen), seq and gen being respectively the sequence number and the generation number of the object that is going to be parsed. These values are known when parsing is triggered by an XRefTable lookup. This parameter is used only by the Parser class when the PDF is encrypted.
Returns: obj – The parsed PDF object.
Return type: one of the PDF types defined in module types
class pdf4py.parser.XRefTable(previous: pdf4py.parser.XRefTable, inuse_objects: dict, free_objects: set, compressed_objects: Optional[dict] = None)

Implements the functionalities of a Cross Reference Table.

The Cross Reference Table (XRefTable) is the index of all the PDF objects in a PDF file. An object is uniquely identified by a tuple (s, g), where s is the sequence number and g is the generation number. There are three types of entries in such a table:

  • XrefInUseEntry entries, which represent objects that are part of the PDF document’s current structure;
  • tuple entries, which point at free objects, that is, objects that are no longer used (for example, because they were removed in a modification of the document);
  • XrefCompressedEntry entries, which represent objects that are in use but stored in a compressed stream.

The three entry types listed above are meant to be used with the Parser.parse_reference method to retrieve the associated object.

There are two main ways to query an XRefTable instance:

  • Iterating over the instance itself to get references to in-use and compressed objects (but not free objects).
  • Accessing a particular entry with square brackets, using a two-dimensional index made of the sequence and generation numbers. This works because the class implements the __getitem__ method, which the parser also uses to look up objects when required during parsing.
previous

Points to the XRefTable instance associated with the /Prev key in the trailer dictionary of the current cross-reference table.
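The two query styles and the previous link can be modeled with a small toy class. This is an illustrative sketch only: ToyXRefTable is hypothetical, plain integers stand in for XrefInUseEntry objects, compressed entries are left out, and the rule that a newer table shadows the one reachable through previous is an assumption based on how incremental updates work in PDF, not a statement about pdf4py's internals.

```python
from typing import Dict, Iterator, Optional, Set, Tuple

ObjectId = Tuple[int, int]  # (sequence number, generation number)


class ToyXRefTable:
    def __init__(self, previous: Optional['ToyXRefTable'],
                 inuse_objects: Dict[ObjectId, int],
                 free_objects: Set[ObjectId]):
        self.previous = previous
        self.inuse_objects = inuse_objects
        self.free_objects = free_objects

    def __getitem__(self, key: ObjectId):
        # table[seq, gen] passes the tuple (seq, gen) as `key`.
        if key in self.inuse_objects:
            return self.inuse_objects[key]
        if key in self.free_objects:
            return key  # free entries are plain tuples
        if self.previous is not None:
            return self.previous[key]  # assumed fallback to the older table
        raise KeyError(key)

    def __iter__(self) -> Iterator[int]:
        # Yield in-use entries only, newest table first, skipping anything
        # already seen or freed by a newer table.
        seen: Set[ObjectId] = set()
        table: Optional['ToyXRefTable'] = self
        while table is not None:
            for key, entry in table.inuse_objects.items():
                if key not in seen:
                    seen.add(key)
                    yield entry
            seen.update(table.free_objects)
            table = table.previous


old = ToyXRefTable(None, {(1, 0): 15, (2, 0): 80, (4, 0): 120}, set())
new = ToyXRefTable(old, {(1, 0): 300, (3, 0): 410}, {(2, 0)})
entry = new[1, 0]          # 300: the newer table wins
in_use = list(new)         # [300, 410, 120]: object (2, 0) is free, so skipped
```

The integers are placeholder byte offsets; in pdf4py the values returned by the table are the entry objects that parse_reference accepts.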