Components

class py_pdf_parser.components.ElementOrdering

A class enumerating the available presets for element_ordering.

class py_pdf_parser.components.PDFDocument(pages: Dict[int, Page], pdf_file_path: Optional[str] = None, font_mapping: Optional[Dict[str, str]] = None, font_mapping_is_regex: bool = False, regex_flags: int = 0, font_size_precision: int = 1, element_ordering: Union[py_pdf_parser.components.ElementOrdering, Callable[[List], List]] = <ElementOrdering.LEFT_TO_RIGHT_TOP_TO_BOTTOM: 1>)

Contains all information about the whole pdf document.

To instantiate, you should pass a dictionary mapping page numbers to pages, where each page is a Page namedtuple containing the width and height of the page, and a list of pdf elements (which should come directly from PDFMiner, i.e. should be PDFMiner LTComponents). On instantiation, the PDFDocument will convert all of these into PDFElement instances.

Parameters:
  • pages (dict[int, Page]) – A dictionary mapping page numbers (int) to pages, where each page is a Page namedtuple (containing a width, a height and a list of elements from PDFMiner).
  • pdf_file_path (str, optional) – A file path to the PDF file. This is optional, and is only used to display your pdf as a background image when using the visualise functions.
  • font_mapping (dict, optional) – PDFElements have a font attribute, and the font is taken from the PDF. You can map these fonts to your own internal font names by providing a font_mapping. This is a dictionary whose keys are the original fonts (including font size) and whose values are your new names.
  • font_mapping_is_regex (bool, optional) – Indicates whether font_mapping keys should be considered as regexes. In this case all the fonts will be matched with the regexes. It is only relevant if font_mapping is not None. Default: False.
  • regex_flags (int, optional) – Regex flags compatible with the re module. Default: 0.
  • font_size_precision (int) – How much rounding to apply to the font size. The font size will be rounded to this many decimal places.
  • element_ordering (ElementOrdering or callable, optional) – An ordering function for the elements. Either a member of the ElementOrdering Enum, or a callable which takes a list of elements and returns an ordered list of elements. This will be called separately for each page. Note that the elements in this case will be PDFMiner elements, and not PDFElements from this package.
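As a rough illustration of the callable form of element_ordering, the sketch below orders a page's elements top to bottom, then left to right. FakeElement is a hypothetical stand-in for a PDFMiner layout element and is not part of any library; it assumes only the x0 and y1 coordinate attributes, with the y axis increasing up the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeElement:
    """Hypothetical stand-in for a PDFMiner element (illustration only)."""

    name: str
    x0: float  # left edge
    y1: float  # top edge; y is assumed to increase up the page


def top_to_bottom_left_to_right(elements: List[FakeElement]) -> List[FakeElement]:
    """Return a new list ordered by descending top edge, then ascending left edge."""
    return sorted(elements, key=lambda elem: (-elem.y1, elem.x0))


page_elements = [
    FakeElement("footer", x0=50, y1=40),
    FakeElement("right column", x0=300, y1=700),
    FakeElement("left column", x0=50, y1=700),
]
ordered = top_to_bottom_left_to_right(page_elements)
print([elem.name for elem in ordered])
```

A function of this shape takes a list and returns an ordered list, so it could be passed as element_ordering when constructing the PDFDocument.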
number_of_pages

The total number of pages in the document.

Type:int
page_numbers

A list of available page numbers.

Type:list(int)
sectioning

Gives access to the sectioning utilities. See the documentation for the Sectioning class.

elements

An ElementList containing all elements in the document.

Returns:All elements in the document.
Return type:ElementList
fonts

A set of all the fonts in the document.

Returns:All the fonts in the document.
Return type:set[str]
get_page(page_number: int) → py_pdf_parser.components.PDFPage

Returns the PDFPage for the specified page_number.

Parameters:page_number (int) – The page number.
Raises:PageNotFoundError – If page_number was not found.
Returns:The requested page.
Return type:PDFPage
pages

A list of all pages in the document.

Returns:All pages in the document.
Return type:list[PDFPage]
class py_pdf_parser.components.PDFElement(document: PDFDocument, element: LTComponent, index: int, page_number: int, font_size_precision: int = 1)

A representation of a single element within the pdf.

You should not instantiate this yourself, but should let the PDFDocument do this.

Parameters:
  • document (PDFDocument) – A reference to the PDFDocument.
  • element (LTComponent) – A PDFMiner LTComponent.
  • index (int) – The index of the element within the document.
  • page_number (int) – The page number that the element is on.
  • font_size_precision (int) – How much rounding to apply to the font size. The font size will be rounded to this many decimal places.
original_element

A reference to the original PDFMiner element.

Type:LTComponent
tags

A set of tags that have been added to the element.

Type:set[str]
bounding_box

The box representing the location of the element.

Type:BoundingBox
add_tag(new_tag: str)

Adds the new_tag to the tags set.

Parameters:new_tag (str) – The tag you would like to add.
entirely_within(bounding_box: py_pdf_parser.common.BoundingBox) → bool

Whether the entire element is within the bounding box.

Parameters:bounding_box (BoundingBox) – The bounding box to check whether the element is within.
Returns:True if the element is entirely contained within the bounding box.
Return type:bool
font

The name and size of the font, separated by a comma with no spaces.

This will be taken from the pdf itself, using the first character in the element.

If you have provided a font_mapping, this is the string you should map. If the string is mapped in your font_mapping then the mapped value will be returned. font_mapping can have regexes as keys.

Returns:The font of the element.
Return type:str
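The lookup described above can be pictured roughly as follows. This is an illustrative sketch rather than the library's implementation: map_font is a hypothetical helper, the font strings are invented, and both the use of re.match and the "first matching pattern wins" rule are assumptions.

```python
import re
from typing import Dict, Optional


def map_font(
    font: str,
    font_mapping: Optional[Dict[str, str]] = None,
    font_mapping_is_regex: bool = False,
    regex_flags: int = 0,
) -> str:
    """Hypothetical helper: return the mapped name for font, or font itself."""
    if not font_mapping:
        return font
    if not font_mapping_is_regex:
        # Plain mapping: keys must equal the "name,size" string exactly.
        return font_mapping.get(font, font)
    for pattern, mapped_name in font_mapping.items():
        # Assumption: the first pattern that matches wins.
        if re.match(pattern, font, regex_flags):
            return mapped_name
    return font


exact = {"Helvetica-Bold,14.0": "heading"}
patterns = {r"helvetica-bold,.*": "heading", r"helvetica,.*": "body"}

print(map_font("Helvetica-Bold,14.0", exact))
print(map_font("Helvetica,10.0", patterns, font_mapping_is_regex=True,
               regex_flags=re.IGNORECASE))
print(map_font("Courier,9.0", exact))
```

In this sketch an unmapped font falls through unchanged, which mirrors the statement that the mapped value is returned only if the string is present in font_mapping.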
font_name

The name of the font.

This will be taken from the pdf itself, using the most common font within all the characters in the element.

Returns:The font name of the element.
Return type:str
font_size

The size of the font.

This will be taken from the pdf itself, using the most common size within all the characters in the element.

Returns:The font size of the element, rounded to the font_size_precision of the document.
Return type:float
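A minimal sketch of "most common size, then rounded", assuming the per-character sizes are already available as a list of floats. most_common_font_size is a hypothetical helper and the sizes are invented; how ties between equally common sizes are resolved is not covered here.

```python
from collections import Counter
from typing import List


def most_common_font_size(char_sizes: List[float], font_size_precision: int = 1) -> float:
    """Hypothetical helper: pick the most frequent size, then round it."""
    size, _count = Counter(char_sizes).most_common(1)[0]
    return round(size, font_size_precision)


# Three characters share one size; a single character (e.g. a superscript) differs.
sizes = [10.4567, 10.4567, 10.4567, 8.0]
print(most_common_font_size(sizes))
print(most_common_font_size(sizes, font_size_precision=2))
```

A lower precision makes sizes that differ only by extraction noise compare equal, which matters when font strings are used as font_mapping keys.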
ignore()

Marks the element as ignored.

The element will no longer be returned in any newly instantiated ElementList. Note that this includes calling any new filter functions on an existing ElementList, since doing so always returns a new ElementList.
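The behaviour can be modelled with a small stand-in. Element and Elements below are hypothetical classes written for this illustration, not the library's PDFElement and ElementList, and the sketch assumes that ignored elements are dropped only when a list is constructed, so a list built earlier keeps them.

```python
from typing import List


class Element:
    """Hypothetical stand-in for an element with an ignored flag."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.ignored = False

    def ignore(self) -> None:
        self.ignored = True


class Elements:
    """Hypothetical stand-in for a list that drops ignored elements when built."""

    def __init__(self, candidates: List[Element]) -> None:
        self._items = [elem for elem in candidates if not elem.ignored]

    def filter_containing(self, fragment: str) -> "Elements":
        # Filtering builds a new list, so the ignored check runs again.
        return Elements([elem for elem in self._items if fragment in elem.text])

    def texts(self) -> List[str]:
        return [elem.text for elem in self._items]


header, body = Element("Page header"), Element("Page body")
existing = Elements([header, body])
header.ignore()

print(existing.texts())                            # built before ignore()
print(existing.filter_containing("Page").texts())  # built after ignore()
```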

ignored

A flag specifying whether the element has been ignored.

page_number

The page_number of the element in the document.

Returns:The page number of the element.
Return type:int
partially_within(bounding_box: py_pdf_parser.common.BoundingBox) → bool

Whether any part of the element is within the bounding box.

Parameters:bounding_box (BoundingBox) – The bounding box to check whether the element is partially within.
Returns:True if any part of the element is within the bounding box.
Return type:bool
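The difference between entirely_within and partially_within can be sketched with plain rectangle arithmetic. Box is a hypothetical stand-in for a bounding box with x0 <= x1 and y0 <= y1, and the treatment of boxes that merely share an edge is an assumption of this sketch, not a statement about the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a bounding box (illustration only)."""

    x0: float
    x1: float
    y0: float
    y1: float


def entirely_within(inner: Box, outer: Box) -> bool:
    """True if every edge of inner lies inside (or on) the edges of outer."""
    return (
        outer.x0 <= inner.x0
        and inner.x1 <= outer.x1
        and outer.y0 <= inner.y0
        and inner.y1 <= outer.y1
    )


def partially_within(inner: Box, outer: Box) -> bool:
    """True if the boxes overlap on both axes (a shared edge alone does not count)."""
    overlaps_x = inner.x0 < outer.x1 and outer.x0 < inner.x1
    overlaps_y = inner.y0 < outer.y1 and outer.y0 < inner.y1
    return overlaps_x and overlaps_y


region = Box(x0=0, x1=100, y0=0, y1=100)
inside = Box(x0=10, x1=20, y0=10, y1=20)
straddling = Box(x0=90, x1=110, y0=10, y1=20)
outside = Box(x0=200, x1=210, y0=10, y1=20)

for name, box in [("inside", inside), ("straddling", straddling), ("outside", outside)]:
    print(name, entirely_within(box, region), partially_within(box, region))
```

In this model an element that is entirely within a box is also partially within it, while an element straddling the border satisfies only the partial check.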
text(stripped: bool = True) → str

The text contained in the element.

Parameters:stripped (bool, optional) – Whether to strip the text of the element. Default: True.
Returns:The text contained in the element.
Return type:str
class py_pdf_parser.components.PDFPage(document: py_pdf_parser.components.PDFDocument, width: int, height: int, page_number: int, start_element: py_pdf_parser.components.PDFElement, end_element: py_pdf_parser.components.PDFElement)

A representation of a page within the PDFDocument.

We store the width, height and page number of the page, along with the first and last element on the page. Because the elements are ordered, this allows us to easily determine all the elements on the page.

Parameters:
  • document (PDFDocument) – A reference to the PDFDocument.
  • width (int) – The width of the page.
  • height (int) – The height of the page.
  • page_number (int) – The page number.
  • start_element (PDFElement) – The first element on the page.
  • end_element (PDFElement) – The last element on the page.
elements

Returns an ElementList containing all elements on the page.

Returns:All the elements on the page.
Return type:ElementList
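Why storing only the first and last element is enough can be seen in a short sketch. PageSpan and elements_on_page are hypothetical names used for illustration, elements are represented by plain strings, and the sketch assumes each page's elements occupy a contiguous run of indexes in the document-wide ordering.

```python
from typing import List, NamedTuple


class PageSpan(NamedTuple):
    """Hypothetical stand-in for a page: indexes of its first and last element."""

    page_number: int
    start_index: int
    end_index: int  # inclusive


def elements_on_page(all_elements: List[str], page: PageSpan) -> List[str]:
    """Return the contiguous run of elements belonging to the page."""
    return all_elements[page.start_index : page.end_index + 1]


document_elements = ["title", "intro", "table", "footnote", "appendix"]
page_1 = PageSpan(page_number=1, start_index=0, end_index=2)
page_2 = PageSpan(page_number=2, start_index=3, end_index=4)

print(elements_on_page(document_elements, page_1))
print(elements_on_page(document_elements, page_2))
```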