Components¶
-
class
py_pdf_parser.components.
ElementOrdering
¶ A class enumerating the available presets for element_ordering.
-
class
py_pdf_parser.components.
PDFDocument
(pages: Dict[int, Page], pdf_file_path: Optional[str] = None, font_mapping: Optional[Dict[str, str]] = None, font_mapping_is_regex: bool = False, regex_flags: Union[int, re.RegexFlag] = 0, font_size_precision: int = 1, element_ordering: Union[py_pdf_parser.components.ElementOrdering, Callable[[List[T]], List[T]]] = <ElementOrdering.LEFT_TO_RIGHT_TOP_TO_BOTTOM: 1>)¶ Contains all information about the whole pdf document.
To instantiate, you should pass a dictionary mapping page numbers to pages, where each page is a Page namedtuple containing the width and height of the page, and a list of pdf elements (which should come directly from PDFMiner, i.e. should be PDFMiner LTComponents). On instantiation, the PDFDocument will convert all of these into PDFElement instances.
Parameters: - pages (dict[int, Page]) – A dictionary mapping page numbers (int) to pages, where pages are a Page namedtuple (containing a width, height and a list of elements from PDFMiner).
- pdf_file_path (str, optional) – A file path to the PDF file. This is optional, and is only used to display your pdf as a background image when using the visualise functions.
- font_mapping (dict, optional) – PDFElements have a font attribute, and the font is taken from the PDF. You can map these fonts to your own internal font names by providing a font_mapping. This is a dictionary whose keys are the original fonts (including font size) and whose values are your new names.
- font_mapping_is_regex (bool, optional) – Indicates whether font_mapping keys should be considered as regexes. In this case all the fonts will be matched with the regexes. It is only relevant if font_mapping is not None. Default: False.
- regex_flags (int or re.RegexFlag, optional) – Regex flags compatible with the re module. Default: 0.
- font_size_precision (int) – How much rounding to apply to the font size. The font size will be rounded to this many decimal places.
- element_ordering (ElementOrdering or callable, optional) – An ordering function for the elements. Either a member of the ElementOrdering Enum, or a callable which takes a list of elements and returns an ordered list of elements. This will be called separately for each page. Note that the elements in this case will be PDFMiner elements, and not PDFElements from this package.
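As a rough illustration of a custom element_ordering callable, the sketch below sorts one page's elements into rows from top to bottom, then left to right within a row. FakeElement and order_by_rows are hypothetical names: FakeElement only stands in for a PDFMiner element by exposing x0, y0, x1 and y1, and the sort key is an assumption about what a reasonable ordering might look like, not the library's own preset.

```python
from typing import List, NamedTuple


class FakeElement(NamedTuple):
    """Hypothetical stand-in with the x0/y0/x1/y1 attributes of a PDFMiner element."""

    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def order_by_rows(elements: List[FakeElement]) -> List[FakeElement]:
    """Takes a list of elements for one page and returns an ordered list."""
    # PDF coordinates start at the bottom-left corner, so a larger y1 means
    # the element sits higher on the page. Negate y1 to sort top first.
    return sorted(elements, key=lambda e: (-e.y1, e.x0))


page_elements = [
    FakeElement("footer", 50, 30, 200, 50),
    FakeElement("right header", 300, 780, 500, 800),
    FakeElement("left header", 50, 780, 250, 800),
]
ordered_names = [e.name for e in order_by_rows(page_elements)]
```

A callable of this shape would be passed as element_ordering in place of an ElementOrdering member, and would be called once per page.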
-
number_of_pages
¶ The total number of pages in the document.
Type: int
-
page_numbers
¶ A list of available page numbers.
Type: list(int)
-
sectioning
¶ Gives access to the sectioning utilities. See the documentation for the Sectioning class.
-
elements
¶ An ElementList containing all elements in the document.
Returns: All elements in the document. Return type: ElementList
-
fonts
¶ A set of all the fonts in the document.
Returns: All the fonts in the document. Return type: set[str]
-
class
py_pdf_parser.components.
PDFElement
(document: PDFDocument, element: LTComponent, index: int, page_number: int, font_size_precision: int = 1)¶ A representation of a single element within the pdf.
You should not instantiate this yourself, but should let the PDFDocument do this.
Parameters: - document (PDFDocument) – A reference to the PDFDocument.
- element (LTComponent) – A PDF Miner LTComponent.
- index (int) – The index of the element within the document.
- page_number (int) – The page number that the element is on.
- font_size_precision (int) – How much rounding to apply to the font size. The font size will be rounded to this many decimal places.
-
original_element
¶ A reference to the original PDF Miner element.
Type: LTComponent
-
tags
¶ A set of tags that have been added to the element.
Type: set[str]
-
bounding_box
¶ The box representing the location of the element.
Type: BoundingBox
-
add_tag
(new_tag: str) → None¶ Adds the new_tag to the tags set.
Parameters: new_tag (str) – The tag you would like to add.
-
entirely_within
(bounding_box: py_pdf_parser.common.BoundingBox) → bool¶ Whether the entire element is within the bounding box.
Parameters: bounding_box (BoundingBox) – The bounding box to check whether the element is within. Returns: True if the element is entirely contained within the bounding box. Return type: bool
-
font
¶ The name and size of the font, separated by a comma with no spaces.
This will be taken from the pdf itself, using the first character in the element.
If you have provided a font_mapping, this is the string you should map. If the string is mapped in your font_mapping then the mapped value will be returned. font_mapping can have regexes as keys.
Returns: The font of the element. Return type: str
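The lookup described above can be sketched in plain Python. map_font is a hypothetical helper, not part of the package, and the use of re.match (which anchors at the start of the string) is an assumption about how regex keys might be applied.

```python
import re
from typing import Dict, Optional


def map_font(
    font: str,
    font_mapping: Optional[Dict[str, str]] = None,
    font_mapping_is_regex: bool = False,
    regex_flags: int = 0,
) -> str:
    """Return the mapped name for a "name,size" font string, or the string itself."""
    if not font_mapping:
        return font
    if font_mapping_is_regex:
        # Assumption: the first key whose pattern matches wins.
        for pattern, mapped_name in font_mapping.items():
            if re.match(pattern, font, regex_flags):
                return mapped_name
        return font
    # Exact-key lookup, falling back to the original font string.
    return font_mapping.get(font, font)


exact = map_font("Helvetica-Bold,12.0", {"Helvetica-Bold,12.0": "header"})
by_regex = map_font(
    "Helvetica-Bold,12.0",
    {r"helvetica-bold,\d+": "header"},
    font_mapping_is_regex=True,
    regex_flags=re.IGNORECASE,
)
unmapped = map_font("Times,9.0", {"Helvetica-Bold,12.0": "header"})
```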
-
font_name
¶ The name of the font.
This will be taken from the pdf itself, using the most common font within all the characters in the element.
Returns: The font name of the element. Return type: str
-
font_size
¶ The size of the font.
This will be taken from the pdf itself, using the most common size within all the characters in the element.
Returns: The font size of the element, rounded to the font_size_precision of the document.
Return type: float
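To make the rounding concrete, here is a minimal sketch that picks the most common character size and rounds it to font_size_precision decimal places. most_common_rounded_size is a hypothetical helper, and the list of per-character sizes is made-up input rather than anything read from a PDF.

```python
from collections import Counter
from typing import Iterable


def most_common_rounded_size(
    char_sizes: Iterable[float], font_size_precision: int = 1
) -> float:
    """Pick the most frequent character size, then round it."""
    most_common_size, _count = Counter(char_sizes).most_common(1)[0]
    return round(most_common_size, font_size_precision)


# Two characters at 11.9999pt and one at 14pt: the most common size wins.
size = most_common_rounded_size([11.9999, 11.9999, 14.0])
```

Rounding before comparison is what lets sizes such as 11.9999 and 12.0 be treated as the same font size.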
-
ignore
() → None¶ Marks the element as ignored.
The element will no longer be returned in any newly instantiated ElementList. Note that this includes calling any new filter functions on an existing ElementList, since doing so always returns a new ElementList.
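The behaviour described above can be modelled with a toy example. ToyElement and ToyElementList are hypothetical stand-ins, not the package's classes; the only assumption carried over is that ignored elements are dropped whenever a new list is built, including by a filter.

```python
from typing import List


class ToyElement:
    def __init__(self, text: str) -> None:
        self.text = text
        self.ignored = False

    def ignore(self) -> None:
        self.ignored = True


class ToyElementList:
    def __init__(self, elements: List[ToyElement]) -> None:
        # Ignored elements are dropped at construction time only.
        self._elements = [e for e in elements if not e.ignored]

    def filter_by_text_contains(self, fragment: str) -> "ToyElementList":
        # Filtering always builds a new list, so the ignore check runs again.
        return ToyElementList([e for e in self._elements if fragment in e.text])

    def texts(self) -> List[str]:
        return [e.text for e in self._elements]


all_elements = [
    ToyElement("Invoice 42"),
    ToyElement("Page 1 of 3"),
    ToyElement("Invoice total"),
]
before = ToyElementList(all_elements)
all_elements[2].ignore()
after = ToyElementList(all_elements)
filtered = before.filter_by_text_contains("Invoice")
```

In this toy model the list built before the call to ignore() still holds all three elements, while the newly built list and the filtered list both omit the ignored one.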
-
ignored
¶ A flag specifying whether the element has been ignored.
-
page_number
¶ The page_number of the element in the document.
Returns: The page number of the element. Return type: int
-
partially_within
(bounding_box: py_pdf_parser.common.BoundingBox) → bool¶ Whether any part of the element is within the bounding box.
Parameters: bounding_box (BoundingBox) – The bounding box to check whether the element is partially within. Returns: True if any part of the element is within the bounding box. Return type: bool
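The difference between entirely_within and partially_within comes down to containment versus overlap of two rectangles. The sketch below uses a hypothetical Box tuple and free functions rather than the package's BoundingBox, and whether touching edges count as "within" is an assumption made here for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical rectangle with x0 <= x1 and y0 <= y1."""

    x0: float
    x1: float
    y0: float
    y1: float


def entirely_within(inner: Box, outer: Box) -> bool:
    # Every edge of inner lies on or inside the matching edge of outer.
    return (
        inner.x0 >= outer.x0
        and inner.x1 <= outer.x1
        and inner.y0 >= outer.y0
        and inner.y1 <= outer.y1
    )


def partially_within(element: Box, region: Box) -> bool:
    # The rectangles overlap on both axes (edges that merely touch do not count).
    return (
        element.x0 < region.x1
        and element.x1 > region.x0
        and element.y0 < region.y1
        and element.y1 > region.y0
    )


region = Box(x0=0, x1=100, y0=0, y1=100)
inside = Box(x0=10, x1=20, y0=10, y1=20)
straddling = Box(x0=90, x1=120, y0=50, y1=60)
outside = Box(x0=200, x1=220, y0=200, y1=220)
```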
-
text
(stripped: bool = True) → str¶ The text contained in the element.
Parameters: stripped (bool, optional) – Whether to strip the text of the element. Default: True. Returns: The text contained in the element. Return type: str
-
class
py_pdf_parser.components.
PDFPage
(document: py_pdf_parser.components.PDFDocument, width: int, height: int, page_number: int, start_element: py_pdf_parser.components.PDFElement, end_element: py_pdf_parser.components.PDFElement)¶ A representation of a page within the PDFDocument.
We store the width, height and page number of the page, along with the first and last element on the page. Because the elements are ordered, this allows us to easily determine all the elements on the page.
Parameters: - document (PDFDocument) – A reference to the PDFDocument.
- width (int) – The width of the page.
- height (int) – The height of the page.
- page_number (int) – The page number.
- start_element (PDFElement) – The first element on the page.
- end_element (PDFElement) – The last element on the page.
-
elements
¶ Returns an ElementList containing all elements on the page.
Returns: All the elements on the page. Return type: ElementList
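Because the document's elements are ordered, knowing the first and last element of a page is enough to recover everything in between. The following is a minimal sketch of that idea using list indexes; ToyPage and elements_on_page are hypothetical names, and plain strings stand in for elements.

```python
from typing import Dict, List, NamedTuple


class ToyPage(NamedTuple):
    page_number: int
    start_index: int  # index of the first element on the page
    end_index: int  # index of the last element on the page (inclusive)


def elements_on_page(ordered_elements: List[str], page: ToyPage) -> List[str]:
    # The slice end is exclusive, so add one to include the last element.
    return ordered_elements[page.start_index : page.end_index + 1]


document_elements = ["title", "intro", "table", "footnote", "appendix"]
pages: Dict[int, ToyPage] = {
    1: ToyPage(page_number=1, start_index=0, end_index=2),
    2: ToyPage(page_number=2, start_index=3, end_index=4),
}
first_page = elements_on_page(document_elements, pages[1])
second_page = elements_on_page(document_elements, pages[2])
```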