Filtering

class py_pdf_parser.filtering.ElementList(document: PDFDocument, indexes: Union[Set[int], FrozenSet[int], None] = None)

Used to represent a list of elements, and to enable filtering of those elements.

Any time you have a group of elements, for example pdf_document.elements or page.elements, you will get an ElementList. You can iterate through this, and also access specific elements. On top of this, there are lots of methods which you can use to further filter your elements. Since all of these methods return a new ElementList, you can chain these operations.

Internally, we keep a set of indexes corresponding to the PDFElements in the document. This means you can treat ElementLists like sets to combine different ElementLists together.

We often implement pluralised versions of methods, which is a shortcut to applying the or operator | to multiple ElementLists with the singular version applied, for example foo.filter_by_tags("bar", "baz") is the same as foo.filter_by_tag("bar") | foo.filter_by_tag("baz").

Similarly, chaining two filter commands is the same as applying the & operator, for example foo.filter_by_tag("bar").filter_by_tag("baz") is the same as foo.filter_by_tag("bar") & foo.filter_by_tag("baz"). Note that this is not the case for methods which do not filter, e.g. add_element.
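
The two equivalences above follow from set algebra on element indexes. The following is a minimal sketch using plain Python sets rather than the library itself; the tags_by_index mapping and the filter_by_tag helper are hypothetical stand-ins for illustration only.

```python
# Hypothetical stand-in data: element index -> tags on that element.
tags_by_index = {0: {"bar"}, 1: {"baz"}, 2: {"bar", "baz"}, 3: set()}


def filter_by_tag(indexes, tag):
    """Keep only the indexes whose element carries the given tag."""
    return frozenset(i for i in indexes if tag in tags_by_index[i])


foo = frozenset(tags_by_index)

# A pluralised filter behaves like the union (|) of the singular filters.
either = filter_by_tag(foo, "bar") | filter_by_tag(foo, "baz")

# Chained filters behave like the intersection (&) of the singular filters.
chained = filter_by_tag(filter_by_tag(foo, "bar"), "baz")
both = filter_by_tag(foo, "bar") & filter_by_tag(foo, "baz")
```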

Ignored elements will be excluded on instantiation. Each time you chain a new filter, a new ElementList is returned, so any elements which have been ignored in the meantime will be removed at that point.

Note

As ElementList is implemented using sets internally, you will not be able to have an element in an ElementList multiple times.

Parameters:
  • document (PDFDocument) – A reference to the PDF document
  • indexes (set, optional) – A set (or frozenset) of element indexes. Defaults to all elements in the document.
document

A reference to the PDF document.

Type:PDFDocument
indexes

A frozenset of element indexes.

Type:frozenset
__and__(other: py_pdf_parser.filtering.ElementList) → py_pdf_parser.filtering.ElementList

Returns an ElementList of elements that are in both ElementLists.

__contains__(element: PDFElement) → bool

Returns True if the element is in the ElementList, otherwise False.

__eq__(other: object) → bool

Returns True if the two ElementLists contain the same elements from the same document.

__getitem__(key: Union[int, slice]) → Union[PDFElement, ElementList]

Returns the element in position key of the ElementList if an int is given, or returns a new ElementList if a slice is given.

Elements are ordered by their original positions in the document, which is left-to-right, top-to-bottom (the same as you read).

__hash__() → int

Return hash(self).

__init__(document: PDFDocument, indexes: Union[Set[int], FrozenSet[int], None] = None)

Initialize self. See help(type(self)) for accurate signature.

__iter__() → py_pdf_parser.filtering.ElementIterator

Returns an ElementIterator class that allows iterating through elements.

Elements will be returned in order of the elements in the document, left-to-right, top-to-bottom (the same as you read).

__len__() → int

Returns the number of elements in the ElementList.

__or__(other: py_pdf_parser.filtering.ElementList) → py_pdf_parser.filtering.ElementList

Returns an ElementList of elements that are in either ElementList.

__repr__() → str

Return repr(self).

__sub__(other: py_pdf_parser.filtering.ElementList) → py_pdf_parser.filtering.ElementList

Returns an ElementList of elements that are in the first ElementList but not in the second.

__weakref__

list of weak references to the object (if defined)

__xor__(other: py_pdf_parser.filtering.ElementList) → py_pdf_parser.filtering.ElementList

Returns an ElementList of elements that are in either ElementList, but not both.
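
Since the indexes are kept in a set, the four operators above mirror Python's built-in set operators. A minimal sketch on plain frozensets of element indexes (the index values are made up for illustration, and this is not the library's own code):

```python
a = frozenset({0, 1, 2})
b = frozenset({2, 3})

in_both = a & b         # like __and__: elements in both lists
in_either = a | b       # like __or__: elements in either list
only_in_a = a - b       # like __sub__: elements in the first but not the second
in_exactly_one = a ^ b  # like __xor__: elements in either list, but not both
```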

above(element: PDFElement, inclusive: bool = False, all_pages: bool = False, tolerance: float = 0.0) → ElementList

Returns all elements which are above the given element.

If you draw a box from the top edge of the element to the top of the page, all elements which are partially within this box are returned. By default, only elements on the same page as the given element are included, but you can pass all_pages=True to also include the pages which come before (and so are above) the page containing the given element.

Note

By “above” we really mean “directly above”, i.e. the returned elements all have at least some part which is horizontally aligned with the specified element.

Note

Technically the element you specify will satisfy the condition, but we assume you do not want that element returned. If you do, you can pass inclusive=True.

Parameters:
  • element (PDFElement) – The element in question.
  • inclusive (bool, optional) – Whether to include the element in the returned results. Default: False.
  • all_pages (bool, optional) – Whether to include pages other than the page which the element is on. Default: False.
  • tolerance (float, optional) – To be counted as above, the elements must overlap by at least tolerance on the X axis. Tolerance is capped at half the width of the element. Default: 0.
Returns:

The filtered list.

Return type:

ElementList
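
The box-drawing description can be sketched in a few lines. This is only an illustration under stated assumptions: the Box class, partially_within and box_above are hypothetical names, the y axis is assumed to grow upwards (the usual PDF convention), and boxes which merely touch are treated as overlapping. The library's own implementation may differ in these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; y is assumed to grow upwards."""

    x0: float
    x1: float
    y0: float
    y1: float


def partially_within(inner: Box, outer: Box) -> bool:
    """True if the two boxes overlap at all (touching edges count here)."""
    return (
        inner.x0 <= outer.x1
        and inner.x1 >= outer.x0
        and inner.y0 <= outer.y1
        and inner.y1 >= outer.y0
    )


def box_above(element: Box, page_height: float, tolerance: float = 0.0) -> Box:
    """Build the search box from the element's top edge to the top of the page."""
    # Cap the tolerance at half the element's width so the box never inverts.
    tolerance = min(tolerance, (element.x1 - element.x0) / 2)
    return Box(element.x0 + tolerance, element.x1 - tolerance, element.y1, page_height)


target = Box(x0=100, x1=200, y0=300, y1=320)
search = box_above(target, page_height=800, tolerance=10)

directly_above = Box(x0=150, x1=250, y0=500, y1=520)   # overlaps the narrowed x-range
off_to_the_side = Box(x0=195, x1=260, y0=500, y1=520)  # overlaps the target by only 5 units on X
lower_down = Box(x0=100, x1=200, y0=100, y1=120)       # sits below the target
```

With a tolerance of 10, only directly_above falls partially within the search box; off_to_the_side would be included if the tolerance were 0.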

add_element(element: PDFElement) → ElementList

Explicitly adds the element to the ElementList.

Note

If the element is already in the ElementList, this does nothing.

Parameters:element (PDFElement) – The element to add.
Returns:A new list with the additional element.
Return type:ElementList
add_elements(*elements) → ElementList

Explicitly adds the elements to the ElementList.

Note

If the elements are already in the ElementList, this does nothing.

Parameters:*elements (PDFElement) – The elements to add.
Returns:A new list with the additional elements.
Return type:ElementList
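
Because a new list is returned and duplicates are impossible, adding elements behaves like a set union. A minimal sketch on plain frozensets of element indexes (the values are made up, and this is not the library's own code):

```python
original = frozenset({1, 2})

with_new = original | {3}       # adding a new element returns a larger set
with_existing = original | {2}  # adding an element already present changes nothing
```

Note that original itself is left untouched in both cases.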
add_tag_to_elements(tag: str) → None

Adds a tag to all elements in the list.

Parameters:tag (str) – The tag you would like to add.
after(element: PDFElement, inclusive: bool = False) → ElementList

Returns all elements after the specified element.

By after, we mean succeeding elements according to their index. The PDFDocument will order elements according to the specified element_ordering (which defaults to left to right, top to bottom).

Parameters:
  • element (PDFElement) – The element in question.
  • inclusive (bool, optional) – Whether to include the element in the returned results. Default: False.
Returns:

The filtered list.

Return type:

ElementList

before(element: PDFElement, inclusive: bool = False) → ElementList

Returns all elements before the specified element.

By before, we mean preceding elements according to their index. The PDFDocument will order elements according to the specified element_ordering (which defaults to left to right, top to bottom).

Parameters:
  • element (PDFElement) – The element in question.
  • inclusive (bool, optional) – Whether to include the element in the returned results. Default: False.
Returns:

The filtered list.

Return type:

ElementList

below(element: PDFElement, inclusive: bool = False, all_pages: bool = False, tolerance: float = 0.0) → ElementList

Returns all elements which are below the given element.

If you draw a box from the bottom edge of the element to the bottom of the page, all elements which are partially within this box are returned. By default, only elements on the same page as the given element are included, but you can pass all_pages=True to also include the pages which come after (and so are below) the page containing the given element.

Note

By “below” we really mean “directly below”, i.e. the returned elements all have at least some part which is horizontally aligned with the specified element.

Note

Technically the element you specify will satisfy the condition, but we assume you do not want that element returned. If you do, you can pass inclusive=True.

Parameters:
  • element (PDFElement) – The element in question.
  • inclusive (bool, optional) – Whether to include the element in the returned results. Default: False.
  • all_pages (bool, optional) – Whether to include pages other than the page which the element is on. Default: False.
  • tolerance (float, optional) – To be counted as below, the elements must overlap by at least tolerance on the X axis. Tolerance is capped at half the width of the element. Default: 0.
Returns:

The filtered list.

Return type:

ElementList

between(start_element: PDFElement, end_element: PDFElement, inclusive: bool = False) → ElementList

Returns all elements between the start and end elements.

This is done according to the element indexes. The PDFDocument will order elements according to the specified element_ordering (which defaults to left to right, top to bottom).

This is the same as applying after with start_element and before with end_element.

Parameters:
  • start_element (PDFElement) – Returned elements will be after this element.
  • end_element (PDFElement) – Returned elements will be before this element.
  • inclusive (bool, optional) – Whether to include start_element and end_element in the returned results. Default: False.
Returns:

The filtered list.

Return type:

ElementList
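
Since before, after and between all work on element indexes, between can be pictured as an intersection of the other two. A minimal sketch on plain integer indexes; the three helper functions are hypothetical stand-ins which take indexes directly rather than PDFElements:

```python
def after(indexes, index, inclusive=False):
    """Indexes which come after the given index."""
    return frozenset(i for i in indexes if i > index or (inclusive and i == index))


def before(indexes, index, inclusive=False):
    """Indexes which come before the given index."""
    return frozenset(i for i in indexes if i < index or (inclusive and i == index))


def between(indexes, start, end, inclusive=False):
    """Indexes after start and before end."""
    return after(indexes, start, inclusive) & before(indexes, end, inclusive)


indexes = frozenset(range(10))
exclusive = between(indexes, 3, 6)
inclusive = between(indexes, 3, 6, inclusive=True)
```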

extract_single_element() → PDFElement

Returns the only element in the ElementList, provided there is exactly one element.

This is mainly for convenience, when you think you’ve filtered down to a single element and you would like to extract said element.

Raises:
  • NoElementFoundError – If there are no elements in the ElementList
  • MultipleElementsFoundError – If there is more than one element in the ElementList
Returns:

The single element remaining in the list.

Return type:

PDFElement
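
The contract described above can be sketched as follows. The two exception classes are local stand-ins named after the ones listed under Raises, and extract_single is a hypothetical helper which works on any iterable; this is an illustration, not the library's own code.

```python
class NoElementFoundError(Exception):
    """Local stand-in for the exception of the same name."""


class MultipleElementsFoundError(Exception):
    """Local stand-in for the exception of the same name."""


def extract_single(items):
    """Return the only item, or raise if there are zero or several."""
    items = list(items)
    if not items:
        raise NoElementFoundError("There are no elements in the list")
    if len(items) > 1:
        raise MultipleElementsFoundError(
            f"There are {len(items)} elements in the list"
        )
    return items[0]


only = extract_single(["total"])
```

Raising in both failure cases means a filter which unexpectedly matches nothing, or too much, is caught early rather than silently returning the wrong element.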

filter_by_font(font: str) → py_pdf_parser.filtering.ElementList

Filter for elements containing only the given font.

Parameters:font (str) – The font to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_fonts(*fonts) → py_pdf_parser.filtering.ElementList

Filter for elements containing only one of the given fonts.

Parameters:*fonts (str) – The fonts to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_page(page_number: int) → py_pdf_parser.filtering.ElementList

Filter for elements on the given page.

Parameters:page_number (int) – The page to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_pages(*page_numbers) → py_pdf_parser.filtering.ElementList

Filter for elements on any of the given pages.

Parameters:*page_numbers (int) – The pages to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_regex(regex: str, regex_flags: Union[int, re.RegexFlag] = 0, stripped: bool = True) → py_pdf_parser.filtering.ElementList

Filter for elements given a regular expression.

Parameters:
  • regex (str) – The regex to filter for.
  • regex_flags (int or re.RegexFlag, optional) – Regex flags compatible with the re module. Default: 0.
  • stripped (bool, optional) – Whether to strip the text of the element before comparison. Default: True.
Returns:

The filtered list.

Return type:

ElementList
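
To show how regex_flags and stripped interact, here is a minimal sketch on plain strings. The matches helper is hypothetical, and it assumes the pattern is applied with re.match (anchored at the start of the text); the library may match differently, so treat this only as an illustration of the two parameters.

```python
import re


def matches(text, regex, regex_flags=0, stripped=True):
    """Hypothetical helper: does the (optionally stripped) text match the regex?"""
    if stripped:
        text = text.strip()
    # Assumption: match from the start of the text, as re.match does.
    return re.match(regex, text, flags=regex_flags) is not None


texts = ["  Invoice 42 ", "invoice 7", "Total: 10"]

default = [t for t in texts if matches(t, r"Invoice \d+")]
ignore_case = [t for t in texts if matches(t, r"invoice \d+", regex_flags=re.IGNORECASE)]
unstripped = [t for t in texts if matches(t, r"Invoice \d+", stripped=False)]
```

Under these assumptions, the leading whitespace in the first string prevents a match when stripped=False.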

filter_by_section(section_str: str) → py_pdf_parser.filtering.ElementList

Filter for elements within the given section.

See the sectioning documentation for more details.

Parameters:section_str (str) – The section to filter for.

Note

You need to specify an exact section, not just the name (i.e. “foo_0” not just “foo”).

Returns:The filtered list.
Return type:ElementList
filter_by_section_name(section_name: str) → py_pdf_parser.filtering.ElementList

Filter for elements within any section with the given name.

See the sectioning documentation for more details.

Parameters:section_name (str) – The section name to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_section_names(*section_names) → py_pdf_parser.filtering.ElementList

Filter for elements within any section with any of the given names.

See the sectioning documentation for more details.

Parameters:*section_names (str) – The section names to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_sections(*section_strs) → py_pdf_parser.filtering.ElementList

Filter for elements within any of the given sections.

See the sectioning documentation for more details.

Parameters:*section_strs (str) – The sections to filter for.

Note

You need to specify an exact section, not just the name (i.e. “foo_0” not just “foo”).

Returns:The filtered list.
Return type:ElementList
filter_by_tag(tag: str) → py_pdf_parser.filtering.ElementList

Filter for elements containing the given tag.

Parameters:tag (str) – The tag to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_tags(*tags) → py_pdf_parser.filtering.ElementList

Filter for elements containing any of the given tags.

Parameters:*tags (str) – The tags to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_text_contains(text: str) → py_pdf_parser.filtering.ElementList

Filter for elements whose text contains the given string.

Parameters:text (str) – The text to filter for.
Returns:The filtered list.
Return type:ElementList
filter_by_text_equal(text: str, stripped: bool = True) → py_pdf_parser.filtering.ElementList

Filter for elements whose text is exactly the given string.

Parameters:
  • text (str) – The text to filter for.
  • stripped (bool, optional) – Whether to strip the text of the element before comparison. Default: True.
Returns:

The filtered list.

Return type:

ElementList

filter_partially_within_bounding_box(bounding_box: py_pdf_parser.common.BoundingBox, page_number: int) → py_pdf_parser.filtering.ElementList

Returns all elements on the given page which are partially within the given box.

Parameters:
  • bounding_box (BoundingBox) – The bounding box to filter within.
  • page_number (int) – The page which you’d like to filter within the box.
Returns:

The filtered list.

Return type:

ElementList

horizontally_in_line_with(element: PDFElement, inclusive: bool = False, tolerance: float = 0.0) → ElementList

Returns all elements which are horizontally in line with the given element.

If you extend the top and bottom edges of the element to the left and right of the page, all elements which are partially within this box are returned.

This is equivalent to doing foo.to_the_left_of(…) | foo.to_the_right_of(…).

Note

Technically the element you specify will satisfy the condition, but we assume you do not want that element returned. If you do, you can pass inclusive=True.

Parameters:
  • element (PDFElement) – The element in question.
  • inclusive (bool, optional) – Whether to include the element in the returned results. Default: False.
  • tolerance (float, optional) – To be counted as in line with, the elements must overlap by at least tolerance on the Y axis. Tolerance is capped at half the height of the element. Default: 0.
Returns:

The filtered list.

Return type:

ElementList

ignore_elements() → None

Marks all the elements in the ElementList as ignored.

move_backwards_from(element: PDFElement, count: int = 1, capped: bool = False) → PDFElement

Returns the element in the element list obtained by moving backwards from element by count.

Parameters:
  • element (PDFElement) – The element to start at.
  • count (int, optional) – How many elements to move from element. The default of 1 will move backwards by one element. Passing 0 will simply return the element itself. You can also pass negative integers to move forwards.
  • capped (bool, optional) – By default (False), if the count is high enough that we try to move out of range of the list, an exception will be raised. Passing capped=True will change this behaviour to instead return the element at the start or end of the list.
Raises:

ElementOutOfRangeError – If the count is large (or large-negative) enough that we reach the start (or end) of the list. Only happens when capped=False.

move_forwards_from(element: PDFElement, count: int = 1, capped: bool = False) → PDFElement

Returns the element in the element list obtained by moving forwards from element by count.

Parameters:
  • element (PDFElement) – The element to start at.
  • count (int, optional) – How many elements to move from element. The default of 1 will move forwards by one element. Passing 0 will simply return the element itself. You can also pass negative integers to move backwards.
  • capped (bool, optional) – By default (False), if the count is high enough that we try to move out of range of the list, an exception will be raised. Passing capped=True will change this behaviour to instead return the element at the start or end of the list.
Raises:

ElementOutOfRangeError – If the count is large (or large-negative) enough that we reach the end (or start) of the list. Only happens when capped=False.
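
The interaction between count and capped can be sketched on an ordinary Python list. The exception class is a local stand-in named after the one listed under Raises, and both functions are hypothetical helpers which take the ordered list explicitly; this is not the library's own code.

```python
class ElementOutOfRangeError(Exception):
    """Local stand-in for the exception of the same name."""


def move_forwards_from(ordered, element, count=1, capped=False):
    """Return the item count places after element in the ordered list."""
    new_position = ordered.index(element) + count
    if 0 <= new_position < len(ordered):
        return ordered[new_position]
    if not capped:
        raise ElementOutOfRangeError(f"Position {new_position} is out of range")
    # Capped: clamp to whichever end of the list we ran past.
    return ordered[0] if new_position < 0 else ordered[-1]


def move_backwards_from(ordered, element, count=1, capped=False):
    """Moving backwards is moving forwards by the negated count."""
    return move_forwards_from(ordered, element, -count, capped)


elements = ["a", "b", "c", "d"]
next_element = move_forwards_from(elements, "b")
clamped_to_end = move_forwards_from(elements, "c", count=5, capped=True)
clamped_to_start = move_backwards_from(elements, "b", count=5, capped=True)
```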

remove_element(element: PDFElement) → ElementList

Explicitly removes the element from the ElementList.

Note

If the element is not in the ElementList, this does nothing.

Parameters:element (PDFElement) – The element to remove.
Returns:A new list without the element.
Return type:ElementList
remove_elements(*elements) → ElementList

Explicitly removes the elements from the ElementList.

Note

If the elements are not in the ElementList, this does nothing.

Parameters:*elements (PDFElement) – The elements to remove.
Returns:A new list without the elements.
Return type:ElementList
to_the_left_of(element: PDFElement, inclusive: bool = False, tolerance: float = 0.0) → ElementList

Filter for elements which are to the left of the given element.

If you draw a box from the left hand edge of the element to the left hand side of the page, all elements which are partially within this box are returned.

Note

By “to the left of” we really mean “directly to the left of”, i.e. the returned elements all have at least some part which is vertically aligned with the specified element.

Note

Technically the element you specify will satisfy the condition, but we assume you do not want that element returned. If you do, you can pass inclusive=True.

Parameters:
  • element (PDFElement) – The element in question.
  • inclusive (bool, optional) – Whether to include the element in the returned results. Default: False.
  • tolerance (float, optional) – To be counted as to the left, the elements must overlap by at least tolerance on the Y axis. Tolerance is capped at half the height of the element. Default: 0.
Returns:

The filtered list.

Return type:

ElementList

to_the_right_of(element: PDFElement, inclusive: bool = False, tolerance: float = 0.0) → ElementList

Filter for elements which are to the right of the given element.

If you draw a box from the right hand edge of the element to the right hand side of the page, all elements which are partially within this box are returned.

Note

By “to the right of” we really mean “directly to the right of”, i.e. the returned elements all have at least some part which is vertically aligned with the specified element.

Note

Technically the element you specify will satisfy the condition, but we assume you do not want that element returned. If you do, you can pass inclusive=True.

Parameters:
  • element (PDFElement) – The element in question.
  • inclusive (bool, optional) – Whether to include the element in the returned results. Default: False.
  • tolerance (float, optional) – To be counted as to the right, the elements must overlap by at least tolerance on the Y axis. Tolerance is capped at half the height of the element. Default: 0.
Returns:

The filtered list.

Return type:

ElementList

vertically_in_line_with(element: PDFElement, inclusive: bool = False, all_pages: bool = False, tolerance: float = 0.0) → ElementList

Returns all elements which are vertically in line with the given element.

If you extend the left and right edges of the element to the top and bottom of the page, all elements which are partially within this box are returned. By default, only elements on the same page as the given element are included, but you can pass all_pages=True to include all pages.

This is equivalent to doing foo.above(…) | foo.below(…).

Note

Technically the element you specify will satisfy the condition, but we assume you do not want that element returned. If you do, you can pass inclusive=True.

Parameters:
  • element (PDFElement) – The element in question.
  • inclusive (bool, optional) – Whether to include the element in the returned results. Default: False.
  • all_pages (bool, optional) – Whether to include pages other than the page which the element is on. Default: False.
  • tolerance (float, optional) – To be counted as in line with, the elements must overlap by at least tolerance on the X axis. Tolerance is capped at half the width of the element. Default: 0.
Returns:

The filtered list.

Return type:

ElementList