Element Ordering

In this example, we see how to specify a custom ordering for the elements.

For this we will use a simple pdf, which has a single element in each corner of the page. You can download the example here.

Default

The default element ordering is left to right, top to bottom.

from py_pdf_parser.loaders import load_file

file_path = "grid.pdf"

# Default - left to right, top to bottom
document = load_file(file_path)
print([element.text() for element in document.elements])

This results in

['Top Left', 'Top Right', 'Bottom Left', 'Bottom Right']

Presets

There are also preset orderings for right to left, top to bottom, top to bottom, left to right, and top to bottom, right to left. You can use these by importing the ElementOrdering class from py_pdf_parser.components and passing these as the element_ordering argument to PDFDocument. Note that keyword arguments to load() and load_file() get passed through to the PDFDocument.

from py_pdf_parser.loaders import load_file
from py_pdf_parser.components import ElementOrdering

# Preset - right to left, top to bottom
document = load_file(
    file_path, element_ordering=ElementOrdering.RIGHT_TO_LEFT_TOP_TO_BOTTOM
)
print([element.text() for element in document.elements])

# Preset - top to bottom, left to right
document = load_file(
    file_path, element_ordering=ElementOrdering.TOP_TO_BOTTOM_LEFT_TO_RIGHT
)
print([element.text() for element in document.elements])

# Preset - top to bottom, right to left
document = load_file(
    file_path, element_ordering=ElementOrdering.TOP_TO_BOTTOM_RIGHT_TO_LEFT
)
print([element.text() for element in document.elements])

which results in

['Top Right', 'Top Left', 'Bottom Right', 'Bottom Left']
['Bottom Left', 'Top Left', 'Bottom Right', 'Top Right']
['Top Right', 'Bottom Right', 'Top Left', 'Bottom Left']

Custom Ordering

If none of the presets give an ordering you are looking for, you can also pass a callable as the element_ordering argument of PDFDocument. This callable will be given a list of elements for each page, and should return a list of the same elements, in the desired order.

Important

The elements which get passed to your function will be PDFMiner.six elements, and NOT class PDFElement. You can access the x0, x1, y0, y1 directly, and extract the text using get_text(). Other options are available: please familiarise yourself with the PDFMiner.six documentation.

Note

Your function will be called multiple times, once for each page of the document. Elements will always be considered in order of increasing page number, your function only controls the ordering within each page.

For example, if we wanted to implement an ordering which is bottom to top, left to right then we can do this as follows:

from py_pdf_parser.loaders import load_file

# Custom - bottom to top, left to right
def ordering_function(elements):
    """
    Note: Elements will be PDFMiner.six elements. The x axis is positive as you go left
    to right, and the y axis is positive as you go bottom to top, and hence we can
    simply sort according to this.
    """
    return sorted(elements, key=lambda elem: (elem.x0, elem.y0))


document = load_file(file_path, element_ordering=ordering_function)
print([element.text() for element in document.elements])

which results in

['Bottom Left', 'Top Left', 'Bottom Right', 'Top Right']

Multiple Columns

Finally, suppose our PDF has multiple columns, like this example.

If we don’t specify an element_ordering, the elements will be extracted in the following order:

['Column 1 Title', 'Column 2 Title', 'Here is some column 1 text.', 'Here is some column 2 text.', 'Col 1 left', 'Col 1 right', 'Col 2 left', 'Col 2 right']

If we visualise this document (see the Simple Memo example if you don’t know how to do this), then we can see that the column divider is at an x value of about 300. Using this information, we can specify a custom ordering function which will order the elements left to right, top to bottom, but in each column individually.

from py_pdf_parser.loaders import load_file

document = load_file("columns.pdf")

def column_ordering_function(elements):
    """
    The first entry in the key is False for colum 1, and Tru for column 2. The second
    and third keys just give left to right, top to bottom.
    """
    return sorted(elements, key=lambda elem: (elem.x0 > 300, -elem.y0, elem.x0))


document = load_file(file_path, element_ordering=column_ordering_function)
print([element.text() for element in document.elements])

which returns the elements in the correct order:

['Column 1 Title', 'Here is some column 1 text.', 'Col 1 left', 'Col 1 right', 'Column 2 Title', 'Here is some column 2 text.', 'Col 2 left', 'Col 2 right']