Tables

py_pdf_parser.tables.add_header_to_table(table: List[List[str]], header: Optional[List[str]] = None) → List[Dict[str, str]]

Given a table (list of lists) of strings, returns a list of dicts mapping the table header to the values.

The input table is a list of rows, each of which is a list of strings. The result is a new table in which each row is a dictionary mapping the header values to that row's values.

Parameters:
  • table – The table (a list of lists of strings).
  • header (list, optional) – The header to use. If not provided, the first row of the table will be used instead. Your header must be the same width as your table, and cannot contain the same entry multiple times.
Raises:

InvalidTableHeaderError – If the width of the header does not match the width of the table, or if the header contains duplicate entries.

Returns:

A list of dictionaries, where each entry in the list is a row in the table, and a row in the table is represented as a dictionary mapping the header to the values.

Return type:

list[dict]
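As a rough illustration of the behaviour described above, the following stand-alone sketch mimics it in plain Python. It is not the library's implementation: add_header_sketch is a hypothetical name, it raises ValueError where the real function raises InvalidTableHeaderError, and it assumes that a header taken from the first row is not also returned as a data row.

```python
from typing import Dict, List, Optional


def add_header_sketch(
    table: List[List[str]], header: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Hypothetical stand-in mirroring the documented behaviour."""
    if header is None:
        # Assumption: the first row becomes the header and is dropped from the data.
        header, rows = table[0], table[1:]
    else:
        rows = table
    # The header must be exactly as wide as every row of the table.
    if any(len(row) != len(header) for row in rows):
        raise ValueError("header width does not match table width")
    # Duplicate header entries would silently overwrite dictionary keys.
    if len(set(header)) != len(header):
        raise ValueError("header contains duplicate entries")
    return [dict(zip(header, row)) for row in rows]


table = [["Name", "Qty"], ["Apple", "3"], ["Pear", "5"]]
rows = add_header_sketch(table)
# rows[0]["Name"] is "Apple" and rows[1]["Qty"] is "5"
```

Rejecting duplicate header entries matters because a dictionary can hold each key only once, so a repeated column name would lose data.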

py_pdf_parser.tables.extract_simple_table(elements: ElementList, as_text: bool = False, strip_text: bool = True, allow_gaps: bool = False, reference_element: Optional[PDFElement] = None, tolerance: float = 0.0, remove_duplicate_header_rows: bool = False) → List[List[T]]

Returns elements structured as a table.

Given an ElementList, tries to extract a structured table by examining which elements are aligned.

To use this function, there must be at least one full row and one full column (which we call the reference row and column), i.e. the reference row must have an element in every column, and the reference column must have an element in every row. You specify the reference row and column by passing the one element that lies in both of them. By default, this is the top left element, which means the first row and first column are used as the references. Note that if you need to change the reference_element, your table has gaps, so you will also need to pass allow_gaps=True.

Important: This function uses the elements in the reference row and column to scan horizontally and vertically to find the rest of the table. If there are gaps in your reference row and column, this could result in rows and columns being missed by this function.

There must be a clear gap between each row and between each column which contains no elements, and a single cell cannot contain multiple elements.

If there are no valid reference rows or columns, try extract_table() instead. If you have elements spanning multiple rows or columns, it may be possible to fix this by using extract_table(). If you fail to satisfy any of the other conditions listed above, that case is not yet supported.
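The reference row and column idea can be illustrated with a toy model. The sketch below is not the library's implementation: Box and table_from_reference are hypothetical names, elements are reduced to bounding boxes with y increasing upwards, and tolerance is ignored (any positive overlap counts as alignment). It shows both a successful extraction and how a gap in the reference column causes a row to be missed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for an element's bounding box (y grows upwards)."""
    text: str
    x0: float
    x1: float
    y0: float
    y1: float


def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the overlap between [a0, a1] and [b0, b1] (negative if disjoint)."""
    return min(a1, b1) - max(a0, b0)


def same_row(a: Box, b: Box) -> bool:
    return overlap(a.y0, a.y1, b.y0, b.y1) > 0


def same_col(a: Box, b: Box) -> bool:
    return overlap(a.x0, a.x1, b.x0, b.x1) > 0


def table_from_reference(boxes: List[Box], ref: Box) -> List[List[Optional[str]]]:
    # Reference row: every box level with ref, ordered left to right.
    ref_row = sorted((box for box in boxes if same_row(box, ref)), key=lambda box: box.x0)
    # Reference column: every box in line with ref, ordered top to bottom.
    ref_col = sorted((box for box in boxes if same_col(box, ref)), key=lambda box: -box.y1)
    table: List[List[Optional[str]]] = []
    for row_box in ref_col:
        row: List[Optional[str]] = []
        for col_box in ref_row:
            # A cell is the box aligned with both this row and this column, if any.
            cell = next(
                (box for box in boxes if same_row(box, row_box) and same_col(box, col_box)),
                None,
            )
            row.append(cell.text if cell is not None else None)
        table.append(row)
    return table


# Three rows, two columns, with an empty cell in the second row, second column.
a = Box("A", 0, 10, 90, 100)
b = Box("B", 20, 30, 90, 100)
c = Box("C", 0, 10, 70, 80)
e = Box("E", 0, 10, 50, 60)
f = Box("F", 20, 30, 50, 60)
boxes = [a, b, c, e, f]

full = table_from_reference(boxes, ref=a)    # first column is complete
missed = table_from_reference(boxes, ref=b)  # second column has a gap
```

With a as the reference, all three rows are found and the empty cell appears as None. With b as the reference, the scan down the second column never sees the middle row, so the row containing C is silently dropped.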

Parameters:
  • elements (ElementList) – A list of elements to extract into a table.
  • as_text (bool, optional) – Whether to extract the text from each element instead of the PDFElement itself. Default: False.
  • strip_text (bool, optional) – Whether to strip the text for each element of the table (Only relevant if as_text is True). Default: True.
  • allow_gaps (bool, optional) – Whether to allow empty spaces in the table. Default: False.
  • reference_element (PDFElement, optional) – An element in a full row and a full column. Will be used to specify the reference row and column. If None, the top left element will be used, meaning the top row and left column will be used. If there are gaps in these, you should specify a different reference. Default: None.
  • tolerance (float, optional) – For elements to be counted as in the same row or column, they must overlap by at least tolerance. Default: 0.0.
  • remove_duplicate_header_rows (bool, optional) – Remove duplicates of the header row (the first row) if they exist. Default: False.
Raises:

TableExtractionError – If something goes wrong.

Returns:

A list of rows, which are lists of PDFElements or strings (depending on the value of as_text).

Return type:

list[list]

py_pdf_parser.tables.extract_table(elements: ElementList, as_text: bool = False, strip_text: bool = True, fix_element_in_multiple_rows: bool = False, fix_element_in_multiple_cols: bool = False, tolerance: float = 0.0, remove_duplicate_header_rows: bool = False) → List[List[T]]

Returns elements structured as a table.

Given an ElementList, tries to extract a structured table by examining which elements are aligned. There must be a clear gap between each row and between each column which contains no elements, and a single cell cannot contain multiple elements.

If your table fails to satisfy any of the conditions listed above, that case is not yet supported.

Note: If you satisfy the conditions to use extract_simple_table, then that should be used instead, as it’s much more efficient.

Parameters:
  • elements (ElementList) – A list of elements to extract into a table.
  • as_text (bool, optional) – Whether to extract the text from each element instead of the PDFElement itself. Default: False.
  • strip_text (bool, optional) – Whether to strip the text for each element of the table (Only relevant if as_text is True). Default: True.
  • fix_element_in_multiple_rows (bool, optional) – If a table element is in line with elements in multiple rows, a TableExtractionError will be raised unless this argument is set to True. When True, any elements detected in multiple rows will be placed into the first row. This is only recommended if you expect this to be the case in your table. Default: False.
  • fix_element_in_multiple_cols (bool, optional) – If a table element is in line with elements in multiple cols, a TableExtractionError will be raised unless this argument is set to True. When True, any elements detected in multiple cols will be placed into the first col. This is only recommended if you expect this to be the case in your table. Default: False.
  • tolerance (float, optional) – For elements to be counted as in the same row or column, they must overlap by at least tolerance. Default: 0.0.
  • remove_duplicate_header_rows (bool, optional) – Remove duplicates of the header row (the first row) if they exist. Default: False.
Raises:

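TableExtractionError – If something goes wrong, for example if an element is in line with multiple rows or columns and the corresponding fix_element_in_multiple_rows or fix_element_in_multiple_cols argument is not set to True.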

Returns:

A list of rows, which are lists of PDFElements or strings (depending on the value of as_text).

Return type:

list[list]
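The remove_duplicate_header_rows option is described as dropping repeats of the first row. On a table of strings, that idea can be sketched as follows. drop_repeated_header is a hypothetical helper, not part of the library, and comparing rows by their text is an assumption about how duplicates are detected.

```python
from typing import List


def drop_repeated_header(table: List[List[str]]) -> List[List[str]]:
    """Keep the first row and drop any later row identical to it."""
    if not table:
        return []
    header = table[0]
    return [header] + [row for row in table[1:] if row != header]


# A table in which the header row appears a second time part-way through.
table = [
    ["Item", "Price"],
    ["Tea", "2.50"],
    ["Item", "Price"],
    ["Cake", "4.00"],
]
cleaned = drop_repeated_header(table)
```

Only exact repeats of the first row are removed here; data rows that merely share some values with the header are kept.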

py_pdf_parser.tables.get_text_from_table(table: List[List[Optional[PDFElement]]], strip_text: bool = True) → List[List[str]]

Given a table (of PDFElements or None), returns a table (of element.text() or '').

Parameters:
  • table – The table (a list of lists of PDFElements or None).
  • strip_text (bool, optional) – Whether to strip the text for each element of the table. Default: True.
Returns:

a list of rows, which are lists of strings.

Return type:

list[list[str]]
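The conversion described above can be sketched without a PDF by using a stand-in element class. FakeElement and text_table_sketch are hypothetical names used only for illustration; the sketch assumes nothing about PDFElement beyond it having a text() method, and applies the stripping itself.

```python
from typing import List, Optional


class FakeElement:
    """Hypothetical stand-in for PDFElement; only text() is modelled."""

    def __init__(self, raw: str) -> None:
        self._raw = raw

    def text(self) -> str:
        return self._raw


def text_table_sketch(
    table: List[List[Optional[FakeElement]]], strip_text: bool = True
) -> List[List[str]]:
    result: List[List[str]] = []
    for row in table:
        out: List[str] = []
        for cell in row:
            # Empty cells (None) become empty strings.
            text = cell.text() if cell is not None else ""
            out.append(text.strip() if strip_text else text)
        result.append(out)
    return result


table = [
    [FakeElement(" Name "), FakeElement("Qty")],
    [FakeElement("Apple"), None],
]
stripped = text_table_sketch(table)
raw = text_table_sketch(table, strip_text=False)
```

Mapping None to an empty string keeps every row the same width, which is what a subsequent call such as add_header_to_table requires.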