Extracting Text From Figures

PDFs are structured documents and can contain figures. By default, PDFMiner.six, and hence py-pdf-parser, does not extract text from figures.

You can download an example here. In the example, there is a figure which contains a red square and some text. Below the figure there is some more text.

By default, the text in the figure will not be included:

from py_pdf_parser.loaders import load_file
document = load_file("figure.pdf")
print([element.text() for element in document.elements])

which results in:

["Here is some text outside of an image"]

To include the text inside the figure, we must set the all_texts layout parameter to True. This parameter is described in the PDFMiner.six documentation, here.

Layout parameters can be passed to both load() and load_file() as a dictionary, using the la_params argument.
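The same dictionary works with load(), which takes an open file object rather than a path. The following is a minimal sketch, not part of the library: the helper names with_all_texts and load_including_figures are hypothetical, and line_margin appears in the usage note only as an example of another PDFMiner.six layout parameter you may already be setting.

```python
def with_all_texts(la_params=None):
    """Return a copy of la_params with figure text extraction switched on."""
    # Copy so the caller's dictionary is left unchanged.
    merged = dict(la_params or {})
    merged["all_texts"] = True
    return merged


def load_including_figures(path, la_params=None):
    """Load a PDF with load(), including text found inside figures.

    Hypothetical helper: it only wraps the load() call described above.
    """
    # Imported lazily so with_all_texts() can be used on its own.
    from py_pdf_parser.loaders import load

    # load() expects a file object opened in binary mode.
    with open(path, "rb") as pdf_file:
        return load(pdf_file, la_params=with_all_texts(la_params))
```

For instance, with_all_texts({"line_margin": 0.5}) keeps the existing line_margin setting and adds all_texts, so other layout tuning is preserved when figure text is enabled.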

In our case:

from py_pdf_parser.loaders import load_file
document = load_file("figure.pdf", la_params={"all_texts": True})
print([element.text() for element in document.elements])

which results in:

["This is some text in an image", "Here is some text outside of an image"]
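If you want to know which text came from figures, one option is to compare the two results. This is a rough sketch rather than a library feature: figure_only_text is a hypothetical helper, and it assumes that no piece of text appears both inside and outside a figure, since such text would be treated as non-figure text.

```python
def figure_only_text(default_texts, all_texts):
    """Return the texts that only appear when all_texts is enabled."""
    outside = set(default_texts)
    # Preserve the order in which the elements were returned.
    return [text for text in all_texts if text not in outside]


# The two results shown above, reused here as sample data.
without_figures = ["Here is some text outside of an image"]
with_figures = [
    "This is some text in an image",
    "Here is some text outside of an image",
]

print(figure_only_text(without_figures, with_figures))
```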