Extracting Text From Figures
PDFs are structured documents and can contain figures. By default, PDFMiner.six, and hence py-pdf-parser, does not extract text from figures.
You can download an example here. In the example, there is a figure which contains a red square and some text. Below the figure there is some more text.
By default, the text in the figure will not be included:
from py_pdf_parser.loaders import load_file
document = load_file("figure.pdf")
print([element.text() for element in document.elements])
which results in:
['Here is some text outside of an image']
To include the text inside the figure, we must pass the all_texts layout parameter. This is documented in the PDFMiner.six documentation, here.
The layout parameters can be passed to both load() and load_file() as a dictionary to the la_params argument.
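As a minimal sketch of the load() variant, which takes an already opened file object rather than a path: the helper names figure_la_params and load_with_figures below are hypothetical and not part of py-pdf-parser, and line_margin is used only as an example of another PDFMiner.six layout parameter.

```python
def figure_la_params(**overrides):
    """Hypothetical helper: build an la_params dict that enables figure text.

    Any extra PDFMiner.six layout parameters can be passed as keyword arguments.
    """
    return {"all_texts": True, **overrides}


def load_with_figures(pdf_path, **overrides):
    """Hypothetical helper: load a PDF via load() with figure text included."""
    # Imported inside the function so the sketch can be read and the dict
    # helper used even where py-pdf-parser is not installed.
    from py_pdf_parser.loaders import load

    # load() expects a file object opened in binary mode.
    with open(pdf_path, "rb") as pdf_file:
        return load(pdf_file, la_params=figure_la_params(**overrides))


print(figure_la_params(line_margin=0.3))
# → {'all_texts': True, 'line_margin': 0.3}
```

Calling load_with_figures("figure.pdf") should then behave like the load_file() call with the same la_params, assuming the example PDF is in the working directory.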
In our case:
from py_pdf_parser.loaders import load_file
document = load_file("figure.pdf", la_params={"all_texts": True})
print([element.text() for element in document.elements])
which results in:
['This is some text in an image', 'Here is some text outside of an image']
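If you need to know which pieces of text came from figures, one possible approach is to compare the element texts from the two loads. The sketch below is plain Python and does not use any py-pdf-parser API; the function name texts_only_in_figures is a hypothetical helper, and the two input lists are copied from the outputs shown above.

```python
from collections import Counter


def texts_only_in_figures(texts_default, texts_with_figures):
    """Return the texts that only appear when all_texts is enabled.

    Order is preserved, and repeated texts are matched one-for-one so that
    a string occurring both inside and outside a figure is still reported.
    """
    remaining = Counter(texts_default)
    figure_texts = []
    for text in texts_with_figures:
        if remaining[text] > 0:
            # This occurrence is accounted for by the default extraction.
            remaining[text] -= 1
        else:
            figure_texts.append(text)
    return figure_texts


# Element texts from the default load and from the all_texts load.
default_texts = ["Here is some text outside of an image"]
with_figures = [
    "This is some text in an image",
    "Here is some text outside of an image",
]

print(texts_only_in_figures(default_texts, with_figures))
# → ['This is some text in an image']
```

In practice, the two lists would come from [element.text() for element in document.elements] on documents loaded without and with la_params={"all_texts": True}. This comparison assumes that enabling all_texts does not change how the text outside figures is grouped into elements, which may not hold for every PDF.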