PDF Document Extraction and Indexing

This example builds a pipeline that extracts text, tables, and figures from a PDF document. It embeds the extracted text, tables, and images and writes them into LanceDB.

The pipeline is hosted on a server endpoint in one of the containers. The endpoint can be called from any Python application.

Start the Server Endpoint

docker compose up

Calling the Endpoint

from indexify import RemoteGraph

graph = RemoteGraph.by_name("Extract_pages_tables_images_pdf")
invocation_id = graph.run(block_until_done=True, url="")
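
The call above leaves the url argument empty, so a real PDF location has to be supplied before the graph can do anything useful. A minimal sketch of a wrapper that guards against an empty URL is shown here. The helper name run_extraction and the example URL are hypothetical and not part of indexify; the only library call used is the graph.run(block_until_done=True, url=...) call shown above.

```python
from typing import Any


def run_extraction(graph: Any, pdf_url: str) -> Any:
    """Run the graph on one PDF and return the invocation id.

    `graph` is expected to be the RemoteGraph obtained above.
    `run_extraction` is a hypothetical helper, not an indexify API.
    """
    if not pdf_url.strip():
        raise ValueError("pdf_url must point to a PDF document")
    # Same call as in the snippet above: wait until the whole graph has finished.
    return graph.run(block_until_done=True, url=pdf_url)


# Usage, assuming the server started by `docker compose up` is running:
#   from indexify import RemoteGraph
#   graph = RemoteGraph.by_name("Extract_pages_tables_images_pdf")
#   invocation_id = run_extraction(graph, "https://example.com/sample.pdf")  # placeholder URL
```

Passing the graph in as an argument keeps the helper independent of how the RemoteGraph was created.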

Outputs

You can read the output of any function in the Graph by passing the invocation ID and the function name.

chunks = graph.get_output(invocation_id, "extract_chunks")
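
To read several functions' outputs for the same invocation, the call above can be repeated per function name. The sketch below assumes only the graph.get_output(invocation_id, name) call shown above; collect_outputs is a hypothetical helper, and extract_images is used as a second function name because it is mentioned in the GPU section of this README.

```python
from typing import Any, Dict, Iterable


def collect_outputs(graph: Any, invocation_id: str, function_names: Iterable[str]) -> Dict[str, Any]:
    """Fetch the outputs of several graph functions for one invocation.

    `collect_outputs` is a hypothetical helper, not an indexify API.
    """
    outputs: Dict[str, Any] = {}
    for name in function_names:
        # One get_output call per function, exactly as in the snippet above.
        outputs[name] = graph.get_output(invocation_id, name)
    return outputs


# Usage, with `graph` and `invocation_id` from the previous section:
#   results = collect_outputs(graph, invocation_id, ["extract_chunks", "extract_images"])
#   chunks = results["extract_chunks"]
```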

The LanceDB table is populated automatically by the LanceDBWriter class. The database used in the example is named vectordb.lance. It is created in the folder that contains the docker compose file and is bind-mounted into the container so that LanceDB can write to it. You can change the code to write to a different location.
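
One way to check that the pipeline wrote something is to open the database from the host and count the rows in each table. This is a sketch under a few assumptions: the lancedb Python package is installed on the host, the script runs from the folder that contains vectordb.lance, and the table names are discovered at run time because this README does not list them. table_row_counts is a hypothetical helper.

```python
from typing import Any, Dict


def table_row_counts(db: Any) -> Dict[str, int]:
    """Map each table name in a LanceDB connection to its row count.

    `table_row_counts` is a hypothetical helper; it relies only on the
    connection's table_names() and open_table(), and the table's count_rows().
    """
    return {name: db.open_table(name).count_rows() for name in db.table_names()}


# Usage, assuming the lancedb package is installed and the current
# directory contains the vectordb.lance folder written by the pipeline:
#   import lancedb
#   db = lancedb.connect("vectordb.lance")
#   print(table_row_counts(db))
```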

Customization

Copy the folder, modify the code as you like, and upload the new Graph by running workflow.py:

python workflow.py

Using GPU

Using GPUs for PDF parsing requires two changes:

In docker-compose.yaml, uncomment the lines in the pdf-parser-executor block whose comments say they enable GPUs.
Use gpu_image in the PDFParser class and in the extract_chunks and extract_images functions so that the workflow routes PDF parsing to the GPU-enabled image.
