How to Extract Hyperlinks from a PDF in Python

In this tutorial, we’ll walk through how to extract hyperlinks from a PDF document using Python. PDFs often contain embedded hyperlinks, and pulling them out is useful for tasks such as data collection, web scraping, or content analysis.

Prerequisites

To follow along with this tutorial, you will need:

Python 3.x installed on your machine.
The following Python libraries:
- PyPDF2: A library for reading and manipulating PDF files. The full script at the end of this tutorial relies only on pdfplumber, so PyPDF2 is optional.
- pdfplumber: A library for extracting data (text, images, tables) from PDFs.

You can install these libraries using pip:

pip install PyPDF2 pdfplumber

Step 1: Import Required Libraries

# pdfplumber is the only library the extraction function needs
import pdfplumber

Step 2: Open the PDF File

Next, define a function that opens the PDF file and collects the hyperlinks from each page:

def extract_hyperlinks_from_pdf(file_path):
    """Return the URIs of all link annotations in the PDF."""
    hyperlinks = []

    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # page.hyperlinks lists the page's link annotations that have a URI
            # (page.objects only holds layout objects such as chars and lines)
            for annotation in page.hyperlinks:
                uri = annotation.get('uri')
                if uri:
                    hyperlinks.append(uri)

    return hyperlinks

Explanation:

1. pdfplumber is used to open and read the PDF file.

2. We loop through each page in the PDF and read its link annotations, which is where a PDF stores its clickable links.

3. If an annotation carries a URI, we append it to our hyperlinks list.
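
The steps above can be sketched in a slightly richer form that also records the page each link came from. This is a minimal sketch, assuming pdfplumber exposes each page's link annotations through page.hyperlinks as dictionaries with a 'uri' key. The helper names collect_uris and extract_links_with_pages are hypothetical and exist only for illustration.

```python
def collect_uris(annotations, page_number):
    """Return (page_number, uri) pairs for annotations that carry a URI.

    `annotations` is assumed to be a list of dicts shaped like the entries
    of pdfplumber's page.hyperlinks (each with a 'uri' key).
    """
    pairs = []
    for annotation in annotations:
        uri = annotation.get('uri')
        if uri:  # defensive check in case the value is missing or empty
            pairs.append((page_number, uri))
    return pairs


def extract_links_with_pages(file_path):
    """Open a PDF and return (page_number, uri) pairs for every link."""
    import pdfplumber  # imported here so collect_uris works without it

    results = []
    with pdfplumber.open(file_path) as pdf:
        # Number pages from 1 to match what a PDF viewer shows
        for page_number, page in enumerate(pdf.pages, start=1):
            results.extend(collect_uris(page.hyperlinks, page_number))
    return results
```

Keeping the filtering logic in its own function means it can be checked against plain dictionaries without a PDF on hand.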

Step 3: Run the Function

Call the function with the path to your PDF to extract its hyperlinks.

file_path = 'example.pdf'
links = extract_hyperlinks_from_pdf(file_path)

print("Extracted Links:")
for link in links:
    print(link)

Output:

The script prints each hyperlink found in the PDF, one per line. For example:

Extracted Links:
https://www.example.com
https://www.another-example.com
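
In practice the same URL often appears several times in a PDF, and some links point to mailto: or other non-web targets. The sketch below shows one way to tidy the extracted list before using it for data collection. The function name clean_links is hypothetical, and the bytes handling is an assumption about what a PDF library might return rather than documented behaviour.

```python
from urllib.parse import urlparse


def clean_links(links):
    """Drop duplicates and non-web links while keeping first-seen order."""
    seen = set()
    cleaned = []
    for link in links:
        # Assumption: a URI may arrive as bytes, so decode it defensively
        if isinstance(link, bytes):
            link = link.decode('utf-8', errors='replace')
        link = link.strip()
        if urlparse(link).scheme not in ('http', 'https'):
            continue  # skips mailto:, file:, and similar targets
        if link not in seen:
            seen.add(link)
            cleaned.append(link)
    return cleaned
```

Passing the result of extract_hyperlinks_from_pdf through clean_links leaves each web URL once, in the order it first appeared.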

Full Script:

import pdfplumber

def extract_hyperlinks_from_pdf(file_path):
    """Return the URIs of all link annotations in the PDF."""
    hyperlinks = []

    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # page.hyperlinks lists the page's link annotations that have a URI
            for annotation in page.hyperlinks:
                uri = annotation.get('uri')
                if uri:
                    hyperlinks.append(uri)

    return hyperlinks


# Usage
file_path = 'example.pdf'
links = extract_hyperlinks_from_pdf(file_path)

print("Extracted Links:")
for link in links:
    print(link)
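
The script above only finds links stored as annotations. Some PDFs show a URL as plain text with no clickable annotation behind it, so a text scan can serve as a fallback. The following is a rough sketch under that assumption: the pattern is deliberately simple, the helper names find_urls_in_text and extract_text_urls_from_pdf are hypothetical, and the code assumes page.extract_text() may return None or an empty string for pages without text.

```python
import re

# Deliberately simple pattern: a scheme followed by non-space characters
URL_PATTERN = re.compile(r'https?://[^\s<>"]+')


def find_urls_in_text(text):
    """Return URLs that appear as plain text, without trailing punctuation."""
    if not text:  # nothing to scan on image-only or empty pages
        return []
    return [match.rstrip('.,;:!?)') for match in URL_PATTERN.findall(text)]


def extract_text_urls_from_pdf(file_path):
    """Scan the visible text of each page for URLs."""
    import pdfplumber  # imported here so find_urls_in_text works without it

    urls = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            urls.extend(find_urls_in_text(page.extract_text()))
    return urls
```

A URL that wraps across two lines in the PDF will be cut at the line break, so treat these results as candidates to review rather than a complete list.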

Conclusion

With the above code, you can easily extract hyperlinks from PDF files using Python. This can be useful for web scraping, content analysis, or simply collecting URLs for further research.

By Techgyve Staff
