In this tutorial, we’ll walk you through how to extract hyperlinks from a PDF document using Python. PDFs often contain embedded hyperlinks that may be useful to extract for tasks such as data collection, web scraping, or content analysis.

Prerequisites

To follow along with this tutorial, you will need:

  • Python 3.x installed on your machine.
  • The pdfplumber library, which extracts data (text, tables, annotations) from PDFs.

You can install it using pip:

pip install pdfplumber

Step 1: Import Required Libraries

import pdfplumber

Step 2: Open the PDF File

The function below opens the PDF and collects the URI of every link annotation on each page:

def extract_hyperlinks_from_pdf(file_path):
    """Return the URIs of all link annotations in the PDF, in page order."""
    hyperlinks = []

    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # page.hyperlinks lists the page's link annotations that
            # carry a URI; each entry is a dict with a 'uri' key.
            for link in page.hyperlinks:
                uri = link.get('uri')
                if uri:
                    hyperlinks.append(uri)

    return hyperlinks

Explanation:

1. pdfplumber opens and reads the PDF file.

2. We loop through each page and read its annotations, which is where a PDF stores clickable links.

3. If an annotation carries a URI, we append it to the hyperlinks list. Links that only jump to another place in the same document have no URI and are skipped.
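
The function above returns only the URIs. If you also need to know which page each link sits on, a minimal sketch along the following lines may help. It assumes each page object exposes a hyperlinks attribute holding a list of dicts with a 'uri' key, as pdfplumber pages do; the name links_with_pages is hypothetical and not part of any library.

```python
def links_with_pages(pages):
    """Return (page_number, uri) pairs for every link on the given pages.

    `pages` is any iterable of objects with a `hyperlinks` attribute,
    assumed to be a list of dicts with a 'uri' key (as on pdfplumber pages).
    """
    results = []
    # Number pages from 1 to match what PDF viewers display.
    for page_number, page in enumerate(pages, start=1):
        for link in page.hyperlinks:
            uri = link.get("uri")
            if uri:  # skip annotations that carry no URI
                results.append((page_number, uri))
    return results
```

With pdfplumber, this would be called as links_with_pages(pdf.pages) inside the "with pdfplumber.open(file_path) as pdf:" block, while the file is still open.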

Step 3: Run the Function

Call the function with the path to your PDF file, then print the links it returns:

file_path = 'example.pdf'
links = extract_hyperlinks_from_pdf(file_path)

print("Extracted Links:")
for link in links:
    print(link)
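
A PDF often repeats the same link on several pages, and some links use schemes such as mailto: rather than pointing at a web page. If that matters for your use case, one possible clean-up step is sketched below. It uses only the standard library; the helper name unique_web_links is hypothetical, and keeping only http and https links is an assumption you may want to change.

```python
from urllib.parse import urlparse


def unique_web_links(links):
    """Drop duplicates (keeping first-seen order) and keep only http(s) URLs."""
    seen = set()
    result = []
    for link in links:
        # Skip mailto:, file:, and other non-web schemes.
        if urlparse(link).scheme not in ("http", "https"):
            continue
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

You could then write links = unique_web_links(links) before the print loop.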

Output:

The script prints each hyperlink found in the PDF on its own line, for example:

Extracted Links:
https://www.example.com
https://www.another-example.com

Full Script:

import pdfplumber

def extract_hyperlinks_from_pdf(file_path):
    hyperlinks = []

    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # page.hyperlinks holds the link annotations that carry a URI
            for link in page.hyperlinks:
                uri = link.get('uri')
                if uri:
                    hyperlinks.append(uri)

    return hyperlinks


# Usage
file_path = 'example.pdf'
links = extract_hyperlinks_from_pdf(file_path)

print("Extracted Links:")
for link in links:
    print(link)

Conclusion

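With the above code, you can extract the hyperlinks stored as link annotations in a PDF file using Python. URLs that appear only as plain text, without a clickable link, are not annotations and will not be picked up. The extracted links can be useful for web scraping, content analysis, or simply collecting URLs for further research.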
