In this tutorial, we’ll walk you through how to extract hyperlinks from a PDF document using Python. PDFs often contain embedded hyperlinks that may be useful to extract for tasks such as data collection, web scraping, or content analysis.
Prerequisites
To follow along with this tutorial, you will need:
- Python 3.x installed on your machine.
- The following Python libraries:
  - PyPDF2: A library for reading and manipulating PDF files.
  - pdfplumber: A library for extracting data (text, images, tables) from PDFs.
You can install these libraries using pip:
pip install PyPDF2 pdfplumber
Step 1: Import Required Libraries
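import pdfplumber  # the library the extraction function relies on
import PyPDF2  # optional: not used by the function in this tutorial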
Step 2: Open the PDF File
To begin, open the PDF file from which you want to extract the hyperlinks:
def extract_hyperlinks_from_pdf(file_path):
    """Return the URI of every link annotation in the PDF, in page order."""
    hyperlinks = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # page.annots lists the page's annotations (where links are typically stored)
            for annotation in page.annots:
                # Each annotation is a dict; 'uri' is None when it is not a web link
                uri = annotation.get('uri')
                if uri:
                    hyperlinks.append(uri)
    return hyperlinks
Explanation:
1. pdfplumber is used to open and read the PDF file.
2. We loop through each page in the PDF and read its annotations from page.annots (where links are typically stored).
3. If an annotation has a non-empty 'uri' value, we append it to our hyperlinks list.
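The function above returns a flat list, so it does not record where each link appears. A minimal sketch of a variant that groups links by page is shown below. The helper name collect_links_by_page is hypothetical, and it assumes only that each page object exposes an annots list of dictionaries with a 'uri' key, as pdfplumber pages do. The stand-in pages built with SimpleNamespace are illustrative data so the example runs without a PDF file.

```python
from types import SimpleNamespace


def collect_links_by_page(pages):
    """Map 1-based page numbers to the list of link URIs found on that page."""
    links_by_page = {}
    for page_number, page in enumerate(pages, start=1):
        # Keep only annotations that actually carry a URI.
        uris = [a["uri"] for a in page.annots if a.get("uri")]
        if uris:
            links_by_page[page_number] = uris
    return links_by_page


# Stand-in pages (hypothetical data) so the sketch runs without a PDF file.
fake_pages = [
    SimpleNamespace(annots=[{"uri": "https://www.example.com"}, {"uri": None}]),
    SimpleNamespace(annots=[]),
    SimpleNamespace(annots=[{"uri": "https://www.another-example.com"}]),
]
links_by_page = collect_links_by_page(fake_pages)
```

With a real file, the same helper could be called as collect_links_by_page(pdf.pages) inside a "with pdfplumber.open(file_path) as pdf:" block.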
Step 3: Run the Function
You can use the function to extract hyperlinks from a PDF by providing the file path of the PDF.
file_path = 'example.pdf'
links = extract_hyperlinks_from_pdf(file_path)
print("Extracted Links:")
for link in links:
    print(link)
Output:
The script prints each hyperlink found in the PDF. For a file that contains two links, the output would look like this:
Extracted Links:
https://www.example.com
https://www.another-example.com
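A PDF often repeats the same link on several pages, and some link annotations point at non-web targets such as mailto: addresses. The sketch below shows one possible way to tidy the extracted list before using it for data collection. The function name clean_links and its allowed_schemes parameter are assumptions made for illustration, not part of pdfplumber, and the sample list is made-up data.

```python
from urllib.parse import urlparse


def clean_links(links, allowed_schemes=("http", "https")):
    """Drop duplicates and non-web links while keeping first-seen order."""
    seen = set()
    cleaned = []
    for link in links:
        link = link.strip()
        if urlparse(link).scheme not in allowed_schemes:
            continue  # skip e.g. mailto: links
        if link not in seen:
            seen.add(link)
            cleaned.append(link)
    return cleaned


# Hypothetical sample standing in for the list returned by the extraction function.
sample_links = [
    "https://www.example.com",
    "mailto:someone@example.com",
    "https://www.example.com",
    "https://www.another-example.com",
]
cleaned = clean_links(sample_links)
```

A list and a set are used together here so that the original order of the links is preserved, which a plain set() conversion would not guarantee.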
Full Script:
import pdfplumber


def extract_hyperlinks_from_pdf(file_path):
    """Return the URI of every link annotation in the PDF, in page order."""
    hyperlinks = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            # page.annots lists the page's annotations (where links are typically stored)
            for annotation in page.annots:
                uri = annotation.get('uri')
                if uri:
                    hyperlinks.append(uri)
    return hyperlinks


# Usage
file_path = 'example.pdf'
links = extract_hyperlinks_from_pdf(file_path)
print("Extracted Links:")
for link in links:
    print(link)
Conclusion
With the above code, you can easily extract hyperlinks from PDF files using Python. This can be useful for web scraping, content analysis, or simply collecting URLs for further research.
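As a starting point for the content analysis mentioned above, the following sketch counts how many extracted links point at each host. The helper name count_links_by_domain is hypothetical and the sample URLs are made-up data. It uses only the standard library, so it works on any list of URL strings.

```python
from collections import Counter
from urllib.parse import urlparse


def count_links_by_domain(links):
    """Count how many links point at each host name."""
    hosts = (urlparse(link).hostname for link in links)
    # Links without a host (e.g. mailto:) yield None and are skipped.
    return Counter(host for host in hosts if host)


# Hypothetical sample standing in for links extracted from a PDF.
sample_links = [
    "https://www.example.com/a",
    "https://www.example.com/b",
    "https://www.another-example.com",
    "mailto:someone@example.com",
]
domain_counts = count_links_by_domain(sample_links)
```

Calling domain_counts.most_common() then lists the hosts from most to least frequently linked.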