Handling PDF files in Python using PyMuPDF (2023)

In this blog, we will learn how to handle PDF files in Python using PyMuPDF, a library that provides a Pythonic interface to the MuPDF library.

The MuPDF library is a lightweight, high-quality PDF renderer that is written in portable C code. It is designed to be fast and memory-efficient, making it well-suited for use in applications that need to work with large PDF files or process a large number of PDFs in a short amount of time.

With PyMuPDF, we can open and read PDF files, extract text and images, add text and images to PDF files, and perform various other operations on PDF files. In addition to its core functionality, PyMuPDF also provides several convenience features that make it easier to work with PDFs in Python. For example, it includes support for bookmarking, annotations, and form filling, as well as support for password-protected PDFs.

To get started with PyMuPDF, you will need to install the library and its dependencies. This can be done using pip:

pip install pymupdf

Once PyMuPDF is installed, you can begin using it in your Python code.

Here are some of the most commonly used operations.

Open PDF file

To open a PDF file using PyMuPDF, you can use the open function of the fitz module. This function takes the path of the PDF file as an argument and returns a Document object representing the PDF file.

Here is an example of how to open a PDF file using PyMuPDF:

import fitz

# Open the PDF document
doc = fitz.open("document.pdf")

This will open the document.pdf file and return a Document object representing the file. You can then use various methods and properties of the Document object to access and manipulate the contents of the PDF file.

For example, you can use the page_count property of the Document object to get the number of pages in the PDF file, and you can use the indexing operator (e.g., doc[i]) to get a specific page from the file.

You can also use the metadata property of the Document object to get metadata about the PDF file, such as the title, author, and subject.

Extract text from a PDF

To extract text from a PDF file into a list using PyMuPDF, you can use the get_text method of the Page object and append the extracted text to a list.

Here is an example of how to extract all the text from a PDF file and store it in a list:

import fitz

# Open the PDF document
doc = fitz.open("document.pdf")

# Create an empty list to store the text
text_list = []

# Iterate over all the pages in the document
for page in doc:
    # Extract the text from the page
    text = page.get_text()

    # Append the text to the list
    text_list.append(text)

# Print the list
print(text_list)

This will extract all the text from the document.pdf file and store it in the text_list variable. The text from each page will be stored as a separate element in the list.

You can also extract text from a specific page by using the indexing operator (e.g., doc[i].get_text()) to get the desired page and then calling the get_text method on that page.

Keep in mind that the get_text method may not always produce perfect results, especially for complex or poorly formatted PDF files. It may miss some text or include extra characters. You may need to do additional processing to clean up the extracted text.
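What that clean-up looks like depends on the document. The following is one possible sketch that uses only the standard library; the clean_text helper is a hypothetical name, not part of PyMuPDF, and the de-hyphenation rule is a heuristic that can wrongly join words that are meant to keep their hyphen.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize whitespace in text extracted from a PDF page."""
    # Re-join words hyphenated across line breaks, e.g. "extrac-\ntion" -> "extraction"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the empty ones
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = "PyMuPDF   makes  text extrac-\ntion  easy. \n\n\n  Next line.\t"
print(clean_text(sample))
```

You could apply the helper to each element of text_list after extraction.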

Add text to a PDF file

To add text to a PDF file using PyMuPDF, you can use the insert_text method of the Page object. This method takes the position of the text on the page and the text to be added, plus optional formatting arguments such as the font size, and writes the text onto the page starting at that position. The position is given in points, measured from the top-left corner of the page, and marks the bottom-left corner of the first character.

Here is an example of how to add some text to the first page of a PDF document:

import fitz

# Open the PDF document
doc = fitz.open("document.pdf")

# Get the first page
page = doc[0]

# Set the font size
font_size = 20

# Set the position of the text on the page (in points, from the top-left corner)
x = 50
y = 50

# Set the text to be added
text = "This is some text"

# Add the text to the page
page.insert_text((x, y), text, fontsize=font_size)

# Save the changes to the PDF
doc.save("modified_document.pdf")

This will add the text “This is some text” to the first page of the document.pdf file at position (50, 50) with a font size of 20. The modified document will be saved to a new file called modified_document.pdf.

You can customize the position, font size, and other formatting options of the text as needed. You can also add text to multiple pages by repeating the above steps for each page.

Rotate pages in a PDF document

To rotate pages in a PDF document using PyMuPDF, you can use the set_rotation method of the Page object. This method takes an angle in degrees, which must be a multiple of 90, and sets the page's rotation to that value. You can read the current rotation of a page from its rotation property.

Here is an example of how to rotate all the pages in a PDF document by 90 degrees:

import fitz

# Open the PDF document
doc = fitz.open("document.pdf")

# Iterate over all the pages in the document
for page in doc:
    # Rotate the page by 90 degrees
    page.set_rotation(90)

# Save the changes to the PDF
doc.save("rotated_document.pdf")

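This will set the rotation of every page in the document.pdf file to 90 degrees and save the result to a new file called rotated_document.pdf. The angle must be a multiple of 90 (0, 90, 180, or 270), because PDF page rotation only supports quarter turns. Note that set_rotation sets an absolute value rather than adding to a rotation the page already has.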

Extract images from a PDF file

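To extract images from a PDF file using PyMuPDF, you can use the get_images method of the Page object. This method returns a list with one entry per image used on the page, and the first item of each entry is the image's cross-reference number (xref). You can create a Pixmap object, which represents an image, from the document and that xref, and then save it to a file using the save method of the Pixmap object.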

Here is an example of how to extract all the images from a PDF file and save them to image files:

import fitz

# Open the PDF document
doc = fitz.open("document.pdf")

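# Iterate over all the pages in the document
for i in range(doc.page_count):
    # Get the current page
    page = doc[i]

    # List the images on the page; the first item of each entry is the image xref
    for img in page.get_images():
        xref = img[0]

        # Get the image data
        pix = fitz.Pixmap(doc, xref)

        # PNG supports only grayscale and RGB, so convert other colorspaces (e.g. CMYK)
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)

        # Save the image, using the page index and xref to keep file names unique
        pix.save("image{}_{}.png".format(i, xref))

        # Free the memory used by the Pixmap object
        pix = None

This will extract all the images from the document.pdf file and save them as image files with names like image0_5.png, where the first number is the page index and the second is the image's xref. The images are written as PNG files, regardless of the format in which they are stored in the PDF file.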

Merge two PDFs

To merge two PDFs using PyMuPDF, you can use the insert_pdf method of the Document object:

import fitz

# Open the first PDF document
doc1 = fitz.open('document1.pdf')

# Open the second PDF document
doc2 = fitz.open('document2.pdf')

# Insert the second document into the first document
doc1.insert_pdf(doc2)

# Save the merged PDF
doc1.save('merged.pdf')

This will create a new PDF file called merged.pdf that contains the pages from both document1.pdf and document2.pdf. The pages from document2.pdf will be appended to the end of document1.pdf.

Delete a page from a PDF

To delete a page from a PDF using PyMuPDF, you can use the delete_page method of the Document object. Here is an example of how to delete the second page of a PDF:

import fitz

# Open the PDF document
doc = fitz.open('document.pdf')

# Delete the second page
doc.delete_page(1)

# Save the modified PDF
doc.save('modified.pdf')

This will create a new PDF file called modified.pdf with the second page removed. Note that the page numbering is zero-based, so to delete the second page you need to specify the index 1.

If you want to delete multiple pages at once, you can pass a list of page indices to the delete_pages method.

As you can see, PyMuPDF is a powerful and easy-to-use library for working with PDF files in Python. Whether you need to extract text and images from PDFs or add and modify content in existing PDFs, PyMuPDF has you covered.

The PyMuPDF documentation is a great resource, and it includes many more examples to explore.

Article information

Author: Delena Feil

Last Updated: 04/10/2023
