Enhancing Screenshots in GNOME with OCR

Feb 6, 2025    #linux   #python   #machine-learning   #project  

A while ago, I was working on a project where errors were displayed in a way that made it impossible to copy the text and paste it into ChatGPT. I had to type the whole error message manually, which was frustrating. I thought: what if I could just take a screenshot and extract the text from the image directly? That’s when I decided to enhance GNOME Screenshot with OCR. After all, the Windows Snipping Tool has this feature, so why not GNOME Screenshot?

That’s one of the perks of using Linux: I can build something for my own needs and share it with others. Those who need it can use it, and those who don’t can simply ignore it. So I started working on this project, and I am happy to share that I have successfully enhanced GNOME Screenshot with OCR. In this article, I will explain how I did it and how you can use it too.

The code is available here under the MIT license: github.com/funinkina

📋 Prerequisites

I had strict requirements for this project: I wanted to use only open-source tools and libraries with minimal dependencies, so that others don’t have to install a lot of things to use it. It also needed to be simple and fit in a single file, so it can be executed easily. Here are the tools and libraries I used for this project:

🎨 Design Overview

The design is simple: a Python script takes a screenshot through XDP (xdg-desktop-portal), extracts the text from the image with Tesseract OCR, and displays the extracted text in a dialog box built with Adwaita and GTK. The script is meant to be bound to a keyboard shortcut, so you can take a screenshot and extract its text with a single key press.

The script follows a modular and event-driven design with the following components:

πŸ—οΈ Architecture Breakdown

The script is object-oriented, with the following key classes:

🖥️ 1. TextDialog (UI for Extracted Text)

Displays the recognized text in a Gtk.TextView inside a scrollable window. Provides “Save to File” and “Copy to Clipboard” buttons for user actions. Uses Adw.ToastOverlay for improved UI experience.

🚀 2. GnomeOCRApp (Main Application)

Handles the application lifecycle and integrates:

Screenshot capture via Xdp.Portal.take_screenshot().

OCR text extraction via pytesseract.image_to_string().

GUI window management using GTK 4.
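The capture and OCR steps could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the helper names `uri_to_path`, `extract_text`, and `request_screenshot` are hypothetical, the default language `eng` is an assumption, and the portal callback only fires while a GLib main loop (for example a running GTK application) is active.

```python
from urllib.parse import unquote, urlparse


def uri_to_path(uri: str) -> str:
    """Convert the file:// URI returned by the screenshot portal to a local path."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URI, got {uri!r}")
    return unquote(parsed.path)


def extract_text(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on an image file and return the recognized text."""
    # Imported lazily so uri_to_path() stays usable without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang).strip()


def request_screenshot(on_text, lang: str = "eng") -> None:
    """Ask the portal for an interactive screenshot, then OCR the result.

    `on_text` is a hypothetical callback that receives the extracted string.
    """
    import gi

    gi.require_version("Xdp", "1.0")
    from gi.repository import Xdp

    portal = Xdp.Portal()

    def on_finished(source, result, *_user_data):
        # Raises GLib.Error if the user cancels the screenshot.
        uri = source.take_screenshot_finish(result)
        on_text(extract_text(uri_to_path(uri), lang))

    # Arguments: parent window, flags, cancellable, callback, user data.
    portal.take_screenshot(
        None, Xdp.ScreenshotFlags.INTERACTIVE, None, on_finished, None
    )
```

Keeping the URI conversion and the OCR call in separate functions makes each piece easy to try on its own, for instance by calling `extract_text()` on an existing PNG.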

⚙️ 3. argparse (CLI Arguments)

Supports optional flags:
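As an illustration of the command-line handling, here is a minimal argparse sketch. Only the `--lang` flag is named in this article, so it is the only one shown; the default value `eng`, the help text, and the function names `build_parser` and `parse_cli` are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the one flag this article mentions."""
    parser = argparse.ArgumentParser(
        description="Take a screenshot and extract its text with OCR."
    )
    parser.add_argument(
        "--lang",
        default="eng",  # assumed default; Tesseract language codes like "deu" also work
        help="Tesseract language code used for OCR",
    )
    return parser


def parse_cli(argv=None) -> argparse.Namespace:
    """Parse argv (or sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)
```

With this sketch, `parse_cli(["--lang", "deu"]).lang` gives `"deu"`, which could then be passed on to the OCR call.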

✨ Key Features & Design Considerations

✔ Minimal UI Footprint – The main window is invisible, and only the extracted text is displayed.

✔ Flexible Text Handling – Users can edit the extracted text before saving/copying.

✔ Language Support – The OCR language can be customized via --lang.

✔ Clipboard & File System Integration – Text can be saved or copied seamlessly.

✔ Automatic Cleanup – The script deletes temporary files unless explicitly saved.
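The automatic cleanup idea can be sketched with a small context manager. This is only an illustration under assumptions: the name `temporary_screenshot` and the `keep` marker are hypothetical, and the real script may track its files differently.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_screenshot(suffix: str = ".png"):
    """Yield a temp file record and delete the file on exit unless kept."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the caller writes the file itself
    shot = {"path": path, "keep": False}
    try:
        yield shot
    finally:
        # Remove the file unless the user explicitly chose to save it.
        if not shot["keep"] and os.path.exists(path):
            os.remove(path)
```

Setting `shot["keep"] = True` inside the `with` block stands in for the user choosing to save; otherwise the file is gone once the block exits, even if an error occurred.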

🌟 Here’s a demo screenshot of the script in action:

Gnome Screenshot with OCR

🛠️ Installation & Usage

You can find the installation instructions and usage guide in the README file of the project repository: Gnome-OCR-Screenshot

Please star the repository if you find the project useful, and feel free to contribute by opening issues or pull requests.