OCR Extractor

by Johnathan Ritzi
Score: 52/100

Description

The OCR Extractor plugin turns embedded documents and images into searchable text using optical character recognition. It processes attachments already present in notes and places the extracted text directly below the original file inside a collapsible callout. This keeps the raw files untouched while making their contents visible, searchable, and indexable by both internal search and system-level tools. The plugin supports batch extraction, either for a single note or across the entire vault, with progress shown in the status bar and the option to cancel midway. Text extraction runs through a selectable OCR service: Tesseract (the default, free and fully local), Mistral OCR (a paid cloud service that converts complex documents to Markdown), or a custom command.

Stats

25 stars
2,318 downloads
3 forks
141 days
3 days
36 days
38 total PRs
0 open PRs
5 closed PRs
33 merged PRs
6 total issues
0 open issues
6 closed issues
0 commits

Latest Version

a month ago

Changelog

Added

  • Added a setting to automatically extract text when a new attachment is added to a note

Changed

  • Added logic to prevent attempting to extract text from Obsidian-native file types (Markdown, Canvas, Base)
  • Added the plugin name to the beginning of some notices, to make it clear what plugin they're related to
  • Dependency upgrades and developer improvements

Full Changelog: https://github.com/jritzi/ocr-extractor/compare/2.0.1...2.1.0

README file from GitHub

OCR Extractor - Obsidian Plugin

About

OCR Extractor is a simple Obsidian plugin that uses OCR to extract text from PDFs, documents, images, etc. embedded in your notes. Different OCR services (free or paid, local or cloud-based) are available, depending on your needs.

Following Obsidian's philosophy of storing data in an open, future-proof file format, the extracted text is added below the embedded attachment as an expandable callout. This means that the text will be searchable via Obsidian's built-in search, other search plugins, and even your operating system's native file search.
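As a rough illustration of the layout described above, the following Python sketch formats extracted text as a collapsed Obsidian callout and places it on the line after an embed. The function names, the `note` callout type, and the "Extracted text" title are hypothetical choices for this example; the plugin's actual callout type, title, and implementation may differ.

```python
def format_callout(text: str, title: str = "Extracted text", kind: str = "note") -> str:
    """Wrap text in an Obsidian callout; the '-' after the type makes it collapsed by default."""
    header = f"> [!{kind}]- {title}"
    # Every callout line must start with '>'; blank lines become a bare '>'.
    body = [f"> {line}" if line.strip() else ">" for line in text.splitlines()]
    return "\n".join([header, *body])


def insert_below_embed(note: str, embed: str, text: str) -> str:
    """Return the note with a callout inserted right after each line equal to `embed`."""
    out = []
    for line in note.split("\n"):
        out.append(line)
        if line.strip() == embed:
            out.append(format_callout(text))
    return "\n".join(out)


note = "Intro\n![[scan.png]]\nOutro"
print(insert_below_embed(note, "![[scan.png]]", "Hello\n\nWorld"))
```

Because the callout is ordinary Markdown stored in the note file, any tool that searches plain text can find the extracted content.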

Usage

Click on the ribbon icon (or use the command palette) and select one of the two options:

  1. Extract text in current note
  2. Extract text in all notes (not available on mobile)

When extracting from all notes, you can see the progress in the status bar, or click it and select "Cancel" to cancel the operation.

OCR services

Depending on your needs, you can choose which OCR service to use. Select the service in the plugin settings and follow the setup steps below.

Tesseract

Tesseract (the default option) is a popular open source OCR engine. It has some limitations (only supports English text, can only process PDFs and images, can be slower, and can be less accurate), but it's completely free and local (ensuring your data is never sent to a third-party provider). This option requires no additional setup.

Mistral OCR

Mistral OCR is a powerful AI model for extracting text from complex documents and converting it to Markdown. It supports many different languages and file types. This option requires a paid Mistral AI account (at the time of writing, it costs $2 per 1000 pages processed). Attachments are sent to Mistral's OCR service for text extraction (see their privacy policy).

First, you need to create a Mistral AI account. Follow the steps in their Quickstart guide:

  1. Create an account
  2. Add payment information
  3. Recommended: Set a monthly spending limit, to avoid any unexpected charges
  4. Create an API key

Then, enter your API key in the plugin settings.
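To get a feel for what a batch extraction might cost, the pricing quoted above ($2 per 1000 pages at the time of writing) can be turned into a quick estimate. This is only a back-of-the-envelope sketch: the function name is made up for this example, the rate may have changed, and actual billing is determined by Mistral.

```python
def estimate_cost_usd(pages: int, usd_per_1000_pages: float = 2.0) -> float:
    """Estimate OCR cost from a page count, assuming flat per-page pricing."""
    return pages * usd_per_1000_pages / 1000


# Example: a vault whose attachments total 2,500 pages.
print(f"${estimate_cost_usd(2500):.2f}")
```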

Custom command

For advanced use cases, you can provide a custom command that will be used to process attachments. This can be used, for example, to extract text with an OCR model running locally, a script that uses a third-party API (that isn't supported natively by the plugin), or Tesseract with a custom configuration.

Enter your custom command in the plugin settings, where {input} is the path to the input attachment file and {output} is the path to the produced Markdown or text file containing the extracted text. To skip an unsupported attachment, don't create the output file. For example:

tesseract {input} - -l eng+spa > {output}

Click the "Test" button to run the command on a sample image with text and confirm it correctly extracts the text. If the custom command only supports images, enable the setting to convert PDFs to PNGs before processing.

Note that this option is not supported on mobile, so if a custom command is configured, the plugin will use Tesseract on mobile instead of running the custom command.
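A custom command can also point at a small script. The sketch below is one possible wrapper, not part of the plugin: it assumes the `tesseract` binary is on the PATH, handles only a hypothetical set of image extensions, and skips anything else by not creating the output file, as described above. The script name `ocr_wrapper.py` is made up for this example.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: this wrapper handles images only. For PDFs, rely on the plugin's
# setting that converts PDFs to PNGs before processing.
SUPPORTED = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported(input_path: str) -> bool:
    """Check the file extension (case-insensitive) against the supported set."""
    return Path(input_path).suffix.lower() in SUPPORTED


def extract(input_path: str, output_path: str, languages: str = "eng+spa") -> bool:
    """Run Tesseract and write its text to output_path.

    Returns False, without creating the output file, for unsupported inputs.
    """
    if not is_supported(input_path):
        return False
    # "-" as the output base tells Tesseract to print the text to stdout.
    result = subprocess.run(
        ["tesseract", input_path, "-", "-l", languages],
        capture_output=True,
        text=True,
        check=True,
    )
    Path(output_path).write_text(result.stdout, encoding="utf-8")
    return True


# Only act when called with exactly two arguments: input path and output path.
if __name__ == "__main__" and len(sys.argv) == 3:
    extract(sys.argv[1], sys.argv[2])
```

With the script saved somewhere on disk, the custom command setting could then look like `python3 /path/to/ocr_wrapper.py {input} {output}`, where the path is a placeholder. Use the "Test" button to confirm the script works in your environment.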

Contributing

For details on how to report a bug, share a feature request, or contribute code, see the Contribution Guidelines. To report a security issue, see the Security Policy.

Translations

OCR Extractor is available in several languages. To request a new language (or to suggest an improvement for an existing translation), start a discussion.

License

OCR Extractor is licensed under the MIT License.

Similar Plugins

• Similar plugins are suggested based on the common tags between the plugins.
Obsidian OCR
4 years ago by Jonas Mohr
Obsidian OCR allows you to search for text in your images and PDFs
Text Extractor
3 years ago by Simon Cambier
A (companion) plugin to facilitate the extraction of text from images (OCR) and PDFs.
Image OCR
3 years ago by kaffarell
Runs OCR on pasted images and posts the result in a details box. This allows you to search in images.
MathLive
3 years ago by Dan Zilberman
The must-have plugin for math in Obsidian
Image2LaTEX
3 years ago by Hugo Persson
This is a plugin for Obsidian that will read your latest copied image from the clipboard and generate math LaTeX from it
Image to text OCR
2 years ago by Dario Baumberger
Convert an image in your note to text.
Taskbone
5 years ago by Dominik Schlund
Obsidian OCR plugin - extract text from images
Omnisearch
4 years ago by Simon Cambier
A search engine that "just works" for Obsidian. Supports OCR and PDF indexing.
Vision Recall
a year ago by Travis Van Nimwegen
Transform screenshots into searchable Obsidian notes using AI vision and text analysis
Student Repo
a year ago by Feirong.zfr
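Student Repository Helper is an Obsidian plugin for students and their parents. It aims to solve the material-management problems students face during their studies by systematically digitizing and organizing the important materials produced while learning, such as test papers, notes, key documents, and drawings and craft works, and by using an AI assistant to periodically analyze and summarize learning progress. Over time, it helps you gradually build a personal knowledge repository that will stay with you for life and bear witness to your growth and accumulated knowledge.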
Images to Notes
a year ago by Rodolfo Terriquez
Turn photos of your handwritten notes into markdown
Handwriting OCR
9 months ago by ikmolbo
Transform handwritten documents and scanned images into editable text with Handwriting OCR's AI-powered handwriting to text conversion.
AI Image OCR
8 months ago by Rootiest
Obsidian plugin for AI-powered text extraction from images
Content-Addressed Attachments
2 months ago by NateScarlet
Content-addressed attachment storage for automatic deduplication. Works entirely locally; optionally uses GitHub private repositories for hosting.