Building MuPDF for WebAssembly from Source: A Detailed Guide
Want to leverage the power of the high-quality PDF library, MuPDF, directly in your web browser? While pre-built libraries might exist, building MuPDF from source for WebAssembly (Wasm) gives you ultimate control over features, optimizations, and the latest updates. It's also a fantastic learning experience!
This guide will walk you through the entire process:
- Setting up your environment.
- Cloning and building MuPDF's core libraries for Wasm using Emscripten.
- Writing a simple C wrapper to expose MuPDF functionality.
- Compiling your C code and linking it against MuPDF into a Wasm module.
- Applying aggressive optimizations to minimize the Wasm file size.
- Creating an HTML/JavaScript interface to interact with your MuPDF Wasm module.
Let's dive in!
Prerequisites
Before we start, make sure you have the following tools installed:
- Git: For cloning the MuPDF repository.
- Emscripten SDK: The toolchain for compiling C/C++ to WebAssembly. Follow the official installation guide: https://emscripten.org/docs/getting_started/downloads.html
- Make: MuPDF uses Makefiles for its build process (usually included with standard build tools on Linux/macOS, may need setup on Windows).
- Python: Required by the Emscripten SDK.
Step 1: Get the MuPDF Source Code
First, we need to download the MuPDF source code, including its dependencies (like FreeType, zlib, etc.) which are managed as Git submodules.
git clone --recursive https://github.com/ArtifexSoftware/mupdf.git
cd mupdf
Using --recursive is crucial to fetch the necessary third-party libraries.
Step 2: Activate the Emscripten Environment
In your terminal, navigate to the directory where you installed the Emscripten SDK and activate it for your current session:
# In your Emscripten SDK directory
./emsdk activate latest
source ./emsdk_env.sh
Verify the activation by running emcc -v. You should see Emscripten's version information.
Step 3: Build MuPDF Static Libraries for Wasm (Optimized)
Now, we'll compile MuPDF into static libraries (.a files) specifically for the WebAssembly target. We'll use emmake, an Emscripten wrapper around make, which ensures the correct compiler (emcc) and tools are used.
We will also apply optimizations and disable features we don't need in a typical Wasm library context, including GUI elements and, most significantly for size, the bundled fonts.
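Warning: The build below trims the bundled fonts by defining TOFU and TOFU_CJK_EXT (passed through XCFLAGS). As far as the comments in include/mupdf/fitz/config.h describe them, these leave out the Noto fallback fonts and the CJK extension glyphs; the related TOFU_CJK and TOFU_BASE14 defines go further and drop the CJK fallback font and the base 14 PDF fonts. The command does not pass HAVE_FONT_BASE14=no or HAVE_FONT_CJK=no, so confirm those switches exist in your version's Makerules before relying on them. Removing fonts dramatically reduces file size, but text rendering will likely fail if the PDF documents you process do not embed all the fonts they use. Only do this if you don't need text rendering or are certain all fonts are embedded in your PDFs.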
# Still inside the mupdf directory
# Build optimized release libraries, targeting Wasm, disabling GUI,
# and disabling bundled fonts for significant size reduction.
# Define the flags (optional, but keeps the command cleaner)
MUPDF_OPTS="-Os -DTOFU -DTOFU_CJK_EXT -DFZ_ENABLE_XPS=0 -DFZ_ENABLE_SVG=0 -DFZ_ENABLE_CBZ=0 -DFZ_ENABLE_IMG=0 -DFZ_ENABLE_HTML=0 -DFZ_ENABLE_EPUB=0 -DFZ_ENABLE_JS=0 -DFZ_ENABLE_OCR_OUTPUT=0 -DFZ_ENABLE_DOCX_OUTPUT=0 -DFZ_ENABLE_ODT_OUTPUT=0 -DMEMENTO_STACKTRACE_METHOD=0"
# Your emmake command incorporating the flags via XCFLAGS
emmake make build=release OS=wasm HAVE_X11=no HAVE_GLUT=no HAVE_GLFW=no XCFLAGS="$MUPDF_OPTS" libs
# Note: You can adjust HAVE_* flags further based on your needs.
# Check Makerules for the options your MuPDF version actually supports.
# For example, if you don't need JBIG2 or JPEG2000 (JPX) image support,
# look for a switch to disable them there.
# Size optimization in this command comes from -Os inside MUPDF_OPTS;
# replace it with -Oz there if you want to push harder on size.
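Explanation of Flags:
- emmake: Ensures make uses Emscripten tools (emcc, emar).
- build=release: Creates an optimized build.
- OS=wasm: Specifies the target operating system/environment.
- HAVE_X11=no, HAVE_GLUT=no, HAVE_GLFW=no: Disable desktop GUI dependencies.
- XCFLAGS="$MUPDF_OPTS": Passes extra compiler flags to the library build. Here that means -Os (optimize for size), the TOFU font defines (crucial for size reduction; remember the text rendering caveat!), and the FZ_ENABLE_*=0 defines that switch off document formats and outputs we don't need.
- libs: A Make target often used to build just the necessary static libraries.
The command above does not pass the following options. Treat them as further ideas, and confirm the exact names in Makerules for your MuPDF version before using them:
- HAVE_FONT_BASE14=no, HAVE_FONT_CJK=no: Disable bundling standard PDF fonts and CJK fallback fonts.
- HAVE_JBIG2=no, HAVE_JPX=no: Disable specific image format support if not needed.
- HAVE_OPENSSL=no: Disable OpenSSL-based features if not needed.
- OPTIMIZE_FLAGS="-Oz": Applies aggressive size optimization during the library build itself.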
After this completes, you should find the compiled static libraries in the build/release/ subdirectory:
build/release/libmupdf.a
build/release/libmupdf-third.a
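These contain the MuPDF code compiled to WebAssembly object files, ready for linking into a Wasm module. If the files are not in build/release/, look for a wasm-specific subdirectory under build/: the output path depends on your MuPDF version's Makerules, and the library paths in the Step 5 emcc command must match wherever they ended up.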
Step 4: Write Your C Wrapper (main.c)
We need a C file that will act as the bridge between JavaScript and the MuPDF library functions. Create a file named main.c (e.g., in the directory above the mupdf source folder).
This example C code defines a function get_pdf_page_count that takes a PDF file's data (as a buffer from JavaScript), uses MuPDF to open it, gets the page count, and returns it.
// main.c
#include <stdio.h>
#include <stdlib.h>
#include "mupdf/fitz.h"
#include <emscripten.h>

/**
 * Gets the page count of a PDF document from a memory buffer.
 *
 * @param pdf_data  Pointer to PDF data in Wasm memory
 * @param data_size Size of the PDF data in bytes
 * @return Page count, -1 if the context could not be created,
 *         or -2 if MuPDF failed to open or read the document
 */
EMSCRIPTEN_KEEPALIVE
int get_pdf_page_count(unsigned char *pdf_data, int data_size) {
    fz_context *ctx = NULL;
    fz_document *doc = NULL;
    fz_stream *stream = NULL;
    int page_count = -1;

    // Initialize MuPDF context
    ctx = fz_new_context(NULL, NULL, FZ_STORE_UNLIMITED);
    if (!ctx) {
        fprintf(stderr, "Cannot create MuPDF context\n");
        return -1;
    }

    // Variables assigned inside fz_try and read afterwards must be
    // marked with fz_var, because fz_try is built on setjmp/longjmp.
    fz_var(doc);
    fz_var(stream);
    fz_var(page_count);

    // Process PDF document
    fz_try(ctx) {
        // Register document handlers
        fz_register_document_handlers(ctx);
        // Create stream from memory buffer (the buffer is not copied)
        stream = fz_open_memory(ctx, pdf_data, (size_t)data_size);
        // Open document from stream
        doc = fz_open_document_with_stream(ctx, ".pdf", stream);
        // Get page count
        page_count = fz_count_pages(ctx, doc);
    }
    fz_always(ctx) {
        // Runs on success and on error; the drop functions accept NULL
        fz_drop_document(ctx, doc);
        fz_drop_stream(ctx, stream);
    }
    fz_catch(ctx) {
        fprintf(stderr, "Error processing PDF: %s\n", fz_caught_message(ctx));
        page_count = -2;
    }

    fz_drop_context(ctx);
    printf("PDF page count: %d\n", page_count);
    return page_count;
}

// Main function for standalone testing
int main() {
    printf("Wasm module loaded. Call get_pdf_page_count() from JavaScript.\n");
    return 0;
}
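On the JavaScript side it helps to turn the wrapper's integer result into something readable. The following is a minimal sketch, assuming the return codes used in main.c above (-1 for a failed context, -2 for a MuPDF error); the helper name describePageCountResult and the message wording are made up for illustration.

```javascript
// Hypothetical helper: maps the integer returned by get_pdf_page_count
// to a small result object. The codes -1 and -2 come from main.c;
// the message texts are assumptions for this example.
function describePageCountResult(code) {
  if (Number.isInteger(code) && code >= 0) {
    return { ok: true, pages: code, message: `${code} page(s)` };
  }
  const messages = {
    [-1]: "MuPDF context could not be created",
    [-2]: "MuPDF failed to open or read the document",
  };
  return {
    ok: false,
    pages: null,
    message: messages[code] ?? `Unknown error code ${code}`,
  };
}
```

If you add more error codes to the C wrapper later, extend the messages table in the same change so the two stay in sync.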
Step 5: Compile and Link with Emscripten (Optimized)
Now, the crucial step: compile main.c and link it with the MuPDF static libraries (libmupdf.a, libmupdf-third.a) into the final main.js (JavaScript glue code) and main.wasm (WebAssembly module). We'll use aggressive optimization flags.
Make sure you run this command from the directory containing main.c (which should typically be the parent directory of the mupdf folder).
# In the directory containing main.c (parent of mupdf/)
emcc main.c \
mupdf/build/release/libmupdf.a \
mupdf/build/release/libmupdf-third.a \
-o main.js \
-I mupdf/include \
-s WASM=1 \
-s EXPORTED_FUNCTIONS="['_get_pdf_page_count', '_malloc', '_free']" \
-s EXPORTED_RUNTIME_METHODS="['ccall', 'cwrap', 'HEAPU8']" \
-s ALLOW_MEMORY_GROWTH=1 \
-s MODULARIZE=1 \
-s EXPORT_NAME="'createMyMuPDFModule'" \
-s ASSERTIONS=0 \
-Oz \
-flto
Explanation of emcc Flags:
- main.c: Your C wrapper source file.
- mupdf/build/release/libmupdf.a, mupdf/build/release/libmupdf-third.a: The static libraries we built earlier.
- -o main.js: Output base name. Generates main.js and main.wasm.
- -I mupdf/include: Tells the compiler where to find MuPDF headers (mupdf/fitz.h).
- -s WASM=1: Ensure Wasm output (usually default).
- -s EXPORTED_FUNCTIONS="['_get_pdf_page_count', '_malloc', '_free']": Explicitly lists C functions callable from JavaScript. We need our function, plus malloc and free for memory management between JS and Wasm. The leading underscore is how Emscripten names exported C symbols.
- -s EXPORTED_RUNTIME_METHODS="['ccall', 'cwrap', 'HEAPU8']": Exports Emscripten's helper functions for calling C from JS, plus the HEAPU8 byte view of Wasm memory that the page uses to copy the PDF data in.
- -s ALLOW_MEMORY_GROWTH=1: Allows the Wasm module's memory heap to expand if needed (PDFs can be large).
- -s MODULARIZE=1 -s EXPORT_NAME="'createMyMuPDFModule'": Wraps the generated JS in a factory function (createMyMuPDFModule). This avoids polluting the global scope and provides a clean way to instantiate the module using a Promise.
- -s ASSERTIONS=0: Optimization. Disables runtime checks and assertions in Emscripten code, reducing size. Only use for release builds.
- -Oz: Optimization. Optimize aggressively for code size.
- -flto: Optimization. Enables Link-Time Optimization, which lets the optimizer work across translation units at link time for better dead code removal and inlining. LTO only covers code that was itself compiled with -flto: as built in Step 3 the MuPDF libraries were not, so to extend it into MuPDF you would also add -flto to the library build (for example via XCFLAGS). Significantly increases link time but often yields large size reductions.
This command might take a while, especially with -flto. Once done, you'll have main.js and main.wasm.
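Before wiring up the page, a quick sanity check on the output can save some head-scratching. Every WebAssembly binary starts with the four magic bytes "\0asm" followed by a 4-byte version (currently 1). The Node.js sketch below checks that header; isWasmBinary and checkWasmFile are hypothetical helper names, not part of Emscripten.

```javascript
const fs = require("fs");

// Checks the 8-byte WebAssembly header: magic "\0asm" followed by
// the little-endian version number 1.
function isWasmBinary(bytes) {
  const expected = [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00];
  if (bytes.length < expected.length) return false;
  return expected.every((value, i) => bytes[i] === value);
}

// Reads only the first 8 bytes of a file and checks them.
function checkWasmFile(path) {
  const fd = fs.openSync(path, "r");
  try {
    const header = Buffer.alloc(8);
    const bytesRead = fs.readSync(fd, header, 0, 8, 0);
    return isWasmBinary(header.subarray(0, bytesRead));
  } finally {
    fs.closeSync(fd);
  }
}

// Usage (assuming main.wasm is in the current directory):
// console.log(checkWasmFile("main.wasm"));
```

A false result usually means the link step failed part-way or the path points at something else, such as an HTML error page saved under the wrong name.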
Step 6: Create the HTML Interface (index.html)
Now, create an HTML file to load and interact with your Wasm module. This file provides a file input, calls the exported C function, and displays the result.
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>MuPDF Wasm Test</title>
<style>
body { font-family: sans-serif; line-height: 1.5; padding: 1em; }
#status, #result { font-weight: bold; }
#error { color: red; font-weight: bold; margin-top: 1em;}
input[type="file"] { margin-top: 10px; margin-bottom: 10px; display: block; }
pre { background-color: #eee; padding: 10px; border-radius: 4px; white-space: pre-wrap; word-wrap: break-word; }
</style>
</head>
<body>
<h1>MuPDF WebAssembly Test</h1>
<p>Build MuPDF from source for Wasm, call a C function to get the page count of a selected PDF.</p>
<p><em>Note: Built without bundled fonts for smaller size. Text rendering in complex PDFs might fail if fonts are not embedded in the PDF itself.</em></p>
<input type="file" id="pdf-input" accept=".pdf">
<div>Status: <span id="status">Loading Wasm module...</span></div>
<div>Result: <span id="result">--</span></div>
<div id="error"></div>
<h2>Browser Console Logs:</h2>
<pre id="console-log">Logs will appear here...</pre>
<!-- Include the generated JS glue code -->
<script src="main.js"></script>
<script>
const pdfInput = document.getElementById('pdf-input');
const statusSpan = document.getElementById('status');
const resultSpan = document.getElementById('result');
const errorSpan = document.getElementById('error');
const consoleLog = document.getElementById('console-log');
let myModule = null; // To hold the loaded Wasm module instance
// --- Log C's printf/stderr to the page ---
function logToPage(text) {
consoleLog.textContent += text + "\n";
// Auto-scroll
consoleLog.scrollTop = consoleLog.scrollHeight;
}
// --- Initialize Wasm Module ---
pdfInput.disabled = true; // Disable input until module is ready
consoleLog.textContent = "Initializing Emscripten module...\n";
// Use the factory function created by EXPORT_NAME
createMyMuPDFModule({
// Redirect stdout and stderr from C/Wasm to our logging function
print: logToPage,
printErr: logToPage,
}).then(Module => {
logToPage("Emscripten module loaded successfully.");
console.log("MuPDF Wasm module instance:", Module);
myModule = Module; // Store the instance
statusSpan.textContent = "Module ready. Select a PDF file.";
pdfInput.disabled = false; // Enable the file input
}).catch(e => {
logToPage("ERROR loading Wasm module: " + e);
console.error("Error loading Wasm module:", e);
statusSpan.textContent = "Error loading Wasm module!";
errorSpan.textContent = "Failed to initialize the PDF processing engine: " + e;
});
// --- Handle File Input ---
pdfInput.addEventListener('change', (event) => {
const file = event.target.files[0];
if (!file) return; // No file selected
if (!myModule) { // Should not happen if input is enabled correctly
errorSpan.textContent = "Error: Wasm module not ready.";
logToPage("Error: File selected but Wasm module not ready.");
return;
}
statusSpan.textContent = `Reading file: ${file.name}...`;
resultSpan.textContent = "--"; // Reset result
errorSpan.textContent = ""; // Clear previous errors
logToPage(`--------------------\nSelected file: ${file.name} (${(file.size / 1024).toFixed(1)} KB)`);
const reader = new FileReader();
reader.onload = function(e) {
statusSpan.textContent = `Processing ${file.name}...`;
logToPage("File read completed. Preparing data for Wasm...");
const pdfData = new Uint8Array(e.target.result);
let dataPtr = 0; // Pointer in Wasm memory
try {
// 1. Allocate memory in the Wasm heap
logToPage(`Allocating ${pdfData.length} bytes in Wasm heap...`);
dataPtr = myModule._malloc(pdfData.length);
if (dataPtr === 0) {
throw new Error("Failed to allocate memory in Wasm heap (_malloc returned 0).");
}
logToPage(`Allocated memory at address: ${dataPtr}`);
// 2. Copy PDF data from JS ArrayBuffer to Wasm heap
logToPage("Copying PDF data to Wasm heap...");
// Module.HEAPU8 provides a view into the Wasm memory as bytes
myModule.HEAPU8.set(pdfData, dataPtr);
logToPage("Data copied.");
// 3. Call the exported C function via ccall
logToPage(`Calling C function: _get_pdf_page_count(ptr=${dataPtr}, size=${pdfData.length})...`);
// ccall signature: functionName, returnType, argTypes, args
const pageCount = myModule.ccall(
'get_pdf_page_count', // C function name (no leading underscore; ccall looks up the exported _get_pdf_page_count)
'number', // Return type (int -> number)
['number', 'number'], // Argument types (pointer -> number, int -> number)
[dataPtr, pdfData.length] // Arguments
);
logToPage(`C function returned: ${pageCount}`);
// 4. Process the result
if (pageCount >= 0) {
resultSpan.textContent = `${pageCount} pages`;
statusSpan.textContent = "Processing complete.";
logToPage("Successfully retrieved page count.");
} else {
// Handle specific error codes from C
throw new Error(`MuPDF C function failed (returned code: ${pageCount}). Check logs above.`);
}
} catch (err) {
console.error("Error during Wasm execution:", err);
logToPage("ERROR during Wasm execution: " + err.message);
errorSpan.textContent = `Runtime Error: ${err.message}`;
statusSpan.textContent = "Processing failed.";
resultSpan.textContent = "--";
} finally {
// 5. IMPORTANT: Free the allocated memory in Wasm heap
if (dataPtr !== 0) {
logToPage(`Freeing Wasm memory at address: ${dataPtr}...`);
myModule._free(dataPtr);
logToPage("Memory freed.");
}
logToPage("Finished processing request.");
}
}; // end reader.onload
reader.onerror = function() {
const errorMsg = `Error reading file: ${reader.error}`;
console.error(errorMsg);
logToPage("ERROR reading file: " + reader.error);
errorSpan.textContent = errorMsg;
statusSpan.textContent = "File reading failed.";
resultSpan.textContent = "--";
};
// Start reading the file
reader.readAsArrayBuffer(file);
}); // end event listener
</script>
</body>
</html>
Key parts of the JavaScript:
- Module Loading: createMyMuPDFModule({...}).then(...) asynchronously loads and initializes the Wasm module. The file input is disabled until this completes. We redirect print and printErr to see printf/stderr output from C.
- File Reading: FileReader reads the selected PDF as an ArrayBuffer.
- Memory Management (JS -> Wasm):
  - myModule._malloc(size) allocates memory inside the Wasm module's heap.
  - myModule.HEAPU8.set(data, ptr) copies the byte data from the JavaScript Uint8Array into the allocated Wasm memory location.
  - myModule._free(ptr) releases the memory allocated in Wasm. This is crucial to prevent memory leaks! Use a try...finally block to ensure _free is called even if errors occur.
- Calling C: myModule.ccall(funcName, returnType, argTypes, args) invokes the exported C function. ccall takes the plain C name (get_pdf_page_count); the underscore form (_get_pdf_page_count) is the name of the raw export.
- Error Handling: Checks return values and uses try...catch to handle potential runtime errors during Wasm interaction.
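If you call the C function from more than one place, the allocate/copy/call/free sequence is worth wrapping once. Here is a minimal sketch using cwrap (exported in Step 5 alongside ccall). makePageCounter is a hypothetical helper name, and the Module argument is assumed to be the instance resolved by createMyMuPDFModule.

```javascript
// Hypothetical helper (not part of Emscripten): wraps the
// malloc / copy / call / free sequence around get_pdf_page_count.
function makePageCounter(Module) {
  // cwrap returns a reusable JS function for the exported C function.
  const getPageCount = Module.cwrap(
    "get_pdf_page_count",
    "number",
    ["number", "number"]
  );

  return function countPages(pdfBytes) {
    const ptr = Module._malloc(pdfBytes.length);
    if (ptr === 0) {
      throw new Error("_malloc returned 0");
    }
    try {
      // Read Module.HEAPU8 after _malloc: with ALLOW_MEMORY_GROWTH the
      // underlying buffer can be replaced when the heap grows.
      Module.HEAPU8.set(pdfBytes, ptr);
      return getPageCount(ptr, pdfBytes.length);
    } finally {
      // Always release the Wasm-side copy, even if the call throws.
      Module._free(ptr);
    }
  };
}

// Usage inside reader.onload (names from the page above):
// const countPages = makePageCounter(myModule);
// const pageCount = countPages(new Uint8Array(e.target.result));
```

Compared with ccall, cwrap does the name lookup once, which is a small win if the function is called repeatedly.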
Step 7: Run the Example
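WebAssembly modules usually need to be served over HTTP(S). The glue code fetches main.wasm at startup, browsers block that fetch for pages opened from file://, and streaming compilation expects the file to arrive with the application/wasm MIME type. So you can't just open the index.html file directly using file://.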
- Make sure index.html, main.js, and main.wasm are in the same directory.
- Open your terminal in that directory.
- Start a simple local web server. Examples:
  - Using Python 3: python -m http.server 8000 (use python3 if python points to Python 2 on your system)
  - Using Node.js (requires http-server): npm install -g http-server (if needed), then http-server . -p 8000
- Open your web browser and navigate to http://localhost:8000.
You should see the page load, the Wasm module initialize (check the status and console log area), and then you can select a PDF file to get its page count!
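If you'd rather not install anything extra, a few lines of Node.js are enough for local testing. The sketch below is one way to do it, not a production server: it serves files from a directory and makes sure .wasm is sent as application/wasm. The names createStaticServer and contentTypeFor are made up for this example.

```javascript
const http = require("http");
const fs = require("fs");
const path = require("path");

const MIME_TYPES = {
  ".html": "text/html; charset=utf-8",
  ".js": "text/javascript; charset=utf-8",
  ".wasm": "application/wasm",
};

// Picks a Content-Type from the file extension.
function contentTypeFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return MIME_TYPES[ext] || "application/octet-stream";
}

// Creates (but does not start) a static file server for rootDir.
function createStaticServer(rootDir) {
  const root = path.resolve(rootDir);
  return http.createServer((req, res) => {
    let urlPath;
    try {
      urlPath = decodeURIComponent(new URL(req.url, "http://localhost").pathname);
    } catch (err) {
      res.writeHead(400);
      res.end("Bad request");
      return;
    }
    const relative = urlPath === "/" ? "index.html" : urlPath.slice(1);
    const filePath = path.resolve(root, relative);
    // Refuse paths that escape the served directory.
    if (!filePath.startsWith(root + path.sep)) {
      res.writeHead(403);
      res.end("Forbidden");
      return;
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404);
        res.end("Not found");
        return;
      }
      res.writeHead(200, { "Content-Type": contentTypeFor(filePath) });
      res.end(data);
    });
  });
}

// Usage: createStaticServer(".").listen(8000);
```

Save it next to index.html, uncomment the last line, and run it with node; the page is then available at http://localhost:8000 just like with the other options.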
Key Takeaways & Further Steps
- Build from Source: Gives you control over features and optimizations.
- Emscripten: The bridge between C/C++ and WebAssembly. emmake and emcc are key tools.
- Static Libraries: Complex projects are often built as static libraries (.a) first, then linked into the final Wasm module.
- Size Optimization: -Oz, -flto, -s ASSERTIONS=0, and disabling unused features (especially bundled fonts, here via the TOFU defines, and unused document formats via FZ_ENABLE_*=0) are critical for reducing Wasm file size. Remember the trade-offs!
- JS/Wasm Interaction: Requires careful memory management (_malloc, HEAPU8.set, _free) and using Emscripten's calling conventions (ccall/cwrap).
- Server Compression: Configure your web server to serve .wasm files with Gzip or Brotli compression for significantly faster downloads.
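To see what compression would buy you for your particular build, you can measure it locally with Node's built-in zlib module. This is a rough sketch: compressedSizes is a hypothetical helper, and real servers may use different compression levels, so treat the numbers as an estimate.

```javascript
const zlib = require("zlib");

// Returns the raw, gzip, and Brotli sizes (in bytes) of the given data.
function compressedSizes(bytes) {
  const input = Buffer.from(bytes);
  return {
    raw: input.length,
    gzip: zlib.gzipSync(input, { level: 9 }).length,
    brotli: zlib.brotliCompressSync(input).length,
  };
}

// Usage (assuming main.wasm is in the current directory):
// console.log(compressedSizes(require("fs").readFileSync("main.wasm")));
```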
From here, you could extend the C wrapper (main.c) to expose more MuPDF functionality like rendering pages to images, extracting text, or handling annotations, rebuilding and relinking as needed. Happy coding!