Want to use MuPDF, the high-quality PDF library, directly in your web browser? Pre-built packages exist, but building MuPDF from source for WebAssembly (Wasm) gives you full control over features, optimizations, and the latest updates. It's also a fantastic learning experience!

This guide will walk you through the entire process:

  1. Setting up your environment.
  2. Cloning and building MuPDF's core libraries for Wasm using Emscripten.
  3. Writing a simple C wrapper to expose MuPDF functionality.
  4. Compiling your C code and linking it against MuPDF into a Wasm module.
  5. Applying aggressive optimizations to minimize the Wasm file size.
  6. Creating an HTML/JavaScript interface to interact with your MuPDF Wasm module.

Let's dive in!

Prerequisites

Before we start, make sure you have the following tools installed:

  1. Git: For cloning the MuPDF repository.
  2. Emscripten SDK: The toolchain for compiling C/C++ to WebAssembly. Follow the official installation guide: https://emscripten.org/docs/getting_started/downloads.html
  3. Make: MuPDF uses Makefiles for its build process. Make usually ships with the standard build tools on Linux and macOS; on Windows it may need separate setup.
  4. Python: Required by the Emscripten SDK.

Step 1: Get the MuPDF Source Code

First, we need to download the MuPDF source code, including its dependencies (like FreeType, zlib, etc.) which are managed as Git submodules.

git clone --recursive https://github.com/ArtifexSoftware/mupdf.git
cd mupdf

Using --recursive is crucial to fetch the necessary third-party libraries.

Step 2: Activate the Emscripten Environment

In your terminal, navigate to the directory where you installed the Emscripten SDK and activate it for your current session:

# In your Emscripten SDK directory
./emsdk activate latest
source ./emsdk_env.sh

Verify the activation by running emcc -v. You should see Emscripten's version information.

Step 3: Build MuPDF Static Libraries for Wasm (Optimized)

Now, we'll compile MuPDF into static libraries (.a files) specifically for the WebAssembly target. We'll use emmake, an Emscripten wrapper around make, which ensures the correct compiler (emcc) and tools are used.

We will also apply optimizations and disable features we don't need in a typical Wasm library context: the GUI viewers, the non-PDF document formats, and, most significantly for size, most of the bundled fonts.

Warning: Defining TOFU and TOFU_CJK_EXT drops the bundled Noto fallback fonts and the CJK Extension A glyphs. This dramatically reduces file size, but text that relies on those fonts will render as missing glyphs unless the PDF embeds the fonts it uses. MuPDF's include/mupdf/fitz/config.h documents further TOFU_* defines (such as TOFU_CJK and TOFU_BASE14) that strip even more; only use those if you don't need text rendering or are certain all fonts are embedded in your PDFs.

# Still inside the mupdf directory

# Build optimized release libraries, targeting Wasm, disabling GUI,
# and disabling bundled fonts for significant size reduction.
# Define the flags (optional, but keeps the command cleaner)
MUPDF_OPTS="-Os -DTOFU -DTOFU_CJK_EXT -DFZ_ENABLE_XPS=0 -DFZ_ENABLE_SVG=0 -DFZ_ENABLE_CBZ=0 -DFZ_ENABLE_IMG=0 -DFZ_ENABLE_HTML=0 -DFZ_ENABLE_EPUB=0 -DFZ_ENABLE_JS=0 -DFZ_ENABLE_OCR_OUTPUT=0 -DFZ_ENABLE_DOCX_OUTPUT=0 -DFZ_ENABLE_ODT_OUTPUT=0 -DMEMENTO_STACKTRACE_METHOD=0"

# Your emmake command incorporating the flags via XCFLAGS
emmake make build=release OS=wasm HAVE_X11=no HAVE_GLUT=no HAVE_GLFW=no XCFLAGS="$MUPDF_OPTS" libs

# Note: You can adjust the HAVE_* switches further based on your needs.
# Check Makerules for the available options, and
# include/mupdf/fitz/config.h for the FZ_ENABLE_* and TOFU_* defines.
# MuPDF decrypts password-protected PDFs with its own built-in code;
# HAVE_LIBCRYPTO (OpenSSL) appears to matter only for digital signatures.
# The -Os in MUPDF_OPTS already optimizes the library build for size.

Explanation of Flags:

  • emmake: Ensures make uses Emscripten tools (emcc, emar).
  • build=release: Creates an optimized build.
  • OS=wasm: Specifies the target operating system/environment.
  • HAVE_X11=no, HAVE_GLUT=no, HAVE_GLFW=no: Disable desktop GUI dependencies.
  • XCFLAGS="$MUPDF_OPTS": Passes our extra compiler flags to the MuPDF build.
  • -Os: Optimizes the library code for size.
  • -DTOFU, -DTOFU_CJK_EXT: Crucial for size reduction. Leave out the bundled Noto fallback fonts and the CJK Extension A glyphs. Remember the text rendering caveat!
  • -DFZ_ENABLE_XPS=0 through -DFZ_ENABLE_ODT_OUTPUT=0: Compile out what we don't use: the XPS, SVG, CBZ, image, HTML, and EPUB document handlers, JavaScript support in PDFs, and the OCR, DOCX, and ODT output writers.
  • -DMEMENTO_STACKTRACE_METHOD=0: Turns off stack trace capture in Memento, MuPDF's memory debugging layer.
  • libs: The MuPDF Make target that builds only the static libraries, skipping the command-line tools and viewers.

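After this completes, you should find the compiled static libraries in the build/release/ subdirectory (some MuPDF versions place OS=wasm builds under build/wasm/release/ instead; if so, adjust the paths in the later steps):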

  • build/release/libmupdf.a
  • build/release/libmupdf-third.a

These are archives of object files compiled for the WebAssembly target, ready to be linked into a Wasm module.

Step 4: Write Your C Wrapper (main.c)

We need a C file that will act as the bridge between JavaScript and the MuPDF library functions. Create a file named main.c (e.g., in the directory above the mupdf source folder).

This example C code defines a function get_pdf_page_count that takes a PDF file's data (as a buffer from JavaScript), uses MuPDF to open it, gets the page count, and returns it.

// main.c
#include <stdio.h>
#include <stdlib.h>
#include "mupdf/fitz.h"
#include <emscripten.h>

/**
 * Gets the page count of a PDF document from memory buffer
 * 
 * @param pdf_data Pointer to PDF data in Wasm memory
 * @param data_size Size of the PDF data in bytes
 * @return Page count or negative value on error
 */
EMSCRIPTEN_KEEPALIVE
int get_pdf_page_count(unsigned char *pdf_data, int data_size) {
    fz_context *ctx = NULL;
    fz_document *doc = NULL;
    fz_stream *stream = NULL;
    int page_count = -1;
    
    // Initialize MuPDF context
    if (!(ctx = fz_new_context(NULL, NULL, FZ_STORE_UNLIMITED))) {
        fprintf(stderr, "Cannot create MuPDF context\n");
        return -1;
    }
    
    // doc, stream and page_count are assigned inside fz_try and read after it,
    // so mark them with fz_var (MuPDF's exceptions are built on setjmp/longjmp)
    fz_var(doc);
    fz_var(stream);
    fz_var(page_count);

    // Process PDF document
    fz_try(ctx) {
        // Register document handlers
        fz_register_document_handlers(ctx);

        // Create stream from memory buffer
        stream = fz_open_memory(ctx, pdf_data, data_size);

        // Open document from stream
        doc = fz_open_document_with_stream(ctx, ".pdf", stream);

        // Get page count
        page_count = fz_count_pages(ctx, doc);
    }
    fz_always(ctx) {
        // Runs on success and on error; the drop functions accept NULL
        fz_drop_document(ctx, doc);
        fz_drop_stream(ctx, stream);
    }
    fz_catch(ctx) {
        fprintf(stderr, "Error processing PDF: %s\n", fz_caught_message(ctx));
        page_count = -2;
    }

    // Clean up the context last
    fz_drop_context(ctx);
    
    printf("PDF page count: %d\n", page_count);
    return page_count;
}

// Main function for standalone testing
int main() {
    printf("Wasm module loaded. Call get_pdf_page_count() from JavaScript.\n");
    return 0;
}
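
The wrapper signals failure through its return value: -1 when the context cannot be created and -2 when MuPDF throws while opening or reading the document. On the JavaScript side, those codes could be turned into readable messages with a small helper along these lines. This is only a sketch: the name describePageCountResult and the message wording are illustrative assumptions, not part of MuPDF or Emscripten.

```javascript
// Hypothetical helper: maps get_pdf_page_count's return value to a message.
// The codes -1 and -2 come from main.c; the wording is an assumption.
function describePageCountResult(code) {
  if (code >= 0) {
    return { ok: true, pages: code, message: `${code} pages` };
  }
  const reasons = {
    '-1': 'MuPDF context could not be created',
    '-2': 'MuPDF failed to open or read the PDF',
  };
  const reason = reasons[String(code)] || 'Unknown error';
  return { ok: false, pages: null, message: `${reason} (code ${code})` };
}

console.log(describePageCountResult(3).message);
// → 3 pages
console.log(describePageCountResult(-2).message);
// → MuPDF failed to open or read the PDF (code -2)
```

Keeping the mapping in one place means new error codes added to the C wrapper only need one matching entry here.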

Step 5: Compile and Link the Wasm Module

Now, the crucial step: compile main.c and link it with the MuPDF static libraries (libmupdf.a, libmupdf-third.a) into the final main.js (JavaScript glue code) and main.wasm (WebAssembly module). We'll use aggressive optimization flags.

Make sure you run this command from the directory containing main.c (which should typically be the parent directory of the mupdf folder).

# In the directory containing main.c (parent of mupdf/)

emcc main.c \
     mupdf/build/release/libmupdf.a \
     mupdf/build/release/libmupdf-third.a \
     -o main.js \
     -I mupdf/include \
     -s WASM=1 \
     -s EXPORTED_FUNCTIONS="['_get_pdf_page_count', '_malloc', '_free']" \
     -s EXPORTED_RUNTIME_METHODS="['ccall', 'cwrap', 'HEAPU8']" \
     -s ALLOW_MEMORY_GROWTH=1 \
     -s MODULARIZE=1 \
     -s EXPORT_NAME="'createMyMuPDFModule'" \
     -s ASSERTIONS=0 \
     -Oz \
     -flto

Explanation of emcc Flags:

  • main.c: Your C wrapper source file.
  • mupdf/build/release/libmupdf.a, mupdf/build/release/libmupdf-third.a: The static libraries we built earlier.
  • -o main.js: Output base name. Generates main.js and main.wasm.
  • -I mupdf/include: Tells the compiler where to find MuPDF headers (mupdf/fitz.h).
  • -s WASM=1: Emits WebAssembly output. This is the default in current Emscripten releases, so the flag is optional.
  • -s EXPORTED_FUNCTIONS="['_get_pdf_page_count', '_malloc', '_free']": Explicitly lists C functions callable from JavaScript. We need our function, plus malloc and free for memory management between JS and Wasm. Emscripten prefixes each exported C symbol with an underscore, which is why the names start with one.
  • -s EXPORTED_RUNTIME_METHODS="['ccall', 'cwrap', 'HEAPU8']": Exports the Emscripten helpers ccall and cwrap for calling C from JS, plus HEAPU8, the byte view of Wasm memory that the page uses to copy the PDF data in.
  • -s ALLOW_MEMORY_GROWTH=1: Allows the Wasm module's memory heap to expand if needed (PDFs can be large).
  • -s MODULARIZE=1 -s EXPORT_NAME="'createMyMuPDFModule'": Wraps the generated JS in a factory function (createMyMuPDFModule). This avoids polluting the global scope and provides a clean way to instantiate the module using a Promise.
  • -s ASSERTIONS=0: Optimization. Disables runtime checks and assertions in Emscripten code, reducing size. Only use for release builds.
  • -Oz: Optimization. Optimize aggressively for code size.
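  • -flto: Optimization. Enables Link-Time Optimization, which lets the optimizer work across translation units at link time for better dead code removal and inlining. LTO only covers objects that were themselves compiled with -flto, so for the MuPDF libraries to take part you would also need to compile them with it (for example via XCFLAGS in Step 3); otherwise it mainly affects main.c. Significantly increases link time but often yields large size reductions.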

This command might take a while, especially with -flto. Once done, you'll have main.js and main.wasm.
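
One consequence of ALLOW_MEMORY_GROWTH=1 is worth seeing in isolation: when Wasm memory grows, the old ArrayBuffer is detached, so any typed-array view created earlier becomes empty. The sketch below reproduces this with a plain WebAssembly.Memory object, without MuPDF or Emscripten; the function name demonstrateStaleView is made up for illustration.

```javascript
// Standalone demonstration using the standard WebAssembly.Memory API.
// demonstrateStaleView is a hypothetical name for this illustration.
function demonstrateStaleView() {
  // One 64 KiB page to start, allowed to grow to four pages
  const memory = new WebAssembly.Memory({ initial: 1, maximum: 4 });

  const staleView = new Uint8Array(memory.buffer);
  staleView[0] = 42;

  // Growing detaches the old buffer; views onto it drop to length 0
  memory.grow(1);

  // A view created after the growth sees the full, larger memory
  const freshView = new Uint8Array(memory.buffer);

  return {
    staleLength: staleView.length,
    freshLength: freshView.length,
    firstByte: freshView[0],
  };
}

console.log(demonstrateStaleView());
// → { staleLength: 0, freshLength: 131072, firstByte: 42 }
```

In practice this suggests reading the module's HEAPU8 property at the moment you copy data, rather than caching the view in a variable, since Emscripten's glue code is expected to refresh that property after memory grows.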

Step 6: Create the HTML Interface (index.html)

Now, create an HTML file to load and interact with your Wasm module. This file provides a file input, calls the exported C function, and displays the result.

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>MuPDF Wasm Test</title>
    <style>
        body { font-family: sans-serif; line-height: 1.5; padding: 1em; }
        #status, #result { font-weight: bold; }
        #error { color: red; font-weight: bold; margin-top: 1em;}
        input[type="file"] { margin-top: 10px; margin-bottom: 10px; display: block; }
        pre { background-color: #eee; padding: 10px; border-radius: 4px; white-space: pre-wrap; word-wrap: break-word; }
    </style>
</head>
<body>
    <h1>MuPDF WebAssembly Test</h1>
    <p>Build MuPDF from source for Wasm, call a C function to get the page count of a selected PDF.</p>
    <p><em>Note: Built with a reduced set of bundled fonts for smaller size. Text in PDFs that do not embed their fonts might not render correctly.</em></p>

    <input type="file" id="pdf-input" accept=".pdf">
    <div>Status: <span id="status">Loading Wasm module...</span></div>
    <div>Result: <span id="result">--</span></div>
    <div id="error"></div>

    <h2>Browser Console Logs:</h2>
    <pre id="console-log">Logs will appear here...</pre>

    <!-- Include the generated JS glue code -->
    <script src="main.js"></script>

    <script>
        const pdfInput = document.getElementById('pdf-input');
        const statusSpan = document.getElementById('status');
        const resultSpan = document.getElementById('result');
        const errorSpan = document.getElementById('error');
        const consoleLog = document.getElementById('console-log');

        let myModule = null; // To hold the loaded Wasm module instance

        // --- Log C's printf/stderr to the page ---
        function logToPage(text) {
            consoleLog.textContent += text + "\n";
            // Auto-scroll
            consoleLog.scrollTop = consoleLog.scrollHeight;
        }

        // --- Initialize Wasm Module ---
        pdfInput.disabled = true; // Disable input until module is ready
        consoleLog.textContent = "Initializing Emscripten module...\n";

        // Use the factory function created by EXPORT_NAME
        createMyMuPDFModule({
            // Redirect stdout and stderr from C/Wasm to our logging function
            print: logToPage,
            printErr: logToPage,
        }).then(Module => {
            logToPage("Emscripten module loaded successfully.");
            console.log("MuPDF Wasm module instance:", Module);
            myModule = Module; // Store the instance
            statusSpan.textContent = "Module ready. Select a PDF file.";
            pdfInput.disabled = false; // Enable the file input
        }).catch(e => {
             logToPage("ERROR loading Wasm module: " + e);
             console.error("Error loading Wasm module:", e);
             statusSpan.textContent = "Error loading Wasm module!";
             errorSpan.textContent = "Failed to initialize the PDF processing engine: " + e;
        });

        // --- Handle File Input ---
        pdfInput.addEventListener('change', (event) => {
            const file = event.target.files[0];
            if (!file) return; // No file selected
            if (!myModule) { // Should not happen if input is enabled correctly
                 errorSpan.textContent = "Error: Wasm module not ready.";
                 logToPage("Error: File selected but Wasm module not ready.");
                 return;
            }

            statusSpan.textContent = `Reading file: ${file.name}...`;
            resultSpan.textContent = "--"; // Reset result
            errorSpan.textContent = "";   // Clear previous errors
            logToPage(`--------------------\nSelected file: ${file.name} (${(file.size / 1024).toFixed(1)} KB)`);

            const reader = new FileReader();

            reader.onload = function(e) {
                statusSpan.textContent = `Processing ${file.name}...`;
                logToPage("File read completed. Preparing data for Wasm...");
                const pdfData = new Uint8Array(e.target.result);
                let dataPtr = 0; // Pointer in Wasm memory

                try {
                    // 1. Allocate memory in the Wasm heap
                    logToPage(`Allocating ${pdfData.length} bytes in Wasm heap...`);
                    dataPtr = myModule._malloc(pdfData.length);
                    if (dataPtr === 0) {
                        throw new Error("Failed to allocate memory in Wasm heap (_malloc returned 0).");
                    }
                    logToPage(`Allocated memory at address: ${dataPtr}`);

                    // 2. Copy PDF data from JS ArrayBuffer to Wasm heap
                    logToPage("Copying PDF data to Wasm heap...");
                    // Module.HEAPU8 provides a view into the Wasm memory as bytes
                    myModule.HEAPU8.set(pdfData, dataPtr);
                    logToPage("Data copied.");

                    // 3. Call the exported C function via ccall
                    logToPage(`Calling C function: _get_pdf_page_count(ptr=${dataPtr}, size=${pdfData.length})...`);
                    // ccall signature: functionName, returnType, argTypes, args
                    const pageCount = myModule.ccall(
                        'get_pdf_page_count', // C function name without the leading underscore (ccall adds it)
                        'number',            // Return type (int -> number)
                        ['number', 'number'], // Argument types (pointer -> number, int -> number)
                        [dataPtr, pdfData.length] // Arguments
                    );
                    logToPage(`C function returned: ${pageCount}`);

                    // 4. Process the result
                    if (pageCount >= 0) {
                        resultSpan.textContent = `${pageCount} pages`;
                        statusSpan.textContent = "Processing complete.";
                        logToPage("Successfully retrieved page count.");
                    } else {
                         // Handle specific error codes from C
                         throw new Error(`MuPDF C function failed (returned code: ${pageCount}). Check logs above.`);
                    }

                } catch (err) {
                     console.error("Error during Wasm execution:", err);
                     logToPage("ERROR during Wasm execution: " + err.message);
                     errorSpan.textContent = `Runtime Error: ${err.message}`;
                     statusSpan.textContent = "Processing failed.";
                     resultSpan.textContent = "--";
                } finally {
                    // 5. IMPORTANT: Free the allocated memory in Wasm heap
                    if (dataPtr !== 0) {
                        logToPage(`Freeing Wasm memory at address: ${dataPtr}...`);
                        myModule._free(dataPtr);
                        logToPage("Memory freed.");
                    }
                    logToPage("Finished processing request.");
                }
            }; // end reader.onload

            reader.onerror = function() {
                 const errorMsg = `Error reading file: ${reader.error}`;
                 console.error(errorMsg);
                 logToPage("ERROR reading file: " + reader.error);
                 errorSpan.textContent = errorMsg;
                 statusSpan.textContent = "File reading failed.";
                 resultSpan.textContent = "--";
            };

            // Start reading the file
            reader.readAsArrayBuffer(file);
        }); // end event listener
    </script>
</body>
</html>

Key parts of the JavaScript:

  1. Module Loading: createMyMuPDFModule({...}).then(...) asynchronously loads and initializes the Wasm module. The file input is disabled until this completes. We redirect print and printErr to see printf/stderr from C.
  2. File Reading: FileReader reads the selected PDF as an ArrayBuffer.
  3. Memory Management (JS -> Wasm):
    • myModule._malloc(size) allocates memory inside the Wasm module's heap.
    • myModule.HEAPU8.set(data, ptr) copies the byte data from the JavaScript Uint8Array into the allocated Wasm memory location.
    • myModule._free(ptr) releases the memory allocated in Wasm. This is crucial to prevent memory leaks! Use a try...finally block to ensure _free is called even if errors occur.
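  4. Calling C: myModule.ccall(funcName, returnType, argTypes, args) invokes the exported C function. The name is passed without the leading underscore ('get_pdf_page_count'); pointers and ints both map to the 'number' type.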
  5. Error Handling: Checks return values and uses try...catch to handle potential runtime errors during Wasm interaction.
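
The allocate, copy, call, free sequence above can be packaged into one reusable function. The sketch below uses cwrap instead of ccall so the export is looked up only once. It assumes a module object shaped like the one createMyMuPDFModule resolves to (with cwrap, _malloc, _free, and HEAPU8); makePageCounter and the tiny stand-in module used to exercise it are hypothetical names for illustration only.

```javascript
// Hypothetical helper: wraps malloc/copy/call/free around the C function.
// Assumes `module` exposes cwrap, _malloc, _free and HEAPU8.
function makePageCounter(module) {
  // cwrap resolves the export once and returns a plain JS function
  const getPageCount = module.cwrap(
    'get_pdf_page_count', 'number', ['number', 'number']);

  return function countPages(pdfBytes) {
    if (pdfBytes.length === 0) {
      throw new Error('Empty input');
    }
    const ptr = module._malloc(pdfBytes.length);
    if (ptr === 0) {
      throw new Error('_malloc returned 0');
    }
    try {
      // Read HEAPU8 at the time of use so the view is never stale
      module.HEAPU8.set(pdfBytes, ptr);
      return getPageCount(ptr, pdfBytes.length);
    } finally {
      // Always release the Wasm-side buffer, even if the call throws
      module._free(ptr);
    }
  };
}

// Stand-in module for trying the helper without a real Wasm build.
// Its fake "C function" just returns the number of bytes it was given.
const standInModule = {
  HEAPU8: new Uint8Array(64),
  freed: [],
  _malloc() { return 8; },
  _free(ptr) { this.freed.push(ptr); },
  cwrap() { return (ptr, len) => len; },
};

const countPagesDemo = makePageCounter(standInModule);
console.log(countPagesDemo(new Uint8Array([1, 2, 3])));
// → 3
```

With the real module, the same call would look like makePageCounter(myModule)(pdfData) inside reader.onload, with the try...catch around it handling the user-facing error reporting.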

Step 7: Run the Example

Browsers need these files served over HTTP(S): the generated main.js fetches main.wasm, which browsers typically block for pages opened via file://, and streaming compilation expects the application/wasm MIME type. You can't just open the index.html file directly from disk.

  1. Make sure index.html, main.js, and main.wasm are in the same directory.
  2. Open your terminal in that directory.
  3. Start a simple local web server. Examples:
    • Using Python 3: python3 -m http.server 8000
    • Using Node.js (requires http-server): npm install -g http-server (if needed), then http-server . -p 8000
  4. Open your web browser and navigate to http://localhost:8000.

You should see the page load, the Wasm module initialize (check the status and console log area), and then you can select a PDF file to get its page count!
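
If you would rather control the Content-Type header yourself, a few lines of Node.js are enough. The following is a minimal sketch built on Node's standard http, fs, and path modules, not a production server; contentTypeFor and createStaticServer are hypothetical names, and the MIME table only covers the three files this example uses.

```javascript
const http = require('http');
const fs = require('fs');
const path = require('path');

// Only the types this example needs; extend as required
const MIME_TYPES = {
  '.html': 'text/html; charset=utf-8',
  '.js': 'text/javascript',
  '.wasm': 'application/wasm',
};

function contentTypeFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return MIME_TYPES[ext] || 'application/octet-stream';
}

// Hypothetical helper: returns an http.Server that serves files from rootDir
function createStaticServer(rootDir) {
  const root = path.resolve(rootDir);
  return http.createServer((req, res) => {
    let urlPath;
    try {
      urlPath = decodeURIComponent(new URL(req.url, 'http://localhost').pathname);
    } catch (err) {
      res.writeHead(400);
      res.end('Bad request');
      return;
    }
    const relative = urlPath === '/' ? 'index.html' : urlPath.slice(1);
    const filePath = path.resolve(root, relative);
    // Refuse anything that resolves outside the served directory
    if (!filePath.startsWith(root + path.sep)) {
      res.writeHead(403);
      res.end('Forbidden');
      return;
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404);
        res.end('Not found');
        return;
      }
      res.writeHead(200, { 'Content-Type': contentTypeFor(filePath) });
      res.end(data);
    });
  });
}
```

To use it, call createStaticServer('.').listen(8000) from the directory holding index.html, main.js, and main.wasm, then open http://localhost:8000 as before.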

Key Takeaways & Further Steps

  • Build from Source: Gives you control over features and optimizations.
  • Emscripten: The bridge between C/C++ and WebAssembly. emmake and emcc are key tools.
  • Static Libraries: Complex projects are often built as static libraries (.a) first, then linked into the final Wasm module.
  • Size Optimization: -Oz, -flto, -s ASSERTIONS=0, and disabling unused features (especially bundled fonts via the TOFU defines and unused formats via FZ_ENABLE_*=0) are critical for reducing Wasm file size. Remember the trade-offs!
  • JS/Wasm Interaction: Requires careful memory management (_malloc, HEAPU8.set, _free) and using Emscripten's calling conventions (ccall/cwrap).
  • Server Compression: Configure your web server to serve .wasm files with Gzip or Brotli compression for significantly faster downloads.
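
To get a feel for what server compression could save before touching any server configuration, you can compress the file locally with Node's built-in zlib module. This is a rough sketch: compressionReport is a hypothetical helper name, and the sizes it reports depend entirely on your own build.

```javascript
const zlib = require('zlib');

// Hypothetical helper: reports raw, gzip and Brotli sizes for a byte buffer
function compressionReport(bytes) {
  const gzip = zlib.gzipSync(bytes, { level: 9 });
  const brotli = zlib.brotliCompressSync(bytes, {
    params: { [zlib.constants.BROTLI_PARAM_QUALITY]: 11 },
  });
  return { raw: bytes.length, gzip: gzip.length, brotli: brotli.length };
}

// Try it on synthetic, highly repetitive data
const sample = Buffer.alloc(20000, 'mupdf');
console.log(compressionReport(sample));
```

For the real module, pass in the file contents instead, for example compressionReport(require('fs').readFileSync('main.wasm')).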

From here, you could extend the C wrapper (main.c) to expose more MuPDF functionality like rendering pages to images, extracting text, or handling annotations, rebuilding and relinking as needed. Happy coding!