Fix: Embedding API Validation Error

Date: 2026-02-10
Issue: Pipeline processing failed: Embedding error: API error: ‘$.input’ is invalid
Status: ✅ Fixed
Commit: 5b6bcd6a

When processing documents, the pipeline occasionally failed with:

Pipeline processing failed: Embedding error: API error: '$.input' is invalid. Please check the AP...

This error occurred when embedding providers (OpenAI, Ollama, etc.) received arrays containing empty or whitespace-only strings. API validation rejected these invalid inputs.

The embedding pipeline was passing all text strings to the API without filtering, including:

  • Empty strings ("")
  • Whitespace-only strings (" ", "\n", "\t")
  • Strings that became empty after .trim()

External APIs (OpenAI, Gemini, Jina, etc.) validate input and reject empty strings in the input array.
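The rejected inputs listed above all fail the same check, which can be sketched as a small predicate. The names `is_embeddable` and `count_rejected` are hypothetical helpers for illustration and are not taken from the codebase.

```rust
/// Returns true when `text` contains at least one non-whitespace character.
/// `str::trim` strips Unicode whitespace, so " ", "\n" and "\t" all count as blank.
fn is_embeddable(text: &str) -> bool {
    !text.trim().is_empty()
}

/// Counts the inputs that would be filtered out before an API call
/// (hypothetical helper, useful for spotting bad batches).
fn count_rejected(texts: &[String]) -> usize {
    texts.iter().filter(|t| !is_embeddable(t)).count()
}
```

Because the check runs on the trimmed view only, the original string is left untouched and valid texts are still sent to the provider exactly as they were received.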

All embedding providers now:

  1. Filter invalid inputs before API calls:

    let valid_texts: Vec<(usize, &String)> = texts
        .iter()
        .enumerate()
        .filter(|(_, text)| !text.trim().is_empty())
        .collect();

  2. Handle all-empty case gracefully:

    if valid_texts.is_empty() {
        return Ok(vec![vec![0.0; self.embedding_dimension]; texts.len()]);
    }

  3. Map results back to original indices:

    let mut result = vec![vec![0.0; self.embedding_dimension]; texts.len()];
    for ((orig_idx, _), embedding) in valid_texts.iter().zip(api_embeddings) {
        result[*orig_idx] = embedding;
    }
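The three steps can be combined into one function. The following is a minimal sketch, not the providers' actual code: `embed_filtered`, the `call_api` closure (standing in for the HTTP request), and the `EmbedError` type are all assumptions made for illustration. It also adds a length check that the snippets above do not show, since `zip` would otherwise silently drop entries if the API returned fewer embeddings than requested.

```rust
/// Hypothetical error type standing in for the provider's real one.
#[derive(Debug, PartialEq)]
enum EmbedError {
    CountMismatch { expected: usize, got: usize },
}

/// Embeds `texts`, skipping blank entries and leaving zero vectors in their place.
/// `call_api` is a stand-in for the provider request (an assumption of this sketch).
fn embed_filtered<F>(
    texts: &[String],
    dimension: usize,
    call_api: F,
) -> Result<Vec<Vec<f32>>, EmbedError>
where
    F: FnOnce(&[&str]) -> Result<Vec<Vec<f32>>, EmbedError>,
{
    // 1. Keep only non-blank texts, remembering their original positions.
    let valid: Vec<(usize, &str)> = texts
        .iter()
        .enumerate()
        .map(|(i, t)| (i, t.as_str()))
        .filter(|(_, t)| !t.trim().is_empty())
        .collect();

    // Start from zero vectors so blank inputs already hold their final value.
    let mut result = vec![vec![0.0_f32; dimension]; texts.len()];

    // 2. Nothing to embed: return without calling the API.
    if valid.is_empty() {
        return Ok(result);
    }

    let payload: Vec<&str> = valid.iter().map(|(_, t)| *t).collect();
    let embeddings = call_api(&payload)?;

    // Extra guard (not in the original snippets): reject a short or long response.
    if embeddings.len() != valid.len() {
        return Err(EmbedError::CountMismatch {
            expected: valid.len(),
            got: embeddings.len(),
        });
    }

    // 3. Write each embedding back to its original index.
    for ((orig_idx, _), embedding) in valid.iter().zip(embeddings) {
        result[*orig_idx] = embedding;
    }
    Ok(result)
}
```

Passing the request as a closure keeps the sketch independent of any HTTP client, which also makes the all-empty path easy to test: a closure that panics proves the API is never called.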

All embedding providers were updated:

  • ✅ OpenAI (openai.rs)
  • ✅ Ollama (ollama.rs)
  • ✅ Gemini (gemini.rs)
  • ✅ Jina (jina.rs)
  • ✅ Azure OpenAI (azure_openai.rs)
  • ✅ LM Studio (lmstudio.rs)
  • ✅ Mock Provider (mock.rs)

All 201 tests pass:

cd edgequake
cargo test --package edgequake-llm --lib
# Result: ok. 201 passed; 0 failed; 0 ignored

To verify manually:

  1. Start the backend:

    make dev
  2. Upload a problematic PDF document

  3. Verify the document processes successfully without embedding errors

  4. Check backend logs:

    tail -f /tmp/edgequake-backend.log

Expected: No “Embedding error” messages, document status shows “Completed”

Input Case           Behavior
All strings valid    Normal processing, all strings embedded
Some strings empty   Empty strings get zero vectors, others processed normally
All strings empty    Return array of zero vectors (dimension-matched)
Whitespace-only      Treated as empty, receives zero vector
Mixed valid/invalid  Valid strings embedded, invalid get zero vectors

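The rows of this table describe an observable contract that is straightforward to check against a deterministic stand-in. The sketch below is a hypothetical mock (it is not the project's `mock.rs`): `mock_embed` fills each vector with the trimmed text length so that valid and blank inputs are easy to tell apart, and `is_zero` is an illustrative helper.

```rust
/// Hypothetical mock: a valid text maps to a vector filled with its trimmed
/// length, a blank text maps to a zero vector. Output length equals input length.
fn mock_embed(texts: &[&str], dimension: usize) -> Vec<Vec<f32>> {
    texts
        .iter()
        .map(|t| {
            let trimmed = t.trim();
            if trimmed.is_empty() {
                vec![0.0; dimension]
            } else {
                vec![trimmed.len() as f32; dimension]
            }
        })
        .collect()
}

/// True when every component of `v` is exactly zero.
fn is_zero(v: &[f32]) -> bool {
    v.iter().all(|x| *x == 0.0)
}
```

With a mixed batch such as `["alpha", "", "  \n", "be"]`, positions 1 and 2 come back as zero vectors while positions 0 and 3 keep their embeddings, and the output always has four entries.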
  • Negligible overhead: One additional filter() pass over input array
  • API call reduction: Fewer strings sent to API when some are empty
  • Consistency: Output array size always matches input array size

Clippy: No warnings
Tests: All 201 tests pass
Consistency: All providers use same pattern

Consider:

  1. Log warning when many empty strings are filtered (potential data quality issue)
  2. Add telemetry to track how often filtering occurs
  3. Upstream validation in chunking/extraction to prevent empty strings earlier
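The first suggestion could look roughly like the sketch below. The function names and the threshold parameter are assumptions for illustration. The sketch returns the message instead of logging it, so it does not presume which logging facility the project uses.

```rust
/// Fraction of inputs that are empty or whitespace-only (0.0 for an empty batch).
fn blank_ratio(texts: &[String]) -> f64 {
    if texts.is_empty() {
        return 0.0;
    }
    let blank = texts.iter().filter(|t| t.trim().is_empty()).count();
    blank as f64 / texts.len() as f64
}

/// Returns a warning message when the blank ratio exceeds `threshold`
/// (an assumed, tunable value), and `None` otherwise.
fn blank_warning(texts: &[String], threshold: f64) -> Option<String> {
    let ratio = blank_ratio(texts);
    (ratio > threshold).then(|| {
        format!(
            "{:.0}% of {} embedding inputs were blank",
            ratio * 100.0,
            texts.len()
        )
    })
}
```

A caller would pass any returned message to its own logger. A persistently high ratio points at the chunking or extraction stage, which is where the third suggestion would remove blank strings before they reach the embedding step.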

This fix prevents:

  • OpenAI API errors: $.input is invalid
  • Ollama API errors: invalid input
  • Gemini API errors: empty text not allowed

Completed:

  • All providers filter empty strings
  • Results mapped back to correct indices
  • Zero vectors returned for empty inputs
  • Array size consistency maintained
  • Tests pass
  • Clippy clean
  • Documentation updated