Improve relevancy when working with large documents

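    Meilisearch is optimized for handling paragraph-sized chunks of text. Datasets with many documents containing large amounts of text may lead to reduced search result relevancy.

    In this guide, you will see how to use JavaScript with Node.js to split a single large document, and how to configure Meilisearch with a distinct attribute to prevent duplicated results.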

    Requirements

    Dataset

    stories.json contains two documents, each storing the full text of a short story in its text field:

    [
      {
        "id": 0,
        "title": "A Haunted House",
        "author": "Virginia Woolf",
        "text": "Whatever hour you woke there was a door shutting. From room to room they went, hand in hand, lifting here, opening there, making sure—a ghostly couple.\n\n \"Here we left it,\" she said. And he added, \"Oh, but here too!\" \"It's upstairs,\" she murmured. \"And in the garden,\" he whispered. \"Quietly,\" they said, \"or we shall wake them.\"\n\nBut it wasn't that you woke us. Oh, no. \"They're looking for it; they're drawing the curtain,\" one might say, and so read on a page or two. \"Now they've found it,\" one would be certain, stopping the pencil on the margin. And then, tired of reading, one might rise and see for oneself, the house all empty, the doors standing open, only the wood pigeons bubbling with content and the hum of the threshing machine sounding from the farm. \"What did I come in here for? What did I want to find?\" My hands were empty. \"Perhaps it's upstairs then?\" The apples were in the loft. And so down again, the garden still as ever, only the book had slipped into the grass.\n\nBut they had found it in the drawing room. Not that one could ever see them. The window panes reflected apples, reflected roses; all the leaves were green in the glass. If they moved in the drawing room, the apple only turned its yellow side. Yet, the moment after, if the door was opened, spread about the floor, hung upon the walls, pendant from the ceiling—what? My hands were empty. The shadow of a thrush crossed the carpet; from the deepest wells of silence the wood pigeon drew its bubble of sound. \"Safe, safe, safe,\" the pulse of the house beat softly. \"The treasure buried; the room ...\" the pulse stopped short. Oh, was that the buried treasure?\n\nA moment later the light had faded. Out in the garden then? But the trees spun darkness for a wandering beam of sun. So fine, so rare, coolly sunk beneath the surface the beam I sought always burnt behind the glass. Death was the glass; death was between us; coming to the woman first, hundreds of years ago, leaving the house, sealing all the windows; the rooms were darkened. He left it, left her, went North, went East, saw the stars turned in the Southern sky; sought the house, found it dropped beneath the Downs. \"Safe, safe, safe,\" the pulse of the house beat gladly. \"The Treasure yours.\"\n\nThe wind roars up the avenue. Trees stoop and bend this way and that. Moonbeams splash and spill wildly in the rain. But the beam of the lamp falls straight from the window. The candle burns stiff and still. Wandering through the house, opening the windows, whispering not to wake us, the ghostly couple seek their joy.\n\n\"Here we slept,\" she says. And he adds, \"Kisses without number.\" \"Waking in the morning—\" \"Silver between the trees—\" \"Upstairs—\" \"In the garden—\" \"When summer came—\" \"In winter snowtime—\" The doors go shutting far in the distance, gently knocking like the pulse of a heart.\n\nNearer they come; cease at the doorway. The wind falls, the rain slides silver down the glass. Our eyes darken; we hear no steps beside us; we see no lady spread her ghostly cloak. His hands shield the lantern. \"Look,\" he breathes. \"Sound asleep. Love upon their lips.\"\n\nStooping, holding their silver lamp above us, long they look and deeply. Long they pause. The wind drives straightly; the flame stoops slightly. Wild beams of moonlight cross both floor and wall, and, meeting, stain the faces bent; the faces pondering; the faces that search the sleepers and seek their hidden joy.\n\n\"Safe, safe, safe,\" the heart of the house beats proudly. \"Long years—\" he sighs. \"Again you found me.\" \"Here,\" she murmurs, \"sleeping; in the garden reading; laughing, rolling apples in the loft. Here we left our treasure—\" Stooping, their light lifts the lids upon my eyes. \"Safe! safe! safe!\" the pulse of the house beats wildly. Waking, I cry \"Oh, is this _your_ buried treasure? The light in the heart."
      },
      {
        "id": 1,
        "title": "Monday or Tuesday",
        "author": "Virginia Woolf",
        "text": "Lazy and indifferent, shaking space easily from his wings, knowing his way, the heron passes over the church beneath the sky. White and distant, absorbed in itself, endlessly the sky covers and uncovers, moves and remains. A lake? Blot the shores of it out! A mountain? Oh, perfect—the sun gold on its slopes. Down that falls. Ferns then, or white feathers, for ever and ever——\n\nDesiring truth, awaiting it, laboriously distilling a few words, for ever desiring—(a cry starts to the left, another to the right. Wheels strike divergently. Omnibuses conglomerate in conflict)—for ever desiring—(the clock asseverates with twelve distinct strokes that it is midday; light sheds gold scales; children swarm)—for ever desiring truth. Red is the dome; coins hang on the trees; smoke trails from the chimneys; bark, shout, cry \"Iron for sale\"—and truth?\n\nRadiating to a point men's feet and women's feet, black or gold-encrusted—(This foggy weather—Sugar? No, thank you—The commonwealth of the future)—the firelight darting and making the room red, save for the black figures and their bright eyes, while outside a van discharges, Miss Thingummy drinks tea at her desk, and plate-glass preserves fur coats——\n\nFlaunted, leaf-light, drifting at corners, blown across the wheels, silver-splashed, home or not home, gathered, scattered, squandered in separate scales, swept up, down, torn, sunk, assembled—and truth?\n\nNow to recollect by the fireside on the white square of marble. From ivory depths words rising shed their blackness, blossom and penetrate. Fallen the book; in the flame, in the smoke, in the momentary sparks—or now voyaging, the marble square pendant, minarets beneath and the Indian seas, while space rushes blue and stars glint—truth? or now, content with closeness?\n\nLazy and indifferent the heron returns; the sky veils her stars; then bares them."
      }
    ]
    
    What is a large document for Meilisearch?

    Meilisearch works best with documents under 1 kB in size. This roughly corresponds to at most two or three paragraphs of text.
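
    To get a rough idea of which documents in a dataset exceed that size, a check along the following lines may help. This is only a sketch: the helper names and sample documents are hypothetical, and it assumes that "1 kB" means 1,024 bytes measured on the JSON-serialized document.

```javascript
// Assumption: "1 kB" is treated as 1,024 bytes of JSON-serialized UTF-8 text.
const SIZE_LIMIT_BYTES = 1024;

// Hypothetical helper: size of a document once serialized to JSON
function documentSizeInBytes(document) {
  return Buffer.byteLength(JSON.stringify(document), "utf8");
}

// Hypothetical helper: returns the documents that exceed the limit
function findLargeDocuments(documents, limit = SIZE_LIMIT_BYTES) {
  return documents.filter((document) => documentSizeInBytes(document) > limit);
}

// Made-up sample documents, not taken from stories.json
const sampleDocuments = [
  { id: 0, title: "Short", text: "A single sentence." },
  { id: 1, title: "Long", text: "word ".repeat(500) },
];

const largeDocuments = findLargeDocuments(sampleDocuments);
console.log(largeDocuments.map((document) => document.id)); // [ 1 ]
```

    Documents flagged this way are candidates for the splitting step described next.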

    Split documents

    Create a split_documents.js file in your working directory:

    #!/usr/bin/env node

    // Node.js file system module, needed to read and write the datasets
    const fs = require("fs");

    // The first command-line argument is the path to the JSON dataset
    const datasetPath = process.argv[2];
    const datasetFile = fs.readFileSync(datasetPath);
    const documents = JSON.parse(datasetFile);

    const splitDocuments = [];

    for (let documentNumber = documents.length, i = 0; i < documentNumber; i += 1) {
      const document = documents[i];
      const story = document.text;

      // Paragraphs are separated by a blank line
      const paragraphs = story.split("\n\n");

      // Create one new document per paragraph
      for (let paragraphNumber = paragraphs.length, o = 0; o < paragraphNumber; o += 1) {
        splitDocuments.push({
          "id": document.id,
          "title": document.title,
          "author": document.author,
          "text": paragraphs[o]
        });
      }
    }

    fs.writeFileSync("stories-split.json", JSON.stringify(splitDocuments));
    

    Next, run the script on your console, specifying the path to your JSON dataset:

    node ./split_documents.js ./stories.json
    

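    This script accepts one argument: the path to your JSON dataset. It reads the file and parses each document in it. For each paragraph in a document's text field, it creates a new document that copies the original id, title, and author and stores only that paragraph in its text field. Finally, it writes the new documents to stories-split.json.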
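
    The effect of the split can be traced on a much smaller input. The sketch below applies the same paragraph split in memory to a made-up document (not one of the stories above) and inspects the result. Note that both resulting documents carry the same id.

```javascript
// Made-up input, much shorter than the stories in stories.json
const documents = [
  {
    id: 0,
    title: "Example story",
    author: "Example author",
    text: "First paragraph.\n\nSecond paragraph.",
  },
];

const splitDocuments = [];

for (const document of documents) {
  // Same rule as the script: a blank line separates paragraphs
  for (const paragraph of document.text.split("\n\n")) {
    splitDocuments.push({
      id: document.id,
      title: document.title,
      author: document.author,
      text: paragraph,
    });
  }
}

console.log(splitDocuments.length); // 2
console.log(splitDocuments.map((document) => document.id)); // [ 0, 0 ]
```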

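    Generate unique IDs

    Right now, Meilisearch would not accept the new dataset because many documents share the same primary key.

    Update the script from the previous step to create a new field, story_id: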

    #!/usr/bin/env node

    // Node.js file system module, needed to read and write the datasets
    const fs = require("fs");

    const datasetPath = process.argv[2];
    const datasetFile = fs.readFileSync(datasetPath);
    const documents = JSON.parse(datasetFile);

    const splitDocuments = [];

    for (let documentNumber = documents.length, i = 0; i < documentNumber; i += 1) {
      const document = documents[i];
      const story = document.text;

      const paragraphs = story.split("\n\n");

      for (let paragraphNumber = paragraphs.length, o = 0; o < paragraphNumber; o += 1) {
        splitDocuments.push({
          // Keep a reference to the original story
          "story_id": document.id,
          // Combine the story id and paragraph index into a unique primary key
          "id": `${document.id}-${o}`,
          "title": document.title,
          "author": document.author,
          "text": paragraphs[o]
        });
      }
    }

    fs.writeFileSync("stories-split.json", JSON.stringify(splitDocuments));
    

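    The script now stores the id of the original document in story_id. It then creates a new unique identifier for each new document and stores it in the primary key field, id.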
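
    Before sending the dataset to Meilisearch, it may be worth confirming that no two documents share an id. The following is a minimal sketch of such a check. The helper name and the sample arrays are hypothetical and only mirror the shape of the documents produced by the two versions of the script.

```javascript
// Hypothetical helper: returns every id that appears more than once
function findDuplicateIds(documents) {
  const seen = new Set();
  const duplicates = new Set();
  for (const { id } of documents) {
    if (seen.has(id)) duplicates.add(id);
    seen.add(id);
  }
  return [...duplicates];
}

// Shape produced by the first version of the script: ids repeat
const beforeFix = [{ id: 0 }, { id: 0 }, { id: 1 }];

// Shape produced by the updated script: "<story id>-<paragraph index>"
const afterFix = [
  { story_id: 0, id: "0-0" },
  { story_id: 0, id: "0-1" },
  { story_id: 1, id: "1-0" },
];

console.log(findDuplicateIds(beforeFix)); // [ 0 ]
console.log(findDuplicateIds(afterFix)); // []
```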

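    Configure distinct attribute

    This dataset is now valid, but since several documents point to the same story, queries are likely to return duplicated search results.

    To prevent that from happening, configure story_id as the index's distinct attribute: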

    curl \
      -X PUT 'http://127.0.0.1:7700/indexes/INDEX_NAME/settings/distinct-attribute' \
      -H 'Content-Type: application/json' \
      --data-binary '"story_id"'
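
    Since the rest of this guide uses Node.js, the same request can also be sketched in JavaScript. The helper name below is hypothetical, the host is assumed to be a local instance reachable over plain HTTP, and INDEX_NAME remains a placeholder. The example only builds the request, and the commented line at the end shows how it might be sent with the fetch function available in Node.js 18 and later.

```javascript
// Hypothetical helper: builds the request that sets an index's distinct attribute.
// All three arguments are placeholders to replace with your own values.
function buildDistinctAttributeRequest(host, indexName, attribute) {
  return {
    url: `${host}/indexes/${encodeURIComponent(indexName)}/settings/distinct-attribute`,
    options: {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      // The request body is a JSON string, for example "story_id" including the quotes
      body: JSON.stringify(attribute),
    },
  };
}

// Assumed host: a local Meilisearch instance on port 7700
const request = buildDistinctAttributeRequest("http://127.0.0.1:7700", "INDEX_NAME", "story_id");

console.log(request.url);
console.log(request.options.body);

// To send it against a running instance:
// fetch(request.url, request.options).then((response) => response.json()).then(console.log);
```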
    

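    Users searching this dataset will now be able to find more relevant results across large chunks of text, without any loss of performance and without duplicates.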
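
    Conceptually, a distinct attribute means that at most one document per story_id appears in the results. The sketch below models that effect on a made-up list of hits. It is an illustration of the outcome only, not a description of how Meilisearch implements it, and the function name is hypothetical.

```javascript
// Illustration only: keeps the first hit seen for each story_id
function keepFirstPerStory(hits) {
  const seenStories = new Set();
  return hits.filter((hit) => {
    if (seenStories.has(hit.story_id)) return false;
    seenStories.add(hit.story_id);
    return true;
  });
}

// Made-up hits, assumed to be ordered from most to least relevant
const hits = [
  { id: "0-3", story_id: 0 },
  { id: "0-1", story_id: 0 },
  { id: "1-4", story_id: 1 },
];

console.log(keepFirstPerStory(hits).map((hit) => hit.id)); // [ '0-3', '1-4' ]
```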

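    Conclusion

    You have seen how to split large documents to improve search relevancy. You have also seen how to configure a distinct attribute to prevent Meilisearch from returning duplicated results.

    Though this guide used JavaScript, you can replicate the process with any programming language you are comfortable using.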