Record Extractor

info

The following content is for DocSearch v3 and its new infrastructure. If you are using DocSearch v2 or the docsearch-scraper, see the legacy documentation.

Introduction#

info

This documentation only covers the helpers.docsearch method. See the Algolia Crawler documentation for more information on the Algolia Crawler.

Pages are extracted by a recordExtractor. These extractors are assigned to actions via the recordExtractor parameter. This parameter links to a function that returns the data you want to index, organized in an array of JSON objects.

The helpers are a collection of functions to help you extract content and generate Algolia records.

Useful links#

Usage#

The most common way to use the DocSearch helper is to return its result from the recordExtractor function.

recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},

Complex extractors#

Using the Cheerio instance ($)#

You can also use the provided Cheerio instance ($) to exclude content from the DOM:

recordExtractor: ({ $, helpers }) => {
  // Removing DOM elements we don't want to crawl
  $(".my-warning-message").remove();

  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
},

With fallback DOM selectors#

Each lvlX and content supports fallback selectors as an array of strings, which allows for robust config files:

recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      // `.exists h1` will be selected if `.exists-probably h1` does not exist.
      lvl0: {
        selectors: [".exists-probably h1", ".exists h1"],
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      // `.exists p, .exists li` will be selected.
      content: [
        ".does-not-exists p, .does-not-exists li",
        ".exists p, .exists li",
      ],
    },
  });
},

With custom variables#

Custom variables are useful to filter content in the frontend (version, lang, etc.).

These selectors also support defaultValue and fallback selectors:

recordExtractor: ({ helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      foo: ".bar",
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".does-not-exists",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
      version: {
        // You can send raw values without `selectors`
        defaultValue: ["latest", "stable"],
      },
    },
  });
},

The version, language, and foo attributes of these records will be:

foo: "valueFromBarSelector",
language: ["en", "en-US"],
version: ["latest", "stable"]

You can now use them to filter your search in the frontend.
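
As a rough sketch of what such filtering might look like, the snippet below turns a set of selected values into Algolia facet filter strings of the form `attribute:value`. The `buildFacetFilters` helper is hypothetical and not part of DocSearch; the attribute names are taken from the example above.

```javascript
// Hypothetical helper (not part of DocSearch): turns an object of selected
// values into Algolia facet filter strings of the form "attribute:value".
function buildFacetFilters(selection) {
  return Object.entries(selection).map(([attribute, value]) => {
    if (Array.isArray(value)) {
      // Algolia treats a nested array as an OR between its entries.
      return value.map((v) => `${attribute}:${v}`);
    }
    return `${attribute}:${value}`;
  });
}

const facetFilters = buildFacetFilters({
  language: "en",
  version: ["latest", "stable"],
});
console.log(JSON.stringify(facetFilters));
```

The resulting array could then be supplied as the `facetFilters` search parameter on the frontend, assuming the corresponding attributes are declared for faceting in the index settings.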

With raw text (defaultValue)#

The lvl0 and custom variable selectors also accept a raw fallback value:

recordExtractor: ({ $, helpers }) => {
  return helpers.docsearch({
    recordProps: {
      lvl0: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably h1",
        defaultValue: "myRawTextIfDoesNotExists",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
      // The variables below can be used to filter your search
      language: {
        // It also supports the fallback DOM selectors syntax!
        selectors: ".exists-probably .language",
        // Since custom variables are used for filtering, we allow sending
        // multiple raw values
        defaultValue: ["en", "en-US"],
      },
    },
  });
},
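
The fallback order described in the sections above (try each selector in turn, then fall back to defaultValue) can be pictured with the following sketch. It is not the crawler's actual implementation: `resolveProp` and the `lookup` callback, which stands in for a DOM query and returns the matched texts, are hypothetical names used for illustration only.

```javascript
// Hypothetical sketch, not the crawler's real code.
// `prop` is a string, an array of strings, or { selectors?, defaultValue? }.
// `lookup(selector)` stands in for a DOM query and returns an array of texts.
function resolveProp(prop, lookup) {
  const isObject =
    typeof prop === "object" && prop !== null && !Array.isArray(prop);
  const selectors = isObject ? prop.selectors ?? [] : prop;
  const defaultValue = isObject ? prop.defaultValue : undefined;

  // Try each selector in order and keep the first one that matches.
  for (const selector of [].concat(selectors)) {
    const matches = lookup(selector);
    if (matches.length > 0) return matches;
  }
  // No selector matched: fall back to the raw value(s), if any.
  return defaultValue === undefined ? [] : [].concat(defaultValue);
}

// A fake page where only `.exists h1` matches anything.
const fakePage = { ".exists h1": ["Getting started"] };
const lookup = (selector) => fakePage[selector] ?? [];

console.log(resolveProp([".exists-probably h1", ".exists h1"], lookup));
console.log(
  resolveProp({ selectors: ".missing", defaultValue: ["en", "en-US"] }, lookup)
);
```

Modeling the DOM query as a plain callback keeps the sketch independent of Cheerio, so the ordering logic can be read on its own.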

recordProps API Reference#

lvl0#

type: Lvl0 | required

type Lvl0 = {
  selectors: string | string[];
  defaultValue?: string;
};

lvl1, content#

type: string | string[] | required

lvl2, lvl3, lvl4, lvl5, lvl6#

type: string | string[] | optional

pageRank#

type: string | optional

Custom variables ([k: string])#

type: string | string[] | CustomVariable | optional

type CustomVariable =
  | {
      defaultValue: string | string[];
    }
  | {
      selectors: string | string[];
      defaultValue?: string | string[];
    };

Contains values that can be used as facetFilters.
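
To make the reference above more concrete, here is a small, hypothetical checker that reports which of the listed constraints a recordProps object violates. `checkRecordProps` and `isSelector` are illustrative names only; the crawler performs its own validation, and its exact behavior is not described here.

```javascript
// Hypothetical checker based only on the types listed in this reference.
const isSelector = (v) =>
  typeof v === "string" ||
  (Array.isArray(v) && v.every((s) => typeof s === "string"));

function checkRecordProps(recordProps) {
  const errors = [];
  const { lvl0, lvl1, content } = recordProps;

  // lvl0 is required and must be an object carrying `selectors`.
  if (typeof lvl0 !== "object" || lvl0 === null || !isSelector(lvl0.selectors)) {
    errors.push("lvl0 must be an object with `selectors`");
  }
  // lvl1 and content are required and must be selectors.
  if (!isSelector(lvl1)) errors.push("lvl1 is required");
  if (!isSelector(content)) errors.push("content is required");

  // lvl2 to lvl6 are optional, but must be selectors when present.
  for (const key of ["lvl2", "lvl3", "lvl4", "lvl5", "lvl6"]) {
    if (key in recordProps && !isSelector(recordProps[key])) {
      errors.push(`${key} must be a string or an array of strings`);
    }
  }
  return errors;
}

console.log(
  checkRecordProps({
    lvl0: { selectors: "header h1" },
    lvl1: "article h2",
    content: "main p, main li",
  })
);
console.log(checkRecordProps({ lvl0: "header h1", lvl1: "article h2" }));
```

Returning a list of messages rather than throwing on the first problem makes it easier to see every issue in a config at once.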