Making search on your own Mastodon server as capable as Twitter's

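Out of the box, Mastodon search can essentially only find posts that carry a hashtag. Installing Elasticsearch on the server enables full-text search, but even then you can only search your own posts and posts that involve you. Unlike on Twitter, it is hard to search post text to discover new accounts to follow, which makes the feature less useful. This patch changes the behavior so that full-text search covers every post stored on your own server.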

Server specs

Elasticsearch alone takes at least 1 GB of memory, so if everything runs on a single machine you will want about 4 GB of RAM.

Setting up Elasticsearch

Install Elasticsearch as described in the official documentation.

Installing Elasticsearch

# apt install openjdk-17-jre-headless

# wget -O /usr/share/keyrings/elasticsearch.asc https://artifacts.elastic.co/GPG-KEY-elasticsearch
# echo "deb [signed-by=/usr/share/keyrings/elasticsearch.asc] https://artifacts.elastic.co/packages/7.x/apt stable main" > /etc/apt/sources.list.d/elastic-7.x.list

# apt update
# apt install elasticsearch

# systemctl daemon-reload
# systemctl enable --now elasticsearch

Mastodon environment settings

ES_ENABLED=true
ES_HOST=localhost
ES_PORT=9200

Applying the Mastodon settings

# systemctl restart mastodon-sidekiq
# systemctl reload mastodon-web

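Installing the Sudachi plugin

To make Japanese search work, install the Sudachi plugin into Elasticsearch. First, check the version of the running Elasticsearch instance.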

$ curl -XGET 'http://localhost:9200'

{
	"name" : "mastodon.lithium03.info",
	"cluster_name" : "elasticsearch",
	"cluster_uuid" : "JBd8cuaSSt-2lQAOKzu6SA",
	"version" : {
		"number" : "7.17.9",
		"build_flavor" : "default",
		"build_type" : "deb",
		"build_hash" : "ef48222227ee6b9e70e502f0f0daa52435ee634d",
		"build_date" : "2023-01-31T05:34:43.305517834Z",
		"build_snapshot" : false,
		"lucene_version" : "8.11.1",
		"minimum_wire_compatibility_version" : "6.8.0",
		"minimum_index_compatibility_version" : "6.0.0-beta1"
	},
	"tagline" : "You Know, for Search"
}

The prebuilt plugin binaries do not always match the installed Elasticsearch version, so build the plugin from source.

$ git clone https://github.com/WorksApplications/elasticsearch-sudachi.git
$ cd elasticsearch-sudachi
$ git tag
$ git checkout v3.0.0

Compile against the version number confirmed above.

$ ./gradlew -PelasticsearchVersion=7.17.9 build
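
The version passed to the build has to match the "number" field in the response shown earlier. Instead of copying it by hand, it could be extracted with a small shell function. This is only a sketch: es_version is a hypothetical helper name, and it assumes the JSON layout shown above.

```shell
# Print the value of the "number" field from the JSON that Elasticsearch
# returns at its root URL. Reads the JSON on standard input.
# Assumes the layout shown above; only the first match is printed.
es_version() {
  sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (not run here):
#   ES_VERSION=$(curl -s 'http://localhost:9200' | es_version)
#   ./gradlew -PelasticsearchVersion="$ES_VERSION" build
```

Reading the version from the running instance keeps the plugin build and the installed package in step after an Elasticsearch upgrade.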

$ cd /usr/share/elasticsearch
$ sudo bin/elasticsearch-plugin install file:///path/to/elasticsearch-sudachi/build/distributions/analysis-sudachi-7.17.9-3.0.0.zip

$ wget http://sudachi.s3-website-ap-northeast-1.amazonaws.com/sudachidict/sudachi-dictionary-latest-full.zip
$ unzip sudachi-dictionary-latest-full.zip

$ sudo mkdir -p /etc/elasticsearch/sudachi
$ sudo cp sudachi-dictionary-20230110/system_full.dic /etc/elasticsearch/sudachi/system_core.dic
$ cd /etc/elasticsearch/sudachi
$ sudo wget https://raw.githubusercontent.com/WorksApplications/Sudachi/develop/src/main/resources/sudachi.json
$ sudo wget https://raw.githubusercontent.com/WorksApplications/Sudachi/develop/src/main/resources/char.def

The plugin is now ready, so restart Elasticsearch.

$ sudo systemctl restart elasticsearch.service
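
To confirm that Elasticsearch picked up the plugin after the restart, the output of the _cat/plugins endpoint can be checked. The following is a minimal sketch: has_sudachi_plugin is a hypothetical helper name, and the plugin name analysis-sudachi is an assumption taken from the zip file name used above.

```shell
# Succeeds if the plugin list on standard input mentions the Sudachi plugin.
# "analysis-sudachi" is assumed from the zip file name; adjust if your build differs.
has_sudachi_plugin() {
  grep -q 'analysis-sudachi'
}

# Usage (not run here):
#   curl -s 'http://localhost:9200/_cat/plugins' | has_sudachi_plugin \
#     && echo "Sudachi plugin loaded"
```

If the check fails, the Elasticsearch log is the first place to look, since a plugin built for a different version prevents the node from starting.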

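Modifying the Mastodon source code

Change the index to use sudachi_tokenizer so that Japanese text can be searched, and widen the search scope to cover every post on the server. In ActivityPub::Activity::Create, only posts related to local accounts are normally recorded; the patch changes this to record everything that arrives, and crudely pushes each status into the index with redis.sadd("chewy:queue:StatusesIndex", @status.id).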

diff --git a/app/chewy/statuses_index.rb b/app/chewy/statuses_index.rb
index 6dd4fb18b..451555a31 100644
--- a/app/chewy/statuses_index.rb
+++ b/app/chewy/statuses_index.rb
@@ -4,6 +4,13 @@ class StatusesIndex < Chewy::Index
   include FormattingHelper
 
   settings index: { refresh_interval: '30s' }, analysis: {
+    tokenizer: {
+      sudachi_tokenizer: {
+        type: 'sudachi_tokenizer',
+        discard_punctuation: true,
+        ignore_unavailable: true,
+      },
+    },
     filter: {
       english_stop: {
         type: 'stop',
@@ -20,12 +27,16 @@ class StatusesIndex < Chewy::Index
     },
     analyzer: {
       content: {
-        tokenizer: 'uax_url_email',
+        tokenizer: 'sudachi_tokenizer',
+        type: 'custom',
         filter: %w(
-          english_possessive_stemmer
           lowercase
-          asciifolding
           cjk_width
+          sudachi_part_of_speech
+          sudachi_ja_stop
+          sudachi_baseform
+          english_possessive_stemmer
+          asciifolding
           english_stop
           english_stemmer
         ),
diff --git a/app/lib/activitypub/activity/create.rb b/app/lib/activitypub/activity/create.rb
index b15e66ca2..21b55cfba 100644
--- a/app/lib/activitypub/activity/create.rb
+++ b/app/lib/activitypub/activity/create.rb
@@ -85,6 +85,8 @@ class ActivityPub::Activity::Create < ActivityPub::Activity
       attach_tags(@status)
     end
 
+    redis.sadd("chewy:queue:StatusesIndex", @status.id)
+
     resolve_thread(@status)
     fetch_replies(@status)
     distribute
@@ -388,7 +390,7 @@ class ActivityPub::Activity::Create < ActivityPub::Activity
 
   def related_to_local_activity?
     fetch? || followed_by_local_accounts? || requested_through_relay? ||
-      responds_to_followed_account? || addresses_local_accounts?
+      responds_to_followed_account? || addresses_local_accounts? || true
   end
 
   def responds_to_followed_account?
diff --git a/app/lib/importer/statuses_index_importer.rb b/app/lib/importer/statuses_index_importer.rb
index 5b5153d5c..dea6bb2d9 100644
--- a/app/lib/importer/statuses_index_importer.rb
+++ b/app/lib/importer/statuses_index_importer.rb
@@ -25,7 +25,7 @@ class Importer::StatusesIndexImporter < Importer::BaseImporter
           # on the results of the filter, so this filtering happens here instead
           bulk.map! do |entry|
             new_entry = begin
-              if entry[:index] && entry.dig(:index, :data, 'searchable_by').blank?
+              if false && entry[:index] && entry.dig(:index, :data, 'searchable_by').blank?
                 { delete: entry[:index].except(:data) }
               else
                 entry
@@ -59,6 +59,7 @@ class Importer::StatusesIndexImporter < Importer::BaseImporter
 
   def scopes
     [
+      remote_statuses_scope,
       local_statuses_scope,
       local_mentions_scope,
       local_favourites_scope,
@@ -86,4 +87,8 @@ class Importer::StatusesIndexImporter < Importer::BaseImporter
   def local_statuses_scope
     Status.local.select('"statuses"."id", COALESCE("statuses"."reblog_of_id", "statuses"."id") AS status_id')
   end
+  
+  def remote_statuses_scope
+    Status.remote.select('"statuses"."id", COALESCE("statuses"."reblog_of_id", "statuses"."id") AS status_id')
+  end
 end
diff --git a/app/lib/search_query_transformer.rb b/app/lib/search_query_transformer.rb
index aef05e9d9..25d4200ad 100644
--- a/app/lib/search_query_transformer.rb
+++ b/app/lib/search_query_transformer.rb
@@ -25,7 +25,8 @@ class SearchQueryTransformer < Parslet::Transform
     def clause_to_query(clause)
       case clause
       when TermClause
-        { multi_match: { type: 'most_fields', query: clause.term, fields: ['text', 'text.stemmed'] } }
+        #{ multi_match: { type: 'most_fields', query: clause.term, fields: ['text', 'text.stemmed'] } }
+        { match_phrase: { 'text.stemmed': { query: clause.term } } }
       when PhraseClause
         { match_phrase: { text: { query: clause.phrase } } }
       else
diff --git a/app/services/search_service.rb b/app/services/search_service.rb
index 1a76cbb38..ce37c02bc 100644
--- a/app/services/search_service.rb
+++ b/app/services/search_service.rb
@@ -35,7 +35,9 @@ class SearchService < BaseService
   end
 
   def perform_statuses_search!
-    definition = parsed_query.apply(StatusesIndex.filter(term: { searchable_by: @account.id }))
+    #definition = parsed_query.apply(StatusesIndex.filter(term: { searchable_by: @account.id }))
+    definition = parsed_query.apply(StatusesIndex).order(id: :desc)
+
 
     if @options[:account_id].present?
       definition = definition.filter(term: { account_id: @options[:account_id] })
@@ -118,7 +120,7 @@ class SearchService < BaseService
       blocking: Account.blocking_map(account_ids, account.id),
       blocked_by: Account.blocked_by_map(account_ids, account.id),
       muting: Account.muting_map(account_ids, account.id),
-      following: Account.following_map(account_ids, account.id),
+      #following: Account.following_map(account_ids, account.id),
       domain_blocking_by_domain: Account.domain_blocking_map_by_domain(domains, account.id),
     }
   end

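Save the diff above as modify.patch, apply it from the Mastodon source directory, and restart the services.

$ cd live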
$ patch -p1 < modify.patch

# systemctl restart mastodon-sidekiq
# systemctl reload mastodon-web

Generating the initial index

If everything up to this point has gone well, generating the initial index will succeed.
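
The index itself is built with Mastodon's tootctl command. A minimal sketch follows. It assumes that the Mastodon checkout is the live directory used above and that it is run as the Mastodon user. rebuild_search_index is a hypothetical wrapper name.

```shell
# Rebuild the Elasticsearch indexes from the Mastodon source directory.
# The directory defaults to ~/live (an assumption based on the "cd live" step above).
rebuild_search_index() {
  mastodon_dir=${1:-"$HOME/live"}
  # Run in a subshell so the caller's working directory is left unchanged.
  ( cd "$mastodon_dir" && RAILS_ENV=production bin/tootctl search deploy )
}

# Usage (not run here):
#   rebuild_search_index                    # uses ~/live
#   rebuild_search_index /path/to/mastodon  # explicit checkout location
```

Because the patched importer also includes remote statuses, the first run is likely to take noticeably longer than on an unpatched server.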