[ElasticSearch] Java API: Scrolling Search (Scroll API)

2025/2/9 4:11:49 Tags: Elasticsearch, scroll query

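An ordinary search request returns a single "page" of results, and the size of that page is capped. The Scroll API lets us retrieve a large number of results (even all of them) from a single search request. We run an initial search and then keep pulling batches of results from Elasticsearch until none are left, much like a cursor in a traditional database.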

The Scroll API is not intended for real-time user requests, but rather for processing large amounts of data. The results returned from a scroll request reflect the state of the index at the time the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update, or delete) only affect later search requests.
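
To make the cursor and snapshot behaviour concrete, here is a minimal in-memory sketch. It is a toy model, not the Elasticsearch API: the class `SnapshotCursor` and its methods are hypothetical names invented for this illustration, and a real scroll keeps its snapshot on the server rather than copying documents into a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy in-memory model of a scroll cursor; NOT the Elasticsearch API. */
class SnapshotCursor {
    private final List<String> snapshot; // frozen copy taken when the "search" starts
    private final int batchSize;
    private int position = 0;

    SnapshotCursor(List<String> liveIndex, int batchSize) {
        // Copying here means later changes to liveIndex are invisible to this cursor.
        this.snapshot = new ArrayList<>(liveIndex);
        this.batchSize = batchSize;
    }

    /** Returns the next batch, or an empty list once everything has been read. */
    List<String> nextBatch() {
        if (position >= snapshot.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(position + batchSize, snapshot.size());
        List<String> batch = new ArrayList<>(snapshot.subList(position, end));
        position = end;
        return batch;
    }

    /** Pulls batches until an empty one comes back and returns every document seen. */
    static List<String> drain(SnapshotCursor cursor) {
        List<String> all = new ArrayList<>();
        List<String> batch;
        while (!(batch = cursor.nextBatch()).isEmpty()) {
            all.addAll(batch);
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> liveIndex =
                new ArrayList<>(Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5"));
        SnapshotCursor cursor = new SnapshotCursor(liveIndex, 2);
        liveIndex.add("doc6");    // indexed after the cursor was opened
        liveIndex.remove("doc1"); // deleted after the cursor was opened
        // prints [doc1, doc2, doc3, doc4, doc5]
        System.out.println(drain(cursor));
    }
}
```

The later add and remove do not show up in the drained results, which is the same "snapshot in time" behaviour described above.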

1. Ordinary request

Suppose we want to return a large amount of data in one go. The code below requests 58000 documents in a single search:

       import org.elasticsearch.action.search.SearchRequestBuilder;
       import org.elasticsearch.action.search.SearchResponse;
       import org.elasticsearch.client.Client;
       import org.elasticsearch.search.SearchHit;

       /**
        * Ordinary search
        * @param client
        */
       public static void search(Client client) {
           String index = "simple-index";
           String type = "simple-type";
           // Search conditions
           SearchRequestBuilder searchRequestBuilder = client.prepareSearch();
           searchRequestBuilder.setIndices(index);
           searchRequestBuilder.setTypes(type);
           // Ask for 58000 hits in a single page
           searchRequestBuilder.setSize(58000);
           // Execute
           SearchResponse searchResponse = searchRequestBuilder.get();
           // Search results
           SearchHit[] searchHits = searchResponse.getHits().getHits();
           for (SearchHit searchHit : searchHits) {
               String source = searchHit.getSource().toString();
               logger.info("--------- search source {}", source);
           } // for
       }

Result:

    Caused by: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [58000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]
    at org.elasticsearch.search.internal.DefaultSearchContext.preProcess(DefaultSearchContext.java:212)
    at org.elasticsearch.search.query.QueryPhase.preProcess(QueryPhase.java:103)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:676)
    at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:620)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:371)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    ... 3 more

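The message tells us that, for a search request, from + size must not exceed 10000, the limit controlled by the index-level setting index.max_result_window. Our request is over that limit, so it fails, and the exception suggests using the Scroll API for large result sets.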
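
The arithmetic behind the error message can be sketched in a few lines. This is only an illustration of the from + size check and of how many batches a scroll would need; `ResultWindow` is a hypothetical helper written for this post, and the batch size of 1000 is an assumed value, not something taken from the code in this article.

```java
/** Arithmetic behind the "Result window is too large" error; a toy check, not Elasticsearch code. */
class ResultWindow {
    // Limit quoted in the error message above.
    static final int DEFAULT_MAX_RESULT_WINDOW = 10000;

    /** True when a plain search with this from/size would be accepted. */
    static boolean fits(int from, int size, int maxResultWindow) {
        return (long) from + size <= maxResultWindow;
    }

    /** Number of non-empty batches needed to read totalHits documents, batchSize at a time. */
    static long batchesNeeded(long totalHits, int batchSize) {
        return (totalHits + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(fits(0, 58000, DEFAULT_MAX_RESULT_WINDOW)); // false
        System.out.println(fits(9990, 10, DEFAULT_MAX_RESULT_WINDOW)); // true
        System.out.println(batchesNeeded(58000, 1000));                // 58
    }
}
```

So the 58000 documents that cannot be fetched as one page could be read as 58 batches of 1000 through a scroll.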

2. Requesting with the Scroll API

To use scrolling, the initial search request should specify the scroll parameter, which tells Elasticsearch how long it should keep the search context alive (the scroll time).

    searchRequestBuilder.setScroll(new TimeValue(60000));

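The code below sets the search conditions and the scroll properties, such as the scroll keep-alive (via setScroll(); the TimeValue argument is in milliseconds). We obtain the scroll ID with the getScrollId() method of the SearchResponse object. The scroll ID is used in the next request.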

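       import org.elasticsearch.action.search.SearchRequestBuilder;
       import org.elasticsearch.action.search.SearchResponse;
       import org.elasticsearch.client.Client;
       import org.elasticsearch.common.unit.TimeValue;
       import org.elasticsearch.search.SearchHit;

       /**
        * Search using scroll
        * @param client
        * @return the scroll ID to use in the next request
        */
       public static String searchByScroll(Client client) {
           String index = "simple-index";
           String type = "simple-type";
           // Search conditions
           SearchRequestBuilder searchRequestBuilder = client.prepareSearch();
           searchRequestBuilder.setIndices(index);
           searchRequestBuilder.setTypes(type);
           // Keep the search context alive for 30 seconds
           searchRequestBuilder.setScroll(new TimeValue(30000));
           // Execute
           SearchResponse searchResponse = searchRequestBuilder.get();
           String scrollId = searchResponse.getScrollId();
           logger.info("--------- searchByScroll scrollID {}", scrollId);
           // First batch of results
           SearchHit[] searchHits = searchResponse.getHits().getHits();
           for (SearchHit searchHit : searchHits) {
               String source = searchHit.getSource().toString();
               logger.info("--------- searchByScroll source {}", source);
           } // for
           return scrollId;
       }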

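The scroll ID returned by the request above can be passed to the scroll API to retrieve the next batch of results. This request does not need to specify the index and type again; they were already given in the original search request.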

Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty.

       import org.elasticsearch.action.search.SearchResponse;
       import org.elasticsearch.action.search.SearchScrollRequestBuilder;
       import org.elasticsearch.client.Client;
       import org.elasticsearch.common.unit.TimeValue;
       import org.elasticsearch.search.SearchHit;

       /**
        * Fetch documents by scroll ID
        * @param client
        * @param scrollId
        */
       public static void searchByScrollId(Client client, String scrollId){
           TimeValue timeValue = new TimeValue(30000);
           SearchScrollRequestBuilder searchScrollRequestBuilder;
           SearchResponse response;
           while (true) {
               logger.info("--------- searchByScrollId scrollID {}", scrollId);
               searchScrollRequestBuilder = client.prepareSearchScroll(scrollId);
               // Reset the scroll keep-alive
               searchScrollRequestBuilder.setScroll(timeValue);
               // Execute
               response = searchScrollRequestBuilder.get();
               // Each call returns the next batch; stop once the hits array is empty
               SearchHit[] searchHits = response.getHits().getHits();
               if (searchHits.length == 0) {
                   break;
               } // if
               // Results of this batch
               for (SearchHit searchHit : searchHits) {
                   String source = searchHit.getSource().toString();
                   logger.info("--------- searchByScrollId source {}", source);
               } // for
               // Only the most recent scroll ID should be used
               scrollId = response.getScrollId();
           } // while
       }

Note:

The initial search request and each subsequent scroll request returns a new _scroll_id; only the most recent _scroll_id should be used.

In my runs, every subsequent scroll request returned the same scroll ID, so I don't fully understand the note above. If you do, please let me know, thanks.


If the scroll time has expired and we keep searching with that scroll ID, an error is raised:

    Caused by: SearchContextMissingException[No search context found for id [2861]]
    at org.elasticsearch.search.SearchService.findContext(SearchService.java:613)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:403)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:384)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:381)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
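
The timing rule can be illustrated with a toy timer. Because every scroll request passes a scroll time again, the keep-alive only has to cover the gap between two consecutive requests, not the whole export. `KeepAliveModel` below is a hypothetical class that models just this rule with explicit timestamps; it says nothing about how Elasticsearch tracks contexts internally.

```java
/** Toy model of a scroll keep-alive timer; illustrates the timing rule only. */
class KeepAliveModel {
    private long expiresAtMillis;

    /** Opens the context at time nowMillis with the given keep-alive. */
    KeepAliveModel(long nowMillis, long keepAliveMillis) {
        this.expiresAtMillis = nowMillis + keepAliveMillis;
    }

    /** Simulates one scroll request; returns false if the context had already expired. */
    boolean scroll(long nowMillis, long keepAliveMillis) {
        if (nowMillis > expiresAtMillis) {
            return false; // corresponds to the SearchContextMissingException case
        }
        expiresAtMillis = nowMillis + keepAliveMillis; // each request sets a new expiry
        return true;
    }

    public static void main(String[] args) {
        KeepAliveModel context = new KeepAliveModel(0, 30000);
        System.out.println(context.scroll(20000, 30000)); // true: within 30 s of opening
        System.out.println(context.scroll(45000, 30000)); // true: within 30 s of the previous request
        System.out.println(context.scroll(80000, 30000)); // false: 35 s after the previous request
    }
}
```

If processing one batch can take longer than the scroll time, the next request fails as in the stack trace above, so the keep-alive should be chosen with the slowest batch in mind.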


3. Clearing the scroll ID

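Although the search context is removed automatically once the scroll keep-alive has expired, keeping scrolls open is costly, so as soon as we no longer need a scroll we should clear it with the Clear-Scroll API.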

    import java.util.List;
    import org.elasticsearch.action.search.ClearScrollRequestBuilder;
    import org.elasticsearch.action.search.ClearScrollResponse;
    import org.elasticsearch.client.Client;

    /**
     * Clear several scroll IDs at once
     * @param client
     * @param scrollIdList
     * @return true if the scrolls were cleared
     */
    public static boolean clearScroll(Client client, List<String> scrollIdList){
        ClearScrollRequestBuilder clearScrollRequestBuilder = client.prepareClearScroll();
        clearScrollRequestBuilder.setScrollIds(scrollIdList);
        ClearScrollResponse response = clearScrollRequestBuilder.get();
        return response.isSucceeded();
    }

    /**
     * Clear a single scroll ID
     * @param client
     * @param scrollId
     * @return true if the scroll was cleared
     */
    public static boolean clearScroll(Client client, String scrollId){
        ClearScrollRequestBuilder clearScrollRequestBuilder = client.prepareClearScroll();
        clearScrollRequestBuilder.addScrollId(scrollId);
        ClearScrollResponse response = clearScrollRequestBuilder.get();
        return response.isSucceeded();
    }
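
One way to make sure the clear call is not skipped when processing fails halfway is to tie it to try-with-resources. The sketch below is only a suggestion and not part of the Elasticsearch client: `ScrollSession` is a hypothetical wrapper, and the actual clear call is stubbed out by recording the ID in a list so that the example runs without a cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper that guarantees a scroll ID is released; the clear call is stubbed out. */
class ScrollSession implements AutoCloseable {
    // Records released IDs so the demo can show that close() ran.
    static final List<String> cleared = new ArrayList<>();
    private final String scrollId;

    ScrollSession(String scrollId) {
        this.scrollId = scrollId;
    }

    String scrollId() {
        return scrollId;
    }

    @Override
    public void close() {
        // In real code this is where clearScroll(client, scrollId) would be called.
        cleared.add(scrollId);
    }

    public static void main(String[] args) {
        try (ScrollSession session = new ScrollSession("demo-scroll-id")) {
            throw new IllegalStateException("processing failed midway");
        } catch (IllegalStateException e) {
            // close() has already run by the time the catch block executes.
            System.out.println("cleared after failure: " + cleared);
        }
    }
}
```

A plain try/finally around the scroll loop that calls clearScroll(client, scrollId) in the finally block achieves the same effect with less ceremony.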

 

4. References

https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-request-scroll.html

http://www.jianshu.com/p/14aa8b09c789
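
5. Notes

This code is based on ElasticSearch 2.4.1.
---------------------
Author: SunnyYoona
Source: CSDN
Original: https://blog.csdn.net/sunnyyoona/article/details/52810397
Copyright notice: this is an original article by the blogger; please include a link to the post when reposting.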

