[源碼解析] GroupReduce，GroupCombine 和 Flink SQL group by

by admin · Published 2020-11-16 · Updated 2020-11-16

[源碼解析] GroupReduce，GroupCombine和Flink SQL group by

[源碼解析] GroupReduce，GroupCombine和Flink SQL group by
- 0x00 摘要
- 0x01 緣由
- 0x02 概念
  - 2.1 GroupReduce
  - 2.2 GroupCombine
  - 2.3 例子
- 0x03 代碼
- 0x04 Flink SQL內部翻譯
- 0x05 JobGraph
- 0x06 Runtime
  - 6.1 ChainedFlatMapDriver
  - 6.2 GroupReduceCombineDriver
  - 6.3 GroupReduceDriver & ChainedFlatMapDriver
- 0x07 總結
- 0x08 參考

0x00 摘要

本文從源碼和實例入手，為大家解析 Flink 中 GroupReduce 和 GroupCombine 的用途。也涉及到了 Flink SQL group by 的內部實現。

0x01 緣由

在前文[源碼解析] Flink的Groupby和reduce究竟做了什麼中，我們剖析了Group和reduce都做了些什麼，也對combine有了一些了解。但是總感覺意猶未盡，因為在Flink還提出了若干新算子，比如GroupReduce和GroupCombine。這幾個算子不搞定，總覺得如鯁在喉，但沒有找到一個良好的例子來進行剖析說明。

本文是筆者在探究Flink SQL UDF問題的一個副產品。起初是為了調試一段sql代碼，結果發現Flink本身給出了一個GroupReduce和GroupCombine使用的完美例子。於是就拿出來和大家共享，一起分析看看究竟如何使用這兩個算子。

請注意：這個例子是Flink SQL，所以本文中將涉及Flink SQL goup by內部實現的知識。

0x02 概念

Flink官方對於這兩個算子的使用說明如下：

2.1 GroupReduce

GroupReduce算子應用在一個已經分組了的DataSet上，其會對每個分組都調用到用戶定義的group-reduce函數。它與Reduce的區別在於用戶定義的函數會立即獲得整個組。

Flink將在組的所有元素上使用Iterable調用用戶自定義函數，並且可以返回任意數量的結果元素。

2.2 GroupCombine

GroupCombine轉換是可組合GroupReduceFunction中組合步驟的通用形式。它在某種意義上被概括為允許將輸入類型 I 組合到任意輸出類型O。與此相對的是，GroupReduce中的組合步驟僅允許從輸入類型 I 到輸出類型 I 的組合。這是因為GroupReduceFunction的 “reduce步驟” 期望自己的輸入類型為 I。

在一些應用中，我們期望在執行附加變換（例如，減小數據大小）之前將DataSet組合成中間格式。這可以通過CombineGroup轉換能以非常低的成本實現。

注意：分組數據集上的GroupCombine在內存中使用貪婪策略執行，該策略可能不會一次處理所有數據，而是以多個步驟處理。它也可以在各個分區上執行，而無需像GroupReduce轉換那樣進行數據交換。這可能會導致輸出的是部分結果，所以GroupCombine是不能替代GroupReduce操作的，儘管它們的操作內容可能看起來都一樣。

2.3 例子

是不是有點暈？還是直接讓代碼來說話吧。以下官方示例演示了如何將CombineGroup和GroupReduce轉換用於WordCount實現。即通過combine操作先對單詞數目進行初步排序，然後通過reduceGroup對combine產生的結果進行最終排序。因為combine進行了初步排序，所以在算子之間傳輸的數據量就少多了。

DataSet<String> input = [..] // The words received as input

// 這裏通過combine操作先對單詞數目進行初步排序，其優勢在於用戶定義的combine函數只調用一次，因為runtime已經把輸入數據一次性都提供給了自定義函數。  
DataSet<Tuple2<String, Integer>> combinedWords = input
  .groupBy(0) // group identical words
  .combineGroup(new GroupCombineFunction<String, Tuple2<String, Integer>() {

    public void combine(Iterable<String> words, Collector<Tuple2<String, Integer>>) { // combine
        String key = null;
        int count = 0;

        for (String word : words) {
            key = word;
            count++;
        }
        // emit tuple with word and count
        out.collect(new Tuple2(key, count));
    }
});

// 這裏對combine的結果進行第二次排序，其優勢在於用戶定義的reduce函數只調用一次，因為runtime已經把輸入數據一次性都提供給了自定義函數。  
DataSet<Tuple2<String, Integer>> output = combinedWords
  .groupBy(0)                              // group by words again
  .reduceGroup(new GroupReduceFunction() { // group reduce with full data exchange

    public void reduce(Iterable<Tuple2<String, Integer>>, Collector<Tuple2<String, Integer>>) {
        String key = null;
        int count = 0;

        for (Tuple2<String, Integer> word : words) {
            key = word;
            count++;
        }
        // emit tuple with word and count
        out.collect(new Tuple2(key, count));
    }
});

看到這裏，有的兄弟已經明白了，這和mapPartition很類似啊，都是runtime做了大量工作。為了讓大家這兩個算子的使用情形有深刻的認識，我們再通過一個sql的例子，向大家展示Flink內部是怎麼應用這兩個算子的，也能看出來他們的強大之處。

0x03 代碼

下面代碼主要參考自 flink 使用問題匯總。我們可以看到這裏通過groupby進行了聚合操作。其中collect方法，類似於mysql的group_concat。

public class UdfExample {
    public static class MapToString extends ScalarFunction {

        public String eval(Map<String, Integer> map) {
            if(map==null || map.size()==0) {
                return "";
            }
            StringBuffer sb=new StringBuffer();
            for(Map.Entry<String, Integer> entity : map.entrySet()) {
                sb.append(entity.getKey()+",");
            }
            String result=sb.toString();
            return result.substring(0, result.length()-1);
        }
    }

    public static void main(String[] args) throws Exception {
        MemSourceBatchOp src = new MemSourceBatchOp(new Object[][]{
                new Object[]{"1", "a", 1L},
                new Object[]{"2", "b33", 2L},
                new Object[]{"2", "CCC", 2L},
                new Object[]{"2", "xyz", 2L},
                new Object[]{"1", "u", 1L}
        }, new String[]{"f0", "f1", "f2"});

        BatchTableEnvironment environment = MLEnvironmentFactory.getDefault().getBatchTableEnvironment();
        Table table = environment.fromDataSet(src.getDataSet());
        environment.registerTable("myTable", table);
        BatchOperator.registerFunction("MapToString",  new MapToString());
        BatchOperator.sqlQuery("select f0, mapToString(collect(f1)) as type from myTable group by f0").print();
    }
}

程序輸出是

f0|type
--|----
1|a,u
2|CCC,b33,xyz

0x04 Flink SQL內部翻譯

這個SQL語句的重點是group by。這個是程序猿經常使用的操作。但是大家有沒有想過這個group by在真實運行起來時候是怎麼操作的呢？針對大數據環境有沒有做了什麼優化呢？其實，Flink正是使用了GroupReduce和GroupCombine來實現並且優化了group by的功能。優化之處在於：

GroupReduce和GroupCombine的函數調用次數要遠低於正常的reduce算子，如果reduce操作中涉及到頻繁創建額外的對象或者外部資源操作，則會相當省時間。
因為combine進行了初步排序，所以在算子之間傳輸的數據量就少多了。

SQL生成Flink的過程十分錯綜複雜，所以我們只能找最關鍵處。其是在 DataSetAggregate.translateToPlan 完成的。我們可以看到，對於SQL語句 “select f0, mapToString(collect(f1)) as type from myTable group by f0”，Flink系統把它翻譯成如下階段，即

pre-aggregation ：排序 + combine；
final aggregation ：排序 + reduce；

從之前的文章我們可以知道，groupBy這個其實不是一個算子，它只是排序過程中的一個輔助步驟而已，所以我們重點還是要看combineGroup和reduceGroup。這恰恰是我們想要的完美例子。

input ----> (groupBy + combineGroup) ----> (groupBy + reduceGroup) ----> output

SQL生成的Scala代碼如下，其中 combineGroup在後續中將生成GroupCombineOperator，reduceGroup將生成GroupReduceOperator。

  override def translateToPlan(
      tableEnv: BatchTableEnvImpl,
      queryConfig: BatchQueryConfig): DataSet[Row] = {

    if (grouping.length > 0) {
      // grouped aggregation
      ...... 
      if (preAgg.isDefined) { // 我們的例子是在這裏
        inputDS          
          // pre-aggregation
          .groupBy(grouping: _*)
          .combineGroup(preAgg.get) // 將生成GroupCombineOperator算子
          .returns(preAggType.get)
          .name(aggOpName)
          // final aggregation
          .groupBy(grouping.indices: _*) //將生成GroupReduceOperator算子。
          .reduceGroup(finalAgg.right.get)
          .returns(rowTypeInfo)
          .name(aggOpName)
      } else {
        ......
      }
    }
    else {
      ......
    }
  }
}

// 程序變量打印如下
this = {DataSetAggregate@5207} "Aggregate(groupBy: (f0), select: (f0, COLLECT(f1) AS $f1))"
 cluster = {RelOptCluster@5220}

0x05 JobGraph

LocalExecutor.execute中會生成JobGraph。JobGraph是提交給 JobManager 的數據結構，是唯一被Flink的數據流引擎所識別的表述作業的數據結構，也正是這一共同的抽象體現了流處理和批處理在運行時的統一。

在生成JobGraph時候，系統得到如下JobVertex。

jobGraph = {JobGraph@5652} "JobGraph(jobId: 6aae8b5e5ad32f588136bef26f8b65f6)"
 taskVertices = {LinkedHashMap@5655}  size = 4

{JobVertexID@5677} "c625209bb7fb9a098807551840aeaa99" -> {InputOutputFormatVertex@5678} "CHAIN DataSource (at initializeDataSource(MemSourceBatchOp.java:98) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (select: (f0, f1)) (org.apache.flink.runtime.operators.DataSourceTask)"

{JobVertexID@5679} "b56ace4acd7a2f69ea110a9f262ff80a" -> {JobVertex@5680} "CHAIN GroupReduce (groupBy: (f0), select: (f0, COLLECT(f1) AS $f1)) -> FlatMap (select: (f0, mapToString($f1) AS type)) -> Map (Map at linkFrom(MapBatchOp.java:35)) (org.apache.flink.runtime.operators.BatchTask)"
 
{JobVertexID@5681} "3f5e2a0f700421d80ce85e02a6d9db73" -> {InputOutputFormatVertex@5682} "DataSink (collect()) (org.apache.flink.runtime.operators.DataSinkTask)"
 
{JobVertexID@5683} "ad29dc5b2e0a39ad2cd1d164b6f859f7" -> {JobVertex@5684} "GroupCombine (groupBy: (f0), select: (f0, COLLECT(f1) AS $f1)) (org.apache.flink.runtime.operators.BatchTask)"

我們可以看到，在JobGraph中就生成了對應的兩個算子。其中這裏的FlatMap就是用戶的UDF函數MapToString的映射生成。

GroupCombine (groupBy: (f0), select: (f0, COLLECT(f1) AS $f1))  
  
CHAIN GroupReduce (groupBy: (f0), select: (f0, COLLECT(f1) AS $f1)) -> FlatMap (select: (f0, mapToString($f1) AS type)) -> Map

0x06 Runtime

最後，讓我們看看runtime會如何處理這兩個算子。

6.1 ChainedFlatMapDriver

首先，Flink會在ChainedFlatMapDriver.collect中對record進行處理，這是從Table中提取數據所必須經歷的，與後續的group by關係不大。

@Override
public void collect(IT record) {
   try {
      this.numRecordsIn.inc();
      this.mapper.flatMap(record, this.outputCollector);
   } catch (Exception ex) {
      throw new ExceptionInChainedStubException(this.taskName, ex);
   }
}

// 這裡能夠看出來，我們獲取了第一列記錄
record = {Row@9317} "1,a,1"
 fields = {Object[3]@9330} 
this.taskName = "FlatMap (select: (f0, f1))"

// 程序堆棧打印如下
collect:80, ChainedFlatMapDriver (org.apache.flink.runtime.operators.chaining)
collect:35, CountingCollector (org.apache.flink.runtime.operators.util.metrics)
invoke:196, DataSourceTask (org.apache.flink.runtime.operators)
doRun:707, Task (org.apache.flink.runtime.taskmanager)
run:532, Task (org.apache.flink.runtime.taskmanager)
run:748, Thread (java.lang)

6.2 GroupReduceCombineDriver

其次，GroupReduceCombineDriver.run()中會進行combine操作。

會通過this.sorter.write(value)把數據寫到排序緩衝區。
會通過sortAndCombineAndRetryWrite(value)進行實際的排序，合併。

因為是系統實現，所以Combine的用戶自定義函數就是由Table API提供的，比如org.apache.flink.table.functions.aggfunctions.CollectAccumulator.accumulate 。

@Override
public void run() throws Exception {
   final MutableObjectIterator<IN> in = this.taskContext.getInput(0);
   final TypeSerializer<IN> serializer = this.serializer;

   if (objectReuseEnabled) {
    .....
   }
   else {
      IN value;
      while (running && (value = in.next()) != null) {
         // try writing to the sorter first
         if (this.sorter.write(value)) {
            continue;
         }

         // do the actual sorting, combining, and data writing
         sortAndCombineAndRetryWrite(value);
      }
   }

   // sort, combine, and send the final batch
   if (running) {
      sortAndCombine();
   }
}

// 程序變量如下
value = {Row@9494} "1,a"
 fields = {Object[2]@9503}

sortAndCombine是具體排序/合併的過程。

排序是通過 org.apache.flink.runtime.operators.sort.QuickSort 完成的。
合併是通過 org.apache.flink.table.functions.aggfunctions.CollectAccumulator.accumulate 完成的。
給下游是由 org.apache.flink.table.runtime.aggregate.DataSetPreAggFunction.combine 調用 out.collect(output) 完成的。

private void sortAndCombine() throws Exception {
   final InMemorySorter<IN> sorter = this.sorter;
   // 這裏進行實際的排序
   this.sortAlgo.sort(sorter);
   final GroupCombineFunction<IN, OUT> combiner = this.combiner;
   final Collector<OUT> output = this.output;

   // iterate over key groups
   if (objectReuseEnabled) {
			......		
   } else {
      final NonReusingKeyGroupedIterator<IN> keyIter = 
            new NonReusingKeyGroupedIterator<IN>(sorter.getIterator(), this.groupingComparator);
      // 這裡是歸併操作
      while (this.running && keyIter.nextKey()) {
         // combiner.combiner 是用戶定義操作，runtime把某key對應的數據一次性傳給它
         combiner.combine(keyIter.getValues(), output);
      }
   }
}

具體調用棧如下：

accumulate:57, CollectAggFunction (org.apache.flink.table.functions.aggfunctions)
accumulate:-1, DataSetAggregatePrepareMapHelper$5
combine:71, DataSetPreAggFunction (org.apache.flink.table.runtime.aggregate)
sortAndCombine:213, GroupReduceCombineDriver (org.apache.flink.runtime.operators)
run:188, GroupReduceCombineDriver (org.apache.flink.runtime.operators)
run:504, BatchTask (org.apache.flink.runtime.operators)
invoke:369, BatchTask (org.apache.flink.runtime.operators)
doRun:707, Task (org.apache.flink.runtime.taskmanager)
run:532, Task (org.apache.flink.runtime.taskmanager)
run:748, Thread (java.lang)

6.3 GroupReduceDriver & ChainedFlatMapDriver

這兩個放在一起，是因為他們組成了Operator Chain。

GroupReduceDriver.run中完成了reduce。具體reduce 操作是在 org.apache.flink.table.runtime.aggregate.DataSetFinalAggFunction.reduce 完成的，然後在其中直接發送給下游 out.collect(output)。

@Override
public void run() throws Exception {
   // cache references on the stack
   final GroupReduceFunction<IT, OT> stub = this.taskContext.getStub();
 
   if (objectReuseEnabled) {
       ......	
   }
   else {
      final NonReusingKeyGroupedIterator<IT> iter = new NonReusingKeyGroupedIterator<IT>(this.input, this.comparator);
      // run stub implementation
      while (this.running && iter.nextKey()) {
         // stub.reduce 是用戶定義操作，runtime把某key對應的數據一次性傳給它
         stub.reduce(iter.getValues(), output);
      }
   }
}

從前文我們可以，這裏已經配置成了Operator Chain，所以out.collect(output)會調用到CountingCollector。CountingCollector的成員變量collector已經配置成了ChainedFlatMapDriver。

public void collect(OUT record) {
   this.numRecordsOut.inc();
   this.collector.collect(record);
}

this.collector = {ChainedFlatMapDriver@9643} 
 mapper = {FlatMapRunner@9610} 
 config = {TaskConfig@9655} 
 taskName = "FlatMap (select: (f0, mapToString($f1) AS type))"

於是程序就調用到了 ChainedFlatMapDriver.collect。

public void collect(IT record) {
   try {
      this.numRecordsIn.inc();
      this.mapper.flatMap(record, this.outputCollector);
   } catch (Exception ex) {
      throw new ExceptionInChainedStubException(this.taskName, ex);
   }
}

最終調用棧如如下：

eval:21, UdfExample$MapToString (com.alibaba.alink)
flatMap:-1, DataSetCalcRule$14
flatMap:52, FlatMapRunner (org.apache.flink.table.runtime)
flatMap:31, FlatMapRunner (org.apache.flink.table.runtime)
collect:80, ChainedFlatMapDriver (org.apache.flink.runtime.operators.chaining)
collect:35, CountingCollector (org.apache.flink.runtime.operators.util.metrics)
reduce:80, DataSetFinalAggFunction (org.apache.flink.table.runtime.aggregate)
run:131, GroupReduceDriver (org.apache.flink.runtime.operators)
run:504, BatchTask (org.apache.flink.runtime.operators)
invoke:369, BatchTask (org.apache.flink.runtime.operators)
doRun:707, Task (org.apache.flink.runtime.taskmanager)
run:532, Task (org.apache.flink.runtime.taskmanager)
run:748, Thread (java.lang)

0x07 總結

由此我們可以看到：

GroupReduce，GroupCombine和mapPartition十分類似，都是從系統層面對算子進行優化，把循環操作放到用戶自定義函數來處理。
對於group by這個SQL語句，Flink將其翻譯成 GroupReduce + GroupCombine，採用兩階段優化的方式來完成了對大數據下的處理。

0x08 參考

flink 使用問題匯總

本站聲明:網站內容來源於博客園,如有侵權,請聯繫我們,我們將及時處理

【其他文章推薦】

※網頁設計一頭霧水該從何著手呢? 台北網頁設計公司幫您輕鬆架站!

※網頁設計公司推薦不同的風格，搶佔消費者視覺第一線

※想知道購買電動車哪裡補助最多?台中電動車補助資訊懶人包彙整

※南投搬家公司費用,距離,噸數怎麼算?達人教你簡易估價知識!

※教你寫出一流的銷售文案?

※超省錢租車方案

[源碼解析] GroupReduce，GroupCombine 和 Flink SQL group by

[源碼解析] GroupReduce，GroupCombine和Flink SQL group by

0x00 摘要

0x01 緣由

0x02 概念

2.1 GroupReduce

2.2 GroupCombine

2.3 例子

0x03 代碼

0x04 Flink SQL內部翻譯

0x05 JobGraph

0x06 Runtime

6.1 ChainedFlatMapDriver

6.2 GroupReduceCombineDriver

6.3 GroupReduceDriver & ChainedFlatMapDriver

0x07 總結

0x08 參考

You may also like...

近期文章

分類

彙整

[源碼解析] GroupReduce，GroupCombine 和 Flink SQL group by

[源碼解析] GroupReduce，GroupCombine和Flink SQL group by

0x00 摘要

0x01 緣由

0x02 概念

2.1 GroupReduce

2.2 GroupCombine

2.3 例子

0x03 代碼

0x04 Flink SQL內部翻譯

0x05 JobGraph

0x06 Runtime

6.1 ChainedFlatMapDriver

6.2 GroupReduceCombineDriver

6.3 GroupReduceDriver & ChainedFlatMapDriver

0x07 總結

0x08 參考

You may also like...

【聖誕節活動懶人包】聖誕節活動、聖誕市集、聖誕演唱會一次看-2022

Google緊急示警！數百萬用戶未更新WinRAR，你的電腦可能正門戶大開

米蘭國際摩托車展直擊，電動機車群雄並起

近期文章

標籤

分類

彙整