
Merging Multiple Parquet Files

I. parquet-tools
The first option to consider is parquet-tools. According to references 0 and 1:

parquet-tools version 1.8.2 supports merge command.

The tool is invoked through its merge command, giving the input files followed by the output file.
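As a rough sketch of how such an invocation is put together, the Python helper below builds and runs the merge command. It assumes that parquet-tools is launched through `hadoop jar` and that the jar is named parquet-tools-1.8.2.jar and sits in the working directory; the helper names `build_merge_command` and `run_merge` and all file paths are hypothetical.

```python
import subprocess


def build_merge_command(inputs, output, jar="parquet-tools-1.8.2.jar"):
    """Build the argv for `parquet-tools merge <input> [<input> ...] <output>`.

    The jar path is an assumption; point it at wherever the tool is installed.
    """
    if not inputs:
        raise ValueError("at least one input file is required")
    # Inputs come first, the destination file comes last.
    return ["hadoop", "jar", jar, "merge", *inputs, output]


def run_merge(inputs, output, jar="parquet-tools-1.8.2.jar"):
    """Run the merge; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(build_merge_command(inputs, output, jar), check=True)
```

Calling `build_merge_command(["a.parquet", "b.parquet"], "merged.parquet")` only assembles the argument list, which makes it easy to inspect the command before running it with `run_merge`.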

Reference 2, however, explicitly warns:

we strongly recommend *not* to use parquet-tools merge unless you really know what you’re doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing file and moves them unmodified into a new file – it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds – you lose parallelism because the files are big, but you have all the performance overhead of processing many small row groups.

Using parquet-tools to merge files is therefore not advisable.

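II. Spark
Alternatively, use Spark directly: read the small files into a DataFrame and write them back out to a single output path.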

Note: all-in-one-parquet-directory is a directory; the generated Parquet file is written inside it.
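A minimal PySpark sketch of this approach follows, under stated assumptions: `spark` is an existing SparkSession, and `coalesce(1)` is used as one way (not necessarily the only one) to force a single output file. The helper names `merge_parquet` and `find_part_files` are illustrative, and `find_part_files` only works when the output directory is on the local filesystem, not on HDFS.

```python
import glob
import os


def merge_parquet(spark, input_path, output_dir):
    """Read every Parquet file under input_path and rewrite it as one file.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    """
    df = spark.read.parquet(input_path)
    # coalesce(1) collapses the data into a single partition, so Spark
    # writes one part file; the rows are re-encoded while being written.
    df.coalesce(1).write.parquet(output_dir)


def find_part_files(output_dir):
    """Return the Parquet part files Spark wrote inside output_dir.

    Spark writes a directory holding part-*.parquet files plus a
    _SUCCESS marker, so the merged file has to be looked up inside it.
    """
    pattern = os.path.join(output_dir, "part-*.parquet")
    return sorted(glob.glob(pattern))
```

For example, `merge_parquet(spark, "/data/small-files", "all-in-one-parquet-directory")` followed by `find_part_files("all-in-one-parquet-directory")` would list the part file Spark produced; the input path here is made up. Because Spark decodes and re-encodes the rows, the row groups are rewritten instead of being copied over unmodified.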

References:
0. https://stackoverflow.com/questions/44400331/merge-two-parquet-files-in-hdfs
1. https://community.cloudera.com/t5/Support-Questions/Merging-many-Parquet-files/td-p/48892
2. https://community.cloudera.com/t5/Support-Questions/combine-small-parquet-files/td-p/33525/page/2

When reposting, please keep the attribution; unauthorized reposting will be pursued: 进城务工人员小梅 » Merging Multiple Parquet Files
