parquet-tools supports the merge command as of version 1.8.2; the example below uses version 1.10.1.
[root@emr-header-1 ~]# hadoop jar parquet-tools-1.10.1.jar merge -help
Merges multiple Parquet files into one. The command doesn't merge row groups,
just places one after the other. When used to merge many small files, the
resulting file will still contain small row groups, which usually leads to bad
query performance.
usage: parquet-merge [option...] <input> [<input> ...] <output>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the source parquet files/directory to be merged
<output> is the destination parquet file
We strongly recommend *not* using parquet-tools merge unless you really know what you're doing. It is known to cause pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing files and moves them unmodified into a new file; it does *not* merge the row groups from the different files. This can give you the worst of both worlds: you lose parallelism because the files are big, but you still have all the performance overhead of processing many small row groups.
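The difference can be illustrated with a toy model in plain Scala (no Spark; the file and row-group sizes are made up purely for illustration): each Parquet file is represented as a list of row-group sizes, tool-style merging just concatenates those lists, while rewriting through an engine re-buffers the rows into large groups.

```scala
// Toy model: a Parquet file is a list of row-group sizes (rows per group).
// All numbers here are illustrative, not from any real file.
object RowGroupDemo {
  type ParquetFile = List[Long]

  // parquet-tools merge: places the existing row groups one after the
  // other, so every small row group survives in the output file.
  def toolMerge(files: List[ParquetFile]): ParquetFile =
    files.flatten

  // Rewriting the data (e.g. via Spark) re-buffers the rows and emits
  // a small number of large row groups -- here, a single one.
  def rewrite(files: List[ParquetFile]): ParquetFile =
    List(files.flatten.sum)
}
```

With three small input files of 100, 200, and 300 rows, `toolMerge` yields one file that still has three tiny row groups, while `rewrite` yields one file with a single 600-row group.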
// Read the small files, collapse to a single partition, and rewrite them;
// unlike parquet-tools merge, this produces large row groups.
val parquetFileDF = spark.read.parquet("hdfs://emr-header-1.cluster-149038:9000/path_to_parquet_files/")
val merged = parquetFileDF.coalesce(1)
merged.write.parquet("hdfs://emr-header-1.cluster-149038:9000/path_to_merged_output/")  // example output path