Welcome to Ranga Reddy Blog!

Data really powers everything that we do

Spark Submit Command generator using Iceberg Catalog

2023-07-15

Ranga Reddy

Spark

Spark Utilities Iceberg
Spark Submit Command generator using different Iceberg Catalog(s)

This tool generates a spark-submit command configured for one or more Iceberg catalogs.

Catalog Name:

Spark Version:

Iceberg Version:

Scala Version:

Catalog Type:

Hive Metastore Uri:

Warehouse Path:

Rest Catalog Uri:

Example: spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path

Hadoop Catalog Log:

Glue Catalog Log:

Nessie Catalog Log:

Spark Submit Command
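As a rough illustration of the kind of command such a generator could produce, the sketch below assembles a spark-submit command for a Hadoop-type Iceberg catalog. The function name build_iceberg_submit, the default versions (Spark 3.3, Scala 2.12, Iceberg 1.3.0), and the application placeholder are assumptions chosen for the example, not output taken from the tool.

```shell
#!/usr/bin/env bash
# build_iceberg_submit is a hypothetical helper; the default versions are illustrative.
# Usage: build_iceberg_submit <catalog_name> <warehouse_path> [spark] [scala] [iceberg]
build_iceberg_submit() {
  local catalog="$1" warehouse="$2"
  local spark_version="${3:-3.3}" scala_version="${4:-2.12}" iceberg_version="${5:-1.3.0}"

  # First line of the command, ending with a line-continuation backslash.
  printf '%s \\\n' "spark-submit"
  # printf repeats the format once per argument, so each option gets its own line.
  printf '  %s \\\n' \
    "--packages org.apache.iceberg:iceberg-spark-runtime-${spark_version}_${scala_version}:${iceberg_version}" \
    "--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    "--conf spark.sql.catalog.${catalog}=org.apache.iceberg.spark.SparkCatalog" \
    "--conf spark.sql.catalog.${catalog}.type=hadoop" \
    "--conf spark.sql.catalog.${catalog}.warehouse=${warehouse}"
  # Placeholder for the application; replace it with a real jar or script.
  printf '  %s\n' "<application-jar-or-script>"
}

build_iceberg_submit hadoop_prod hdfs://nn:8020/warehouse/path
```

Other catalog types (Hive, REST, Glue, Nessie) need different catalog properties, so this sketch covers only the Hadoop case.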
Spark Streaming Kafka Batch Size Calculator

2023-04-25

Ranga Reddy

Spark

Spark Utilities Streaming
Spark Streaming Kafka Batch Size Calculator

This tool calculates the maximum number of Kafka messages a Spark Streaming job fetches per batch.

Spark Streaming Kafka Batch Size Configuration

Number of Kafka Partitions:

Batch Duration (seconds):

Max Rate Per Partition (records/sec):

Maximum Kafka Messages to Fetch per Batch = Number of Kafka Partitions × Batch Duration (seconds) × Max Rate Per Partition (records/sec)
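The calculation behind the three inputs can be sketched in a few lines of shell: the per-batch ceiling is the number of partitions multiplied by the batch duration and by the per-partition rate limit. The function name max_messages_per_batch and the sample values are assumptions for illustration.

```shell
#!/usr/bin/env bash
# max_messages_per_batch is a hypothetical helper name.
# Usage: max_messages_per_batch <kafka_partitions> <batch_seconds> <max_rate_per_partition>
max_messages_per_batch() {
  local partitions="$1" batch_seconds="$2" max_rate="$3"
  # Each partition contributes at most max_rate records per second of the batch.
  echo $(( partitions * batch_seconds * max_rate ))
}

# 10 partitions, 5-second batches, 1000 records/sec per partition
max_messages_per_batch 10 5 1000   # prints 50000
```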

Linux Useful Commands

2023-03-03

Ranga Reddy

Linux

Miscellaneous Linux

Linux Commands

Kill all processes in Linux

1. Finding the process id

Syntax:

ps aux | grep <process_name>

The aux options are as follows:

a = show processes for all users
u = display the process’s user/owner
x = also show processes not attached to a terminal

Example:

ps aux | grep java | grep -v grep

Sample Output:

livy       76620  0.1  0.7 4123864 162124 ?      Sl   Mar01   2:51 /usr/lib/jvm/java-11-openjdk-11.0.17.0.8-2.el7_9.x86_64/bin/java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/livy_livy-LIVY_SERVER-e75484af0f16d6e5808d3b5e09cec82b_pid76620.hprof -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh -Xmx67108864 -Dsun.security.krb5.disableReferrals=true -Djdk.tls.ephemeralDHKeySize=2048 -cp /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/livy2/jars/*:/var/run/cloudera-scm-agent/process/1546345579-livy-LIVY_SERVER/livy-conf:/var/run/cloudera-scm-agent/process/1546345579-livy-LIVY_SERVER/spark-conf:/var/run/cloudera-scm-agent/process/1546345579-livy-LIVY_SERVER/spark-conf/yarn-conf: org.apache.livy.server.LivyServer

2. Killing the process

There are two commands used to kill a process:

kill – Kill a process by ID
killall – Kill a process by name

1. Kill the process using ProcessID

Syntax:

kill -<SIGNAL> <PID>

Example:

kill -9 76620

2. Kill the process using ProcessName

Syntax:

killall -<SIGNAL> <ProcessName>

Example:

killall -9 chrome

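All steps in a single line

export PROCESS_NAME=java
# List processes, keep lines matching the name, drop the grep itself, extract the PID column, and kill each PID.
# xargs -r (GNU) skips running kill when no PID is found.
ps -ef | grep "$PROCESS_NAME" | grep -v grep | awk '{print $2}' | xargs -r kill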
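The examples above use signal 9 (SIGKILL), which gives a process no chance to clean up. A gentler variant, sketched below, sends SIGTERM first and falls back to SIGKILL only if the processes are still alive after a grace period. The function name stop_gracefully and the grace period values are assumptions for illustration.

```shell
#!/usr/bin/env bash
# stop_gracefully is a hypothetical helper name, not a standard command.
# Usage: stop_gracefully <grace_seconds> <pid>...
stop_gracefully() {
  local grace="$1"; shift
  [ "$#" -gt 0 ] || { echo "no PIDs given, nothing to do"; return 0; }

  # Ask politely first: SIGTERM lets the process run its shutdown logic.
  kill -TERM "$@" 2>/dev/null || true

  local waited=0 pid alive
  while [ "$waited" -lt "$grace" ]; do
    alive=0
    for pid in "$@"; do
      # kill -0 sends no signal; it only checks whether the PID still exists.
      kill -0 "$pid" 2>/dev/null && alive=1
    done
    [ "$alive" -eq 0 ] && return 0
    sleep 1
    waited=$((waited + 1))
  done

  # Last resort: SIGKILL cannot be caught or ignored.
  kill -KILL "$@" 2>/dev/null || true
  return 0
}
```

A possible call, using pgrep to look up the PIDs by exact process name, is: stop_gracefully 5 $(pgrep -x java)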


Parquet Tools

2023-01-12

Ranga Reddy

Tools

Tools Parquet

Introduction
parquet-tools jar
- Download the parquet-tools jar
- Build the parquet-tools jar
Usage
- Using java command
- Using hadoop command
  - Help
Commands
References

Introduction

parquet-tools is a simple Java-based tool for extracting the data and metadata (file metadata, column chunk metadata, and page header metadata) from Parquet files. It can read Parquet files from the local filesystem, HDFS, or S3.

parquet-tools jar

There are two ways to get the parquet-tools.jar:

- Download the parquet-tools.jar, or
- Build the parquet-tools.jar

Download the parquet-tools jar

The easiest way to get the parquet-tools jar is to download it from the Maven Central repository.

Example:

wget -O parquet-tools.jar https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.11.2/parquet-tools-1.11.2.jar

Build the parquet-tools jar

Another way to get the parquet-tools jar is to build it from the source code.

The Apache Thrift compiler needs to be installed before building the Maven project.

git clone https://github.com/apache/parquet-mr.git
cd parquet-mr
git checkout apache-parquet-1.11.2
cd parquet-tools/
mvn clean package -Plocal
cp target/parquet-tools-1.11.2.jar parquet-tools.jar

The -Plocal profile bundles the required dependencies into the jar, so it can be run directly with java -jar.

Usage

Using `java` command

java -jar parquet-tools.jar <COMMAND> [option...] <input>

If the jar does not bundle its dependencies, the additional libraries have to be added to the classpath. Because java ignores -cp when -jar is given, put parquet-tools.jar on the classpath as well and name the main class explicitly. For example:

java -cp parquet-tools.jar:commons-cli.jar:commons-collections.jar:commons-configuration.jar:commons-io.jar:commons-lang.jar:commons-logging.jar:guava.jar:hadoop-core.jar:jackson-core.jar:jackson-core-asl.jar:jackson-databind.jar:jackson-mapper-asl.jar:parquet-format-*-incubating.jar:parquet-hadoop-*.jar org.apache.parquet.tools.Main <COMMAND> [option...] <input>

Using `hadoop` command

hadoop jar parquet-tools.jar <COMMAND> [option...] <input>

Help

parquet-tools.jar prints help when invoked without parameters or with the “--help” or “-h” parameter:

hadoop jar parquet-tools.jar --help

To print the help of a specific command use the following syntax:

hadoop jar parquet-tools.jar <COMMAND> --help

Commands

Commands:

Name	Description
cat	Prints out content for a given parquet file
dump	Prints out row groups and metadata for a given parquet file
head	Prints out the first n records for a given parquet file
help	Prints this message or the help of the given subcommand(s)
merge	Merges multiple Parquet files into one Parquet file
meta	Prints out metadata for a given parquet file
rowcount	Prints the count of rows in Parquet file(s)
schema	Prints out the schema for a given parquet file
size	Prints the size of Parquet file(s)

Generic options:

Name	Description
--debug	Enable debug output.
-h, --help	Show this help string.
--no-color	Disable color output even if supported.

To run it on Hadoop, use “hadoop jar” instead of “java -jar”.

`cat` command

Prints the content of a Parquet file. The output contains only the data; no metadata is displayed.

Usage:

hadoop jar parquet-tools.jar cat [option...] <input>

where option is one of:

       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported

where <input> is the parquet file to print to stdout

Example:

$ hadoop jar parquet-tools.jar cat hdfs://localhost:8020/employees.parquet

2023-01-17 11:34:34,451 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
2023-01-17 11:34:34,452 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2023-01-17 11:34:34,487 INFO hadoop.InternalParquetRecordReader: block read in memory in 35 ms. row count = 10
email = rangareddy@yahoo.com
employee_id = 1
first_name = Ranga
hire_date = 21-Jun-07
last_name = Reddy
phone_number = 99509833
salary = 2600

email = raja@gmail.com
employee_id = 2
first_name = Raja Sekhar
hire_date = 13-Jan-08
last_name = Reddy
manager_id = 1
phone_number = 75050798
salary = 2600
...........

Print the content in JSON format:

hadoop jar parquet-tools.jar cat --json hdfs://localhost:8020/employees.parquet

2023-01-18 04:17:47,118 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
2023-01-18 04:17:47,118 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2023-01-18 04:17:47,149 INFO hadoop.InternalParquetRecordReader: block read in memory in 31 ms. row count = 10
{"email":"rangareddy@yahoo.com","employee_id":1,"first_name":"Ranga","hire_date":"21-Jun-07","last_name":"Reddy","phone_number":99509833,"salary":2600}
{"email":"raja@gmail.com","employee_id":2,"first_name":"Raja Sekhar","hire_date":"13-Jan-08","last_name":"Reddy","manager_id":1,"phone_number":75050798,"salary":2600}
{"email":"vasu@gmail.com","employee_id":3,"first_name":"Vasundra","hire_date":"17-Sep-03","last_name":"Reddy","manager_id":1,"phone_number":91512344,"salary":4400}
{"email":"meena@test.com","employee_id":4,"first_name":"Meena","hire_date":"17-Feb-04","last_name":"P","manager_id":2,"phone_number":81535555,"salary":13000}
{"email":"manu@rediff.com","employee_id":5,"first_name":"Manoj","hire_date":"17-Aug-05","last_name":"Kumar","manager_id":3,"phone_number":60312366,"salary":6000}
{"email":"vinod@zoho.com","employee_id":6,"first_name":"Vinod","hire_date":"07-Jun-02","last_name":"Kumar","manager_id":3,"phone_number":71237777,"salary":6500}
{"email":"rajar@yahoo.co.in","employee_id":7,"first_name":"Raja","hire_date":"07-Jun-02","last_name":"Reddy","manager_id":4,"phone_number":91518888,"salary":10000}
{"email":"shiva@mymail.com","employee_id":8,"first_name":"Shiva","hire_date":"07-Jun-02","last_name":"P","manager_id":6,"phone_number":81512380,"salary":12008}
{"email":"babu@mail.com","employee_id":9,"first_name":"Reddy","hire_date":"07-Jun-02","last_name":"Babu","manager_id":7,"phone_number":91528181,"salary":8300}
{"email":"nish@nish.com","employee_id":10,"first_name":"Nishanth","hire_date":"17-Jun-03","last_name":"Reddy","manager_id":2,"phone_number":61512347,"salary":24000}
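When the JSON output is piped into another program, the INFO log lines shown above can get in the way. A minimal sketch, assuming every record is printed on its own line starting with "{" as in the sample output, filters the stream down to the records only. The helper name json_records is hypothetical, and the second input line in the demo is a shortened record.

```shell
#!/usr/bin/env bash
# json_records is a hypothetical helper name: it keeps only the lines that
# start with "{" (the JSON records) and drops everything else, such as log lines.
json_records() {
  grep '^{'
}

# Demo with one log line and one shortened record:
printf '%s\n' \
  '2023-01-18 04:17:47,118 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block' \
  '{"email":"rangareddy@yahoo.com","employee_id":1}' | json_records
```

With a real file, the same filter could be used as: hadoop jar parquet-tools.jar cat --json hdfs://localhost:8020/employees.parquet | json_records | wc -l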

`head` command

Prints the first n records of the Parquet file (default: 5)

Usage:

hadoop jar parquet-tools.jar head [option...] <input>

where option is one of:

       --debug          Enable debug output
    -h,--help           Show this help string
    -n,--records <arg>  The number of records to show (default: 5)
       --no-color       Disable color output even if supported

where <input> is the parquet file to print to stdout

Example:

$ hadoop jar parquet-tools.jar head hdfs://localhost:8020/employees.parquet

2023-01-17 11:36:14,668 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
2023-01-17 11:36:14,668 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2023-01-17 11:36:14,699 INFO hadoop.InternalParquetRecordReader: block read in memory in 31 ms. row count = 10
email = rangareddy@yahoo.com
employee_id = 1
first_name = Ranga
hire_date = 21-Jun-07
last_name = Reddy
phone_number = 99509833
salary = 2600

email = raja@gmail.com
employee_id = 2
first_name = Raja Sekhar
hire_date = 13-Jan-08
last_name = Reddy
manager_id = 1
phone_number = 75050798
salary = 2600

email = vasu@gmail.com
employee_id = 3
first_name = Vasundra
hire_date = 17-Sep-03
last_name = Reddy
manager_id = 1
phone_number = 91512344
salary = 4400

email = meena@test.com
employee_id = 4
first_name = Meena
hire_date = 17-Feb-04
last_name = P
manager_id = 2
phone_number = 81535555
salary = 13000

email = manu@rediff.com
employee_id = 5
first_name = Manoj
hire_date = 17-Aug-05
last_name = Kumar
manager_id = 3
phone_number = 60312366
salary = 6000

Print the first 2 records:

$ hadoop jar parquet-tools.jar head -n 2 hdfs://localhost:8020/employees.parquet

2023-01-17 11:37:15,303 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
2023-01-17 11:37:15,303 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2023-01-17 11:37:15,330 INFO hadoop.InternalParquetRecordReader: block read in memory in 26 ms. row count = 10
email = rangareddy@yahoo.com
employee_id = 1
first_name = Ranga
hire_date = 21-Jun-07
last_name = Reddy
phone_number = 99509833
salary = 2600

email = raja@gmail.com
employee_id = 2
first_name = Raja Sekhar
hire_date = 13-Jan-08
last_name = Reddy
manager_id = 1
phone_number = 75050798
salary = 2600

`schema` command

Prints the schema of Parquet file(s)

Usage:

hadoop jar parquet-tools.jar schema [option...] <input>

where option is one of:

    -d,--detailed      Show detailed information about the schema.
       --debug         Enable debug output
    -h,--help          Show this help string
       --no-color      Disable color output even if supported
    -o,--originalType  Print logical types in OriginalType representation.

where <input> is the parquet file containing the schema to show

Example:

$ hadoop jar parquet-tools.jar schema hdfs://localhost:8020/employees.parquet

message spark_schema {
  optional binary email (STRING);
  optional int64 employee_id;
  optional binary first_name (STRING);
  optional binary hire_date (STRING);
  optional binary last_name (STRING);
  optional int64 manager_id;
  optional int64 phone_number;
  optional int64 salary;
}

`meta` command

Prints the metadata of Parquet file(s)

Usage:

hadoop jar parquet-tools.jar meta [option...] <input>

where option is one of:

       --debug         Enable debug output
    -h,--help          Show this help string
       --no-color      Disable color output even if supported
    -o,--originalType  Print logical types in OriginalType representation.

where <input> is the parquet file to print to stdout

Example:

$ hadoop jar parquet-tools.jar meta hdfs://localhost:8020/employees.parquet

2023-01-17 11:38:51,086 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
2023-01-17 11:38:51,089 INFO hadoop.ParquetFileReader: reading another 1 footers
2023-01-17 11:38:51,092 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:         hdfs://localhost:8020/employees.parquet 
creator:      parquet-mr version 1.10.99.7.1.7.1000-141 (build 12da67a00623b3abf03a62026e8d6d61dc21da37) 
extra:        org.apache.spark.version = 2.4.7 
extra:        org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"email","type":"string","nullable":true,"metadata":{}},{"name":"employee_id","type":"long","nullable":true,"metadata":{}},{"name":"first_name","type":"string","nullable":true,"metadata":{}},{"name":"hire_date","type":"string","nullable":true,"metadata":{}},{"name":"last_name","type":"string","nullable":true,"metadata":{}},{"name":"manager_id","type":"long","nullable":true,"metadata":{}},{"name":"phone_number","type":"long","nullable":true,"metadata":{}},{"name":"salary","type":"long","nullable":true,"metadata":{}}]} 

file schema:  spark_schema 
--------------------------------------------------------------------------------
email:        OPTIONAL BINARY L:STRING R:0 D:1
employee_id:  OPTIONAL INT64 R:0 D:1
first_name:   OPTIONAL BINARY L:STRING R:0 D:1
hire_date:    OPTIONAL BINARY L:STRING R:0 D:1
last_name:    OPTIONAL BINARY L:STRING R:0 D:1
manager_id:   OPTIONAL INT64 R:0 D:1
phone_number: OPTIONAL INT64 R:0 D:1
salary:       OPTIONAL INT64 R:0 D:1

row group 1:  RC:10 TS:959 OFFSET:4 
--------------------------------------------------------------------------------
email:         BINARY UNCOMPRESSED DO:0 FPO:4 SZ:215/215/1.00 VC:10 ENC:RLE,BIT_PACKED,PLAIN ST:[min: babu@mail.com, max: vinod@zoho.com, num_nulls: 0]
employee_id:   INT64 UNCOMPRESSED DO:0 FPO:219 SZ:105/105/1.00 VC:10 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 1, max: 10, num_nulls: 0]
first_name:    BINARY UNCOMPRESSED DO:0 FPO:324 SZ:126/126/1.00 VC:10 ENC:RLE,BIT_PACKED,PLAIN ST:[min: Manoj, max: Vinod, num_nulls: 0]
hire_date:     BINARY UNCOMPRESSED DO:0 FPO:450 SZ:137/137/1.00 VC:10 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 07-Jun-02, max: 21-Jun-07, num_nulls: 0]
last_name:     BINARY UNCOMPRESSED DO:0 FPO:587 SZ:73/73/1.00 VC:10 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: Babu, max: Reddy, num_nulls: 0]
manager_id:    INT64 UNCOMPRESSED DO:0 FPO:660 SZ:93/93/1.00 VC:10 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1, max: 7, num_nulls: 1]
phone_number:  INT64 UNCOMPRESSED DO:0 FPO:753 SZ:105/105/1.00 VC:10 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 60312366, max: 99509833, num_nulls: 0]
salary:        INT64 UNCOMPRESSED DO:0 FPO:858 SZ:105/105/1.00 VC:10 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 2600, max: 24000, num_nulls: 0]
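In the sample above, the first number of each column's SZ field (215, 105, 126, 137, 73, 93, 105, 105) adds up to 959, the TS value reported for the row group. The sketch below sums that first number from meta output read on standard input. The helper name sum_chunk_sizes is hypothetical, and the interpretation of the SZ field is inferred from this sample rather than from the tool's documentation.

```shell
#!/usr/bin/env bash
# sum_chunk_sizes is a hypothetical helper name.
# It reads `meta` output on stdin and adds up the first number of every
# SZ:<a>/<b>/<ratio> field; lines without such a field are ignored.
sum_chunk_sizes() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^SZ:[0-9]+\//) {
        split(substr($i, 4), parts, "/")   # strip the "SZ:" prefix, split on "/"
        total += parts[1]
      }
    }
  } END { print total + 0 }'
}

# Demo with the first two column lines from the sample output (trailing fields omitted):
printf '%s\n' \
  'email:         BINARY UNCOMPRESSED DO:0 FPO:4 SZ:215/215/1.00 VC:10' \
  'employee_id:   INT64 UNCOMPRESSED DO:0 FPO:219 SZ:105/105/1.00 VC:10' | sum_chunk_sizes   # prints 320
```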

`dump` command

Prints the content and metadata of a Parquet file

Usage:

hadoop jar parquet-tools.jar dump [option...] <input>

where option is one of:

    -c,--column <arg>  Dump only the given column, can be specified more than
                       once
    -d,--disable-data  Do not dump column data
       --debug         Enable debug output
    -h,--help          Show this help string
    -m,--disable-meta  Do not dump row group and page metadata
    -n,--disable-crop  Do not crop the output based on console width
       --no-color      Disable color output even if supported

where <input> is the parquet file to print to stdout

Example:

$ hadoop jar parquet-tools.jar dump hdfs://localhost:8020/employees.parquet

row group 0 
--------------------------------------------------------------------------------
email:         BINARY UNCOMPRESSED DO:0 FPO:4 SZ:215/215/1.00 VC:10 EN [more]...
employee_id:   INT64 UNCOMPRESSED DO:0 FPO:219 SZ:105/105/1.00 VC:10 E [more]...
first_name:    BINARY UNCOMPRESSED DO:0 FPO:324 SZ:126/126/1.00 VC:10  [more]...
hire_date:     BINARY UNCOMPRESSED DO:0 FPO:450 SZ:137/137/1.00 VC:10  [more]...
last_name:     BINARY UNCOMPRESSED DO:0 FPO:587 SZ:73/73/1.00 VC:10 EN [more]...
manager_id:    INT64 UNCOMPRESSED DO:0 FPO:660 SZ:93/93/1.00 VC:10 ENC [more]...
phone_number:  INT64 UNCOMPRESSED DO:0 FPO:753 SZ:105/105/1.00 VC:10 E [more]...
salary:        INT64 UNCOMPRESSED DO:0 FPO:858 SZ:105/105/1.00 VC:10 E [more]...

    email TV=10 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:                         DLE:RLE RLE:BIT_PACKED VLE:PLAIN S [more]... SZ:196

    employee_id TV=10 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:                         DLE:RLE RLE:BIT_PACKED VLE:PLAIN S [more]... SZ:86

    first_name TV=10 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:                         DLE:RLE RLE:BIT_PACKED VLE:PLAIN S [more]... SZ:107

    hire_date TV=10 RL=0 DL=1 DS:  7 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                         DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... SZ:14

    last_name TV=10 RL=0 DL=1 DS:  4 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                         DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... SZ:12

    manager_id TV=10 RL=0 DL=1 DS: 6 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                         DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY [more]... SZ:15

    phone_number TV=10 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:                         DLE:RLE RLE:BIT_PACKED VLE:PLAIN S [more]... SZ:86

    salary TV=10 RL=0 DL=1
    ----------------------------------------------------------------------------
    page 0:                         DLE:RLE RLE:BIT_PACKED VLE:PLAIN S [more]... SZ:86

BINARY email 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:rangareddy@yahoo.com
value 2:  R:0 D:1 V:raja@gmail.com
value 3:  R:0 D:1 V:vasu@gmail.com
value 4:  R:0 D:1 V:meena@test.com
value 5:  R:0 D:1 V:manu@rediff.com
value 6:  R:0 D:1 V:vinod@zoho.com
value 7:  R:0 D:1 V:rajar@yahoo.co.in
value 8:  R:0 D:1 V:shiva@mymail.com
value 9:  R:0 D:1 V:babu@mail.com
value 10: R:0 D:1 V:nish@nish.com

INT64 employee_id 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:1
value 2:  R:0 D:1 V:2
value 3:  R:0 D:1 V:3
value 4:  R:0 D:1 V:4
value 5:  R:0 D:1 V:5
value 6:  R:0 D:1 V:6
value 7:  R:0 D:1 V:7
value 8:  R:0 D:1 V:8
value 9:  R:0 D:1 V:9
value 10: R:0 D:1 V:10

BINARY first_name 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:Ranga
value 2:  R:0 D:1 V:Raja Sekhar
value 3:  R:0 D:1 V:Vasundra
value 4:  R:0 D:1 V:Meena
value 5:  R:0 D:1 V:Manoj
value 6:  R:0 D:1 V:Vinod
value 7:  R:0 D:1 V:Raja
value 8:  R:0 D:1 V:Shiva
value 9:  R:0 D:1 V:Reddy
value 10: R:0 D:1 V:Nishanth

BINARY hire_date 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:21-Jun-07
value 2:  R:0 D:1 V:13-Jan-08
value 3:  R:0 D:1 V:17-Sep-03
value 4:  R:0 D:1 V:17-Feb-04
value 5:  R:0 D:1 V:17-Aug-05
value 6:  R:0 D:1 V:07-Jun-02
value 7:  R:0 D:1 V:07-Jun-02
value 8:  R:0 D:1 V:07-Jun-02
value 9:  R:0 D:1 V:07-Jun-02
value 10: R:0 D:1 V:17-Jun-03

BINARY last_name 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:Reddy
value 2:  R:0 D:1 V:Reddy
value 3:  R:0 D:1 V:Reddy
value 4:  R:0 D:1 V:P
value 5:  R:0 D:1 V:Kumar
value 6:  R:0 D:1 V:Kumar
value 7:  R:0 D:1 V:Reddy
value 8:  R:0 D:1 V:P
value 9:  R:0 D:1 V:Babu
value 10: R:0 D:1 V:Reddy

INT64 manager_id 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:0 V:<null>
value 2:  R:0 D:1 V:1
value 3:  R:0 D:1 V:1
value 4:  R:0 D:1 V:2
value 5:  R:0 D:1 V:3
value 6:  R:0 D:1 V:3
value 7:  R:0 D:1 V:4
value 8:  R:0 D:1 V:6
value 9:  R:0 D:1 V:7
value 10: R:0 D:1 V:2

INT64 phone_number 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:99509833
value 2:  R:0 D:1 V:75050798
value 3:  R:0 D:1 V:91512344
value 4:  R:0 D:1 V:81535555
value 5:  R:0 D:1 V:60312366
value 6:  R:0 D:1 V:71237777
value 7:  R:0 D:1 V:91518888
value 8:  R:0 D:1 V:81512380
value 9:  R:0 D:1 V:91528181
value 10: R:0 D:1 V:61512347

INT64 salary 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:2600
value 2:  R:0 D:1 V:2600
value 3:  R:0 D:1 V:4400
value 4:  R:0 D:1 V:13000
value 5:  R:0 D:1 V:6000
value 6:  R:0 D:1 V:6500
value 7:  R:0 D:1 V:10000
value 8:  R:0 D:1 V:12008
value 9:  R:0 D:1 V:8300
value 10: R:0 D:1 V:24000

`merge` command

Merges multiple Parquet files into one. The command doesn’t merge row groups, just places one after the other. When used to merge many small files, the resulting file will still contain small row groups, which usually leads to bad query performance.

Usage:

hadoop jar parquet-tools.jar merge [option...] <input> [<input> ...] <output>

where option is one of:

       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported

where <input> is the source parquet files/directory to be merged, and <output> is the destination parquet file

Example:

$ hadoop jar parquet-tools.jar merge hdfs://localhost:8020/test_1.parquet hdfs://localhost:8020/test_2.parquet hdfs://localhost:8020/test_output.parquet

Warning: file hdfs://localhost:8020/test_1.parquet is too small, length: 490
Warning: file hdfs://localhost:8020/test_2.parquet is too small, length: 490
Warning: you merged too small files. Although the size of the merged file is bigger, it STILL contains small row groups, thus you don't have the advantage of big row groups, which usually leads to bad query performance!

Print the merged file content:

$ hadoop jar parquet-tools.jar cat --json hdfs://localhost:8020/test_output.parquet

2023-01-18 04:22:30,439 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 8 records.
2023-01-18 04:22:30,439 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
2023-01-18 04:22:30,465 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
2023-01-18 04:22:30,482 INFO hadoop.InternalParquetRecordReader: block read in memory in 43 ms. row count = 4
{"id":1}
{"id":2}
{"id":3}
{"id":4}
2023-01-18 04:22:30,829 INFO hadoop.InternalParquetRecordReader: Assembled and processed 4 records from 1 columns in 46 ms: 0.08695652 rec/ms, 0.08695652 cell/ms
2023-01-18 04:22:30,829 INFO hadoop.InternalParquetRecordReader: time spent so far 48% reading (43 ms) and 51% processing (46 ms)
2023-01-18 04:22:30,829 INFO hadoop.InternalParquetRecordReader: at row 4. reading next block
2023-01-18 04:22:30,830 INFO hadoop.InternalParquetRecordReader: block read in memory in 1 ms. row count = 4
{"id":6}
{"id":7}
{"id":8}
{"id":9}

`rowcount` command

Prints the count of rows in a Parquet file

Usage:

hadoop jar parquet-tools.jar rowcount [option...] <input>

where option is one of:

    -d,--detailed  Detailed rowcount of each matching file
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported

where <input> is the parquet file to count rows to stdout

Example:

$ hadoop jar parquet-tools.jar rowcount hdfs://localhost:8020/employees.parquet

2023-01-17 11:55:08,389 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
2023-01-17 11:55:08,393 INFO hadoop.ParquetFileReader: reading another 1 footers
2023-01-17 11:55:08,395 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
Total RowCount: 10

`size` command

Prints the size of Parquet file(s)

Usage:

hadoop jar parquet-tools.jar size [option...] <input>

where option is one of:

    -d,--detailed      Detailed size of each matching file
       --debug         Enable debug output
    -h,--help          Show this help string
       --no-color      Disable color output even if supported
    -p,--pretty        Pretty size
    -u,--uncompressed  Uncompressed size

where <input> is the parquet file to get size & human readable size to stdout

Example:

$ hadoop jar parquet-tools.jar size hdfs://localhost:8020/employees.parquet

2023-01-17 11:55:59,432 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
2023-01-17 11:55:59,578 INFO hadoop.ParquetFileReader: reading another 1 footers
2023-01-17 11:55:59,642 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
Total Size: 959 bytes

References

https://github.com/apache/parquet-mr/tree/parquet-1.8.0rc1/parquet-tools/
https://mvnrepository.com/artifact/org.apache.parquet/parquet-tools


Spark Submit Command Formatter tool

2023-01-05

Ranga Reddy

Spark

Spark Utilities
Spark Submit Command Formatter/Minifier

This tool formats a spark-submit command into a readable multi-line layout, or minifies it back into a single line.

Spark Submit Command

Formatted Spark Submit Command

Spark Submit Command Parameters

Parameter Name Parameter Value

Spark Submit Additional (Command Line) Parameters

Parameter Name Parameter Value
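The idea of formatting and minifying can be sketched in shell: formatting breaks the command before every " --" option and adds a line-continuation backslash, and minifying joins the lines again. The function names format_spark_submit and minify_spark_submit are hypothetical, and the approach is deliberately naive: it would also split a quoted value that happens to contain " --".

```shell
#!/usr/bin/env bash
# Both function names are hypothetical; this is a naive text transformation.

# Put every " --option" on its own line, ending the previous line with a backslash.
format_spark_submit() {
  printf '%s\n' "$1" | awk '{ gsub(/ --/, " \\\\\n  --"); print }'
}

# Reverse step: drop trailing backslashes, join the lines, squeeze repeated spaces.
minify_spark_submit() {
  sed -e 's/\\$//' | tr '\n' ' ' | tr -s ' ' | sed -e 's/ $//'
}

format_spark_submit 'spark-submit --master yarn --deploy-mode cluster app.jar'
```

Piping the formatted text through minify_spark_submit gives back the original single-line command.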