Ray.io + Apache Hudi

5 min read · Apr 9, 2025

Historically, documentation for using Apache Hudi with Ray.io has been limited due to the absence of a native Python library for Hudi. However, the recent release of hudi-rs now enables Python applications to read Hudi datasets.

The following example demonstrates how to execute a Ray job that reads Apache Hudi files. Make sure the runtime environment includes the hudi Python package (pip install hudi, or the "pip" entry in --runtime-env-json as shown here) and that the AWS region is set in storage_options before running this code. The example includes an optional parameter that requests 2 GPUs for the job. You can remove this parameter if GPU acceleration is not required.

Ray Job example

ray job submit --entrypoint-num-gpus=2 --runtime-env-json='{"pip": ["hudi"]}' --address http://localhost:8265 -- python -c "import ray; storage_options = {\"aws_region\": \"us-west-2\"}; table_uri = \"s3a://albertbucket/albertfolder/albertdb/alberttable/v1\"; ds = ray.data.read_hudi(table_uri=table_uri, storage_options=storage_options); ds.show()"
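The hand-escaped quotes in the one-liner above are easy to get wrong. As a minimal sketch, the same kind of command can be assembled in Python so that `json.dumps` and `shlex.join` handle the quoting. The helper name `build_submit_command` is hypothetical, and `--entrypoint-num-gpus` is assumed here to be the `ray job submit` option that reserves GPUs for the entrypoint; leave `num_gpus` unset when no GPU is needed.

```python
import json
import shlex


def build_submit_command(table_uri, aws_region,
                         address="http://localhost:8265", num_gpus=None):
    """Return a shell-safe `ray job submit` command that shows a Hudi table."""
    storage_options = {"aws_region": aws_region}
    # Inline driver script; repr() turns the values into valid Python literals.
    script = (
        "import ray; "
        f"ds = ray.data.read_hudi(table_uri={table_uri!r}, "
        f"storage_options={storage_options!r}); "
        "ds.show()"
    )
    cmd = ["ray", "job", "submit", "--address", address,
           "--runtime-env-json", json.dumps({"pip": ["hudi"]})]
    if num_gpus is not None:
        # Assumed flag name; omitted entirely when no GPU is requested.
        cmd += ["--entrypoint-num-gpus", str(num_gpus)]
    cmd += ["--", "python", "-c", script]
    # shlex.join quotes each argument for a POSIX shell.
    return shlex.join(cmd)


print(build_submit_command(
    "s3a://albertbucket/albertfolder/albertdb/alberttable/v1",
    "us-west-2",
    num_gpus=2,
))
```

The printed string can be pasted into a shell as-is, which avoids nesting escaped double quotes inside `python -c "..."`.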

Ray Job example using a Python file

Save this as task.py

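import ray
import requests


@ray.remote
def get_requests_version():
    # The AWS region must be set so the reader can reach the S3 bucket.
    storage_options = {"aws_region": "us-west-2"}
    table_uri = "s3a://albertbucket/albertfolder/albertdb/alberttable/v1"
    # Read the Hudi table and print the first 20 rows to the job log.
    ray.data.read_hudi(table_uri=table_uri, storage_options=storage_options).show(20)
    return requests.__version__


# Connect to the cluster (the address comes from RAY_ADDRESS when run as a job).
ray.init()
print("requests version:", ray.get(get_requests_version.remote()))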

Run

export RAY_ADDRESS="http://127.0.0.1:8265"   
ray job submit --working-dir . --runtime-env-json='{"pip": ["hudi"]}' -- python task.py
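The same submission can also be made from Python instead of the CLI. The sketch below assumes Ray's `JobSubmissionClient` from `ray.job_submission` and the dashboard address used above; `submit_kwargs` and `submit_task` are hypothetical helper names. Ray is imported inside the function so that the argument-building part can be checked without a cluster.

```python
def submit_kwargs(entrypoint="python task.py", working_dir=".", pip=("hudi",)):
    """Build keyword arguments that mirror the CLI flags used above."""
    return {
        "entrypoint": entrypoint,
        # Same content as --working-dir . and --runtime-env-json='{"pip": ["hudi"]}'
        "runtime_env": {"working_dir": working_dir, "pip": list(pip)},
    }


def submit_task(address="http://127.0.0.1:8265"):
    """Submit task.py to a running cluster and return the submission ID."""
    # Imported lazily: needs Ray installed and a reachable dashboard address.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(**submit_kwargs())


if __name__ == "__main__":
    print(submit_kwargs())
```

Calling `submit_task()` from a machine that can reach the dashboard should behave like the `ray job submit` command, with the runtime environment expressed as a dictionary rather than a JSON string.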

Output

2025-04-28 10:07:00,577 INFO job_manager.py:530 -- Runtime env is setting up.
2025-04-28 10:07:02,441 INFO worker.py:1514 -- Using address 10.0.118.196:6379 set in the environment variable RAY_ADDRESS
2025-04-28 10:07:02,441 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.118.196:6379...
2025-04-28 10:07:02,452 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at 10.0.118.196:8265 
2025-04-28 10:08:31,203 INFO dataset.py:2699 -- Tip: Use `take_batch()` instead of `take() / show()` to return records in pandas or numpy batch format.
2025-04-28 10:09:53,305 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-04-28_00-10-57_134539_1/logs/ray-data
2025-04-28 10:09:53,305 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadHudi] -> LimitOperator[limit=20]

Running 0: 0.00 row [00:00, ? row/s]
- ReadHudi->SplitBlocks(2) 1: 0.00 row [00:00, ? row/s]
- limit=20 2: 0.00 row [00:00, ? row/s]
Running Dataset. Active & requested resources: 0/4 CPU, 0.0B/2.4GB object store: : 0.00 row [01:21, ? row/s]
- ReadHudi->SplitBlocks(2): Tasks: 4 [backpressured]; Queued blocks: 150; Resources: 4.0 CPU, 1.0GB object store: : 0.00 row [00:01, ? row/s]
- limit=20: Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store: : 0.00 row [00:01, ? row/s]
Running Dataset. Active & requested resources: 4/4 CPU, 1.0GB/2.4GB object store: : 0.00 row [01:22, ? row/s]
- ReadHudi->SplitBlocks(2): Tasks: 4 [backpressured]; Queued blocks: 150; Resources: 4.0 CPU, 1.0GB object store: : 0.00 row [00:02, ? row/s]
- limit=20: Tasks: 0; Queued blocks: 0; Resources: 0.0 CPU, 0.0B object store: : 0.00 row [00:02, ? row/s]
Running Dataset. Active & requested resources: 4/4 CPU, 1.0GB/2.4GB object store: : 0.00 row [01:23, ? row/s]
✔️ Dataset execution finished in 84.35 seconds: : 0.00 row [01:24, ? row/s]
✔️ Dataset execution finished in 84.35 seconds: 0%| | 0.00/20.0 [01:24<?, ? row/s]
✔️ Dataset execution finished in 84.35 seconds: 100%|██████████| 20.0/20.0 [01:24<00:00, 4.22s/ row]
- ReadHudi->SplitBlocks(2): Tasks: 4 [backpressured]; Queued blocks: 147; Resources: 4.0 CPU, 265.6KB object store: : 0.00 row [00:02, ? row/s]
- ReadHudi->SplitBlocks(2): Tasks: 4 [backpressured]; Queued blocks: 147; Resources: 4.0 CPU, 265.6KB object store: 0%| | 0.00/583 [00:02<?, ? row/s]
- ReadHudi->SplitBlocks(2): Tasks: 4 [backpressured]; Queued blocks: 147; Resources: 4.0 CPU, 265.6KB object store: 100%|██████████| 583/583 [00:02<00:00, 204 row/s]
- limit=20: Tasks: 0; Queued blocks: 4; Resources: 0.0 CPU, 0.0B object store: : 0.00 row [00:02, ? row/s]
- limit=20: Tasks: 0; Queued blocks: 4; Resources: 0.0 CPU, 0.0B object store: 0%| | 0.00/20.0 [00:02<?, ? row/s]
- limit=20: Tasks: 0; Queued blocks: 4; Resources: 0.0 CPU, 0.0B object store: 100%|██████████| 20.0/20.0 [00:02<00:00, 7.01 row/s]
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_0', '_hoodie_record_key': '20250320165301144_2_21', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":919,"page_url":"www.google.com","view_count":1}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_1', '_hoodie_record_key': '20250320165301144_2_22', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'click', 'properties': '{"user_id":907,"page_url":"www.instagram.com","view_count":59}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_2', '_hoodie_record_key': '20250320165301144_2_23', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'conversion', 'properties': '{"user_id":398,"page_url":"www.amazon.com","view_count":2}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_3', '_hoodie_record_key': '20250320165301144_2_24', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'click', 'properties': '{"user_id":357,"page_url":"www.youtube.com","view_count":86}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_4', '_hoodie_record_key': '20250320165301144_2_25', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'conversion', 'properties': '{"user_id":788,"page_url":"www.tumblr.com","view_count":67}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_5', '_hoodie_record_key': '20250320165301144_2_26', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'conversion', 'properties': '{"user_id":165,"page_url":"www.google.com","view_count":28}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_6', '_hoodie_record_key': '20250320165301144_2_27', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'conversion', 'properties': '{"user_id":66,"page_url":"www.linkedin.com","view_count":65}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_7', '_hoodie_record_key': '20250320165301144_2_28', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":569,"page_url":"www.instagram.com","view_count":64}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_8', '_hoodie_record_key': '20250320165301144_2_29', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'click', 'properties': '{"user_id":702,"page_url":"www.facebook.com","view_count":63}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_9', '_hoodie_record_key': '20250320165301144_2_3', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'conversion', 'properties': '{"user_id":763,"page_url":"www.twitter.com","view_count":60}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_10', '_hoodie_record_key': '20250320165301144_2_30', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'conversion', 'properties': '{"user_id":679,"page_url":"www.tumblr.com","view_count":9}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_11', '_hoodie_record_key': '20250320165301144_2_31', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'conversion', 'properties': '{"user_id":274,"page_url":"www.instagram.com","view_count":64}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_12', '_hoodie_record_key': '20250320165301144_2_32', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'click', 'properties': '{"user_id":369,"page_url":"www.pinterest.com","view_count":54}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_13', '_hoodie_record_key': '20250320165301144_2_33', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":155,"page_url":"www.facebook.com","view_count":39}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_14', '_hoodie_record_key': '20250320165301144_2_34', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":124,"page_url":"www.google.com","view_count":7}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_15', '_hoodie_record_key': '20250320165301144_2_35', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":729,"page_url":"www.twitter.com","view_count":56}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_16', '_hoodie_record_key': '20250320165301144_2_36', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":571,"page_url":"www.netflix.com","view_count":8}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_17', '_hoodie_record_key': '20250320165301144_2_37', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":978,"page_url":"www.linkedin.com","view_count":20}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_18', '_hoodie_record_key': '20250320165301144_2_38', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":550,"page_url":"www.amazon.com","view_count":46}'}
{'_hoodie_commit_time': '20250320165301144', '_hoodie_commit_seqno': '20250320165301144_5_19', '_hoodie_record_key': '20250320165301144_2_39', '_hoodie_partition_path': '', '_hoodie_file_name': 'b3ab5aad-631a-44bf-8264-6873eb5d33e8-0_5-725-1651_20250320165301144.parquet', 'event': 'pageview', 'properties': '{"user_id":52,"page_url":"www.netflix.com","view_count":33}'}
None
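In the rows above, `properties` arrives as a JSON string rather than as a nested record. A minimal sketch of flattening it, assuming every row carries valid JSON in that column (the helper name `parse_properties` is hypothetical):

```python
import json


def parse_properties(row):
    """Return a copy of the row with the JSON in `properties` merged in as columns."""
    flat = {key: value for key, value in row.items() if key != "properties"}
    flat.update(json.loads(row["properties"]))
    return flat


# Trimmed copy of the first row printed above.
row = {
    "_hoodie_commit_time": "20250320165301144",
    "event": "pageview",
    "properties": '{"user_id":919,"page_url":"www.google.com","view_count":1}',
}
print(parse_properties(row))
```

Because the function maps one row dictionary to another, it could likely be applied to the whole dataset with `ds.map(parse_properties)` before further processing.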


Written by Albert Wong
