How to Process Data in Batches in Python

Sometimes you have a huge list of items to process, but you cannot process them all in one go because of limitations in the systems that handle the list.

Some examples:

  • When you need to access an API that accepts only 100 items per request, you need to split your original list into lists of 100 items and combine the results.
  • You have a long list of items that you want to process in parallel. You can split it into as many sub lists as you have sub processes and process them independently.

Splitting a list into sub lists of a fixed length takes a single list comprehension:

long_list = list(range(100))
sub_list_length = 10
sub_lists = [
    long_list[i : i + sub_list_length]
    for i in range(0, len(long_list), sub_list_length)
]

Let us break down the code.

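long_list is a list of 100 numbers. We split it into sub lists of sub_list_length (here 10) items each. The list comprehension takes a slice of the original list starting at each index produced by range(0, len(long_list), sub_list_length), so the last sub list is shorter when the length of long_list is not a multiple of sub_list_length.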

print(long_list)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
print(sub_lists)
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [20, 21, 22, 23, 24, 25, 26, 27, 28, 29], [30, 31, 32, 33, 34, 35, 36, 37, 38, 39], [40, 41, 42, 43, 44, 45, 46, 47, 48, 49], [50, 51, 52, 53, 54, 55, 56, 57, 58, 59], [60, 61, 62, 63, 64, 65, 66, 67, 68, 69], [70, 71, 72, 73, 74, 75, 76, 77, 78, 79], [80, 81, 82, 83, 84, 85, 86, 87, 88, 89], [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]]
print(list(range(0, len(long_list), sub_list_length)))
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
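
The slicing approach above needs a list with a known length. As a minimal sketch, assuming you may also want to batch iterables that cannot be sliced (a generator, for example), the same idea can be written as a generator built on itertools.islice. The name iter_batches is hypothetical and not part of the original example.

```python
from itertools import islice


def iter_batches(iterable, batch_size):
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice pulls up to batch_size items without needing len() or slicing.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


long_list = list(range(25))
sub_lists = list(iter_batches(long_list, 10))
print([len(sub_list) for sub_list in sub_lists])  # [10, 10, 5]
```

On Python 3.12 and later, itertools.batched covers the same need and yields tuples instead of lists.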

Processing the Sub Lists

results = []
for sub_list in sub_lists:
    partial_result = process_function(sub_list)
    results.append(partial_result)

Here, process_function can be any function that processes a sub list. In our examples, this would be the call to the API or the work handed to a sub process.
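
As a rough sketch of the parallel use case, assume a hypothetical process_function that simply squares each number (a stand-in for a real API call or worker task). The sub lists can then be sized from the number of workers, processed concurrently, and flattened back into one list. A thread pool from concurrent.futures is used here instead of sub processes to keep the example short; num_workers and the body of process_function are assumptions.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def process_function(sub_list):
    # Hypothetical stand-in for an API call or a worker task.
    return [item * item for item in sub_list]


long_list = list(range(100))
num_workers = 4  # assumed number of parallel workers

# Size each sub list so that there is at most one sub list per worker.
sub_list_length = math.ceil(len(long_list) / num_workers)
sub_lists = [
    long_list[i : i + sub_list_length]
    for i in range(0, len(long_list), sub_list_length)
]

# executor.map returns the partial results in the same order as sub_lists.
with ThreadPoolExecutor(max_workers=num_workers) as executor:
    partial_results = list(executor.map(process_function, sub_lists))

# Combine the partial results into a single flat list.
results = list(chain.from_iterable(partial_results))
print(len(sub_lists), len(results))  # 4 100
```

For CPU-bound work, ProcessPoolExecutor from the same module offers the same map interface, provided process_function is defined at module level so it can be sent to the worker processes.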

Bonus Tip: Keeping track of Progress

Additionally, you can keep track of progress by wrapping the for loop's iterable with tqdm, an open source library that displays a progress bar along with the elapsed time and the iteration rate. It works with any iterable, not just lists.

from tqdm import tqdm
results = []
for sub_list in tqdm(sub_lists):
    partial_result = process_function(sub_list)
    results.append(partial_result)
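
Since tqdm accepts any iterable, it can also wrap a generator, but then it cannot know the number of iterations in advance. A minimal sketch, assuming the sub lists are produced lazily: pass the expected count through tqdm's total argument so the bar can still show a percentage. The body of process_function and the fallback for a missing tqdm install are assumptions made for illustration.

```python
import math

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm installed (no bar shown).
    def tqdm(iterable, **kwargs):
        return iterable


def process_function(sub_list):
    # Hypothetical stand-in for the real processing step.
    return sum(sub_list)


long_list = list(range(100))
sub_list_length = 10

# A generator expression: sub lists are created one at a time.
sub_lists = (
    long_list[i : i + sub_list_length]
    for i in range(0, len(long_list), sub_list_length)
)
# Number of sub lists, computed up front because a generator has no len().
total = math.ceil(len(long_list) / sub_list_length)

results = []
for sub_list in tqdm(sub_lists, total=total, desc="Batches"):
    results.append(process_function(sub_list))
```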

Progress Bar

This has come in handy for me on several occasions.

Cover Photo from CHUTTERSNAP on Unsplash

Comments (2)

Héctor Ramírez

Besides presenting a nice technique for data processing (my, my! Back in the '70s I used to work with batch jobs using punched card stacks), you also made me discover a very nice open source library, tqdm, that is (was, really) an undiscovered treasure for my toolbox. Keep up the nice work, Nithish!

Nithish Raghunandanan

Thanks for your kind words. I can imagine your pain working with the batch jobs using punched cards. tqdm is definitely one of the nice libraries that you just cannot stop using once you learn about it :)