There is a data size constraint in protobuf: it uses an `int` for the buffer size, which limits a single serialized message to 2 GB.
There are two possible workarounds, both built on the same idea:
Flush to the same stream in batches, so each `writeTo` call stays well below the limit. For example, flush every 1 million rows, or whenever the serialized size grows above 268 MB (2^28 bytes):
```java
while (rs != null && rs.next()) {
    // build one entry from the current row and add it to the builder (elided)
    models.addModels(..newBuilder().set...(rs.getString("..")) ... .build());

    if (++rowCount >= 1_000_000) {
        // alternative condition, also capping the batch at 2^28 bytes (~268 MB):
        // if (rowCount >= 1_000_000 || models.build().getSerializedSize() > Math.pow(2, 28)) {
        rowCount = 0;
        // flush this batch by appending it to the same file
        try (FileOutputStream fos = new FileOutputStream(Constants.MODEL_PB_FILE, true)) {
            models.build().writeTo(fos);
        } catch (IOException e) {
            e.printStackTrace();
        }
        models.clear();
    }
}
```
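Note that once the result set is exhausted, the last partial batch is still sitting in the builder, so it needs one final flush after the loop. A minimal sketch, assuming the repeated field is named `models` so the builder exposes the generated `getModelsCount()` accessor:

```java
// after the while loop: write out whatever is left in the builder
if (models.getModelsCount() > 0) {
    try (FileOutputStream fos = new FileOutputStream(Constants.MODEL_PB_FILE, true)) {
        models.build().writeTo(fos);
    } catch (IOException e) {
        e.printStackTrace();
    }
    models.clear();
}
```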
Alternatively, the batches could be written to, and later read back from, separate files (one stream per batch):
```java
if (++rowCount >= 1_000_000) {
    rowCount = 0;
    // flush this batch into its own numbered file
    try (FileOutputStream fos = new FileOutputStream(CACHE_FILE + currentFileIndex++, true)) {
        models.build().writeTo(fos);
    } catch (IOException e) {
        e.printStackTrace();
    }
    models.clear();
}
```
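Writing each batch to its own numbered file keeps every file well under the 2 GB limit and lets the files be parsed independently, which is what makes the parallel read below possible.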
The read side is batched the same way:
```java
Files.list(Paths.get(Constants.CACHE_FILE_DIR + File.separator + Constants.PB_FILE))
     .filter(Files::isRegularFile)
     .map(Path::toFile)
     .filter(file -> file.getName().startsWith(Constants.PB_FILE))
     .parallel()
     .map(file -> readFile(file))
     .reduce(....)
```
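For completeness, here is a minimal sketch of what `readFile` could look like; the `Models` message type and the merge-based reduction are assumptions, not part of the original code:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper (name taken from the pipeline above): parses one batch
// file back into a Models message. Models is an assumed generated type.
static Models readFile(File file) {
    try (FileInputStream fis = new FileInputStream(file)) {
        return Models.parseFrom(fis);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
```

The `reduce` step could then merge the per-file messages, e.g. `.reduce(Models.getDefaultInstance(), (a, b) -> a.toBuilder().mergeFrom(b).build())`, as long as the combined result itself stays below the 2 GB limit.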