I'm trying to read a huge Hive table with 13 million records using Groovy, where the data is stored in Parquet format. I used the following code, but I'm getting an OOM (Java heap space) error. I have given the JVM a maximum of 32 GB of memory and called setFetchSize(5000), but it still runs out of memory.

JAVA_OPTS="-Xms1024M"
JAVA_OPTS="-Xmx32556M"

Any help will be appreciated.

Code:

    String contSql = "select * from staging.cont_staging";
    ResultSet resRateRecords = stmt.executeQuery(contSql);
    resRateRecords.setFetchSize(5000);

    // Accumulates every row of the table in memory, keyed by contract_id plus a counter.
    Map<String, Map<String, String>> masterRecords = new HashMap<String, Map<String, String>>();
    Map<String, String> existingRecords = null;
    int count = 0;
    while (resRateRecords.next()) {
        try {
            existingRecords = new HashMap<String, String>();
            masterRecords.put(resRateRecords.getString("contract_id") + "#" + count++, existingRecords);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Error:

java.lang.OutOfMemoryError: Java heap space
        at org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:355)
        at org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:347)
        at org.apache.hive.service.cli.thrift.TStringColumn$TStringColumnStandardScheme.read(TStringColumn.java:453)
        at org.apache.hive.service.cli.thrift.TStringColumn$TStringColumnStandardScheme.read(TStringColumn.java:433)
        at org.apache.hive.service.cli.thrift.TStringColumn.read(TStringColumn.java:367)
        at org.apache.hive.service.cli.thrift.TColumn.standardSchemeReadValue(TColumn.java:328)
        at org.apache.thrift.TUnion$TUnionStandardScheme.read(TUnion.java:224)
        at org.apache.thrift.TUnion$TUnionStandardScheme.read(TUnion.java:213)
        at org.apache.thrift.TUnion.read(TUnion.java:138)
        at org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.read(TRowSet.java:573)
        at org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.read(TRowSet.java:525)
        at org.apache.hive.service.cli.thrift.TRowSet.read(TRowSet.java:451)
        at org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.read(TFetchResultsResp.java:518)
        at org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.read(TFetchResultsResp.java:486)
        at org.apache.hive.service.cli.thrift.TFetchResultsResp.read(TFetchResultsResp.java:408)
        at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.read(TCLIService.java:13251)
        at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.read(TCLIService.java:13236)
        at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result.read(TCLIService.java:13183)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at org.apache.hive.service.cli.thrift.TCLIService$Client.recv_FetchResults(TCLIService.java:505)
        at org.apache.hive.service.cli.thrift.TCLIService$Client.FetchResults(TCLIService.java:492)
        at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:335)
        at java_sql_ResultSet$next.call(Unknown Source)
        at BEContractRateLoad.fetchContractRateRecords(DestRateLoad.groovy:300)
        at BEContractRateLoad.processContractRecords(DestRateLoad.groovy:397)
        at BEContractRateLoad$processContractRecords$1.call(Unknown Source)
Groovy has reported an error, terminating
  • @Jens I already tried the solutions in the provided link, but I'm still facing the same issue. – marjun Dec 14 '18 at 08:46
  • You're storing all records inside `masterRecords`, it would be better to make a more specific query which returns only the records you're interested in. Loading all records inside `masterRecords` is what's causing the exception you're facing. – Mark Dec 14 '18 at 09:03
  • Since you are trying to iterate through all 13M records and store them in a single map object, there is a chance of an OOM. A couple of things can be done to improve efficiency: (1) use a StringBuffer to construct the string for the map key; (2) the existingRecords object is created anew in each iteration; it could be left as null and filled later, only when needed during further processing. However, it may still fail, as the record set is too big and may not fit in memory. – H Roy Dec 14 '18 at 09:37
  • @HRoy Thanks for the inputs. It would be helpful if you could provide a code snippet. – marjun Dec 14 '18 at 12:11
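
A minimal sketch of what the comments suggest, assuming an already-open Hive JDBC Connection named conn and a hypothetical per-row handler processRow: select only the column the original code actually reads, and process each row as it streams in instead of caching all 13 million rows in masterRecords.

    // Assumes conn is an open Hive JDBC connection created elsewhere.
    def stmt = conn.createStatement()
    stmt.fetchSize = 5000   // Groovy property form of setFetchSize(5000): fetch in 5000-row batches

    // Select only the column the code actually uses, rather than "select *".
    def rs = stmt.executeQuery("select contract_id from staging.cont_staging")
    try {
        while (rs.next()) {
            // Handle each row as it is read instead of storing it in a map,
            // so heap use is bounded by the fetch size, not by the 13M-row table.
            processRow(rs.getString("contract_id"))   // processRow is a hypothetical handler
        }
    } finally {
        rs.close()
        stmt.close()
    }

Whether this is sufficient depends on what is done per row; if the full result set genuinely must be materialized in memory at once, as the last comment notes, it may simply not fit in the heap regardless of fetch size.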

0 Answers