
I am trying to cross join two tables.

import os
import pandas as pd
import sys
import pyodbc

con = pyodbc.connect("DRIVER={NetezzaSQL};SERVER=netezza.xxx.com;PORT=5480;DATABASE=XXXXXX;UID=xxxxx;PWD=xxxxxxx;")

con.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
con.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
con.setdecoding(pyodbc.SQL_WMETADATA, encoding='utf-8')
con.setencoding(encoding='utf-8')


sql1 = """select * from table1"""
input1 = pd.read_sql_query(sql1,con)

      student_name
    0          abc
    1          def
    2          ghi

sql2 = """select * from table2"""
input2 = pd.read_sql_query(sql2,con)

      subjects
    0    maths
    1  english

What I need is a cross join of table1 and table2. If I had to use SQL, it would be:

   select table1.*, table2.* from table1 cross join table2

      student_name    subjects
    0          abc       maths
    0          abc     english
    1          def       maths
    1          def     english
    2          ghi       maths
    2          ghi     english

I am trying to get the above output using the line of code below, but it is throwing a MemoryError. Is there any way of improving the code?

input1.assign(foo=1).merge(input2.assign(foo=1)).drop('foo', axis=1)

Edited 3/28/2019: Since the above is not working, I changed the code to reduce the number of columns. For this I converted the dataframes to lists and looped over them to create the output as a list. But the code is very slow. Any help to improve performance? I have a dataframe with a key as the first column and five columns with values between 0 and 1. The second dataframe has a different key as its first column and five columns with 0 or 1 as values.

a = input5.values.tolist()
b = input6.values.tolist()
c = []

for i in range(len(a)):
    for j in range(len(b)):
        # column 0 holds the key; columns 1-5 hold the values
        denominator = (a[i][1]*float(b[j][1]) + a[i][2]*float(b[j][2])
                       + a[i][3]*float(b[j][3]) + a[i][4]*float(b[j][4])
                       + a[i][5]*float(b[j][5]))
        numerator = a[i][1] + a[i][2] + a[i][3] + a[i][4] + a[i][5]
        mean = numerator / denominator
        new_row = [a[i][0], b[j][0], mean]
        c.append(new_row)

len(c)
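
For reference, I understand the same pairwise calculation could be expressed without Python loops. The following is only a rough NumPy sketch, assuming the key is the first column and the five value columns are numeric; the result still has len(input5) * len(input6) rows, so it does not remove the memory constraint by itself.

    import numpy as np

    # Sketch only: key in column 0, five numeric value columns after it.
    keys_a = input5.iloc[:, 0].to_numpy()
    keys_b = input6.iloc[:, 0].to_numpy()
    vals_a = input5.iloc[:, 1:6].to_numpy(dtype=float)   # shape (n_a, 5)
    vals_b = input6.iloc[:, 1:6].to_numpy(dtype=float)   # shape (n_b, 5)

    numerator = vals_a.sum(axis=1)             # one value per row of input5
    denominator = vals_a @ vals_b.T            # pairwise dot products, shape (n_a, n_b)
    mean = numerator[:, None] / denominator    # same ratio as the loop, for every pair

    # Rebuild the (key_a, key_b, mean) triples in the same i-outer, j-inner order.
    c = pd.DataFrame({
        'key_a': np.repeat(keys_a, len(keys_b)),
        'key_b': np.tile(keys_b, len(keys_a)),
        'mean': mean.ravel(),
    })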
Raghu
  • Do you want to achieve the Cartesian product of the two columns? Check this: https://stackoverflow.com/questions/13269890/cartesian-product-in-pandas – Konrad Mar 27 '19 at 20:56
  • Thanks for sharing the link. I read the link before posting the question. – Raghu Mar 27 '19 at 21:40

1 Answer


Instead of reading from a database, I created both input DataFrames as:

input1 = pd.DataFrame({'student_name': [ 'abc', 'def', 'ghi' ]})
input2 = pd.DataFrame({'subject': [ 'maths', 'english' ]})

One detail concerning "keeping logic in names": since you named the column in input1 in the singular (student_name), I followed suit and assumed the column name in input2 is also singular (subject).

Then the merge, for such small DataFrames, can be performed as:

pd.merge(input1.assign(foo=1), input2.assign(foo=1),
    on='foo')[['student_name', 'subject']]

(Your solution didn't include on='foo', but the default set of join columns is all common columns, so you can omit it.)
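
As a side note, if your pandas version is 1.2 or newer, a cross merge is built in, so the helper column is not needed at all:

    # requires pandas >= 1.2
    result = pd.merge(input1, input2, how='cross')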

At least on my computer and for such limited data, the above command works.

But the problem may well persist if you try to merge real DataFrames containing significantly more rows.

Remember that the number of rows in the result is the product of the row counts of the two source DataFrames, so it grows very quickly.

If you still get the same error with real data, try experimenting with a smaller number of records in both source DataFrames.
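
If you really do need the full product, one possible workaround is to build it piece by piece and append each piece to disk instead of holding the whole result in memory. Below is only a rough sketch; the chunk size and output file name are placeholders:

    chunk_size = 100_000                  # placeholder; tune to available memory
    input2_foo = input2.assign(foo=1)

    for start in range(0, len(input1), chunk_size):
        chunk = input1.iloc[start:start + chunk_size].assign(foo=1)
        part = chunk.merge(input2_foo, on='foo').drop(columns='foo')
        # append each partial cross join to a CSV instead of keeping it in RAM
        part.to_csv('cross_join.csv', mode='a', header=(start == 0), index=False)

Each pass materializes only chunk_size * len(input2) rows at a time, so memory usage stays bounded.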

Valdi_Bo
  • Code is just a sample to explain the problem. Sorry for not sharing the data volume. I tried with on='foo' as well and I am getting the same MemoryError. The memory issue is real, as the data volume is high; the data is not limited to the 3 and 2 records shown in the example. I am trying to join dataframe1 (1000 x 25) with dataframe2 (10000000 x 25) to create dataframe3 (10000000000 x 50). The code works fine with a smaller number of records. – Raghu Mar 27 '19 at 21:37
  • Another concept: leave the join to the database. But there is still the same risk that the output will be too big to fit in memory. Or maybe your project assumptions are wrong (they failed to take into account the size of the DataFrame)? Are you sure you really need the Cartesian product of such big tables? – Valdi_Bo Mar 28 '19 at 07:50
  • Hi Valdi, initially we tried on the database side and it had memory issues, so we are trying on a GPU server with Python. The requirement is real; we added some conditions and reduced the output from 10 billion to 2.8 billion rows, but we are still having memory challenges. – Raghu Mar 28 '19 at 18:12