
I'm using pymysql to query a MySQL table in a database on a remote Ubuntu server. The code was working fine, but recently I've started getting this error:

(1114, "The table '/tmp/#sql1422a_19_46' is full")

I've checked the disk space on the Ubuntu server (shown below). I have plenty of storage available on three of my mounted drives, though a few of the other filesystems may have filled up. I have cron jobs that run daily and update other tables in the MySQL database, and they are still updating fine. Does anyone see what the issue might be and how to fix it? It shouldn't be that large a table.

code: df -h

output:

Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G  1.8M  3.2G   1% /run
/dev/nvme0n1p2  228G  215G  1.2G 100% /
tmpfs            16G   12K   16G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1  511M  6.7M  505M   2% /boot/efi
/dev/sda1       458G   73M  435G   1% /mnt/data/storage3_500gb
/dev/sdb1       916G   77M  870G   1% /mnt/data/storage1_1tb
/dev/nvme1n1    916G   77M  870G   1% /mnt/data/nvme0n1
/dev/sdc1       916G   77M  870G   1% /mnt/data/storage2_1tb
/dev/sdd1       916G  109G  761G  13% /mnt/data/sda
tmpfs           3.2G     0  3.2G   0% /run/user/1000
/dev/loop1      100M  100M     0 100% /snap/core/11420
/dev/loop0      100M  100M     0 100% /snap/core/11606

code:

import datetime
import decimal as dc
import glob
import json
import logging
import os
import re
import time
import urllib.parse

import dateutil.parser
import numpy as np
import pandas as pd
import pymysql
import requests
from bs4 import BeautifulSoup as bs  # BeautifulSoup provides a model for the source HTML
from sqlalchemy import create_engine


# function to query mysql db and return dataframe of results
def mysql_query(user, password, database, host, query):

    connection = pymysql.connect(user=user, password=password, database=database, host=host)

    try:
        df = pd.read_sql(query, connection)
        logging.info('query succeeded: ' + query)
        return df

    except Exception as err:
        logging.error('query failed: ' + query + ' got error: ' + str(err))
        return None

    finally:
        connection.close()
        logging.info('close connection mysql')



# creating zillow_latest

zillow_latest_query="""with zillow_latest as
(
select
distinct 
zpid,
last_updated,
first_value(providerListingId) over (partition by zpid order by last_updated desc) as providerListingId,
first_value(imgSrc) over (partition by zpid order by last_updated desc) as imgSrc,
first_value(hasImage) over (partition by zpid order by last_updated desc) as hasImage,
first_value(detailUrl) over (partition by zpid order by last_updated desc) as detailUrl,
first_value(statusType) over (partition by zpid order by last_updated desc) as statusType,
first_value(statusText) over (partition by zpid order by last_updated desc) as statusText,
first_value(countryCurrency) over (partition by zpid order by last_updated desc) as countryCurrency,
first_value(price) over (partition by zpid order by last_updated desc) as price,
first_value(unformattedPrice) over (partition by zpid order by last_updated desc) as unformattedPrice,
first_value(address) over (partition by zpid order by last_updated desc) as address,
first_value(addressStreet) over (partition by zpid order by last_updated desc) as addressStreet,
first_value(addressCity) over (partition by zpid order by last_updated desc) as addressCity,
first_value(addressState) over (partition by zpid order by last_updated desc) as addressState,
first_value(addressZipcode) over (partition by zpid order by last_updated desc) as addressZipcode,
first_value(isUndisclosedAddress) over (partition by zpid order by last_updated desc) as isUndisclosedAddress,
first_value(beds) over (partition by zpid order by last_updated desc) as beds,
first_value(baths) over (partition by zpid order by last_updated desc) as baths,
first_value(area) over (partition by zpid order by last_updated desc) as area,
first_value(latLong) over (partition by zpid order by last_updated desc) as latLong,
first_value(isZillowOwned) over (partition by zpid order by last_updated desc) as isZillowOwned,
first_value(variableData) over (partition by zpid order by last_updated desc) as variableData,
first_value(badgeInfo) over (partition by zpid order by last_updated desc) as badgeInfo,
first_value(hdpData) over (partition by zpid order by last_updated desc) as hdpData,
first_value(isSaved) over (partition by zpid order by last_updated desc) as isSaved,
first_value(isUserClaimingOwner) over (partition by zpid order by last_updated desc) as isUserClaimingOwner,
first_value(isUserConfirmedClaim) over (partition by zpid order by last_updated desc) as isUserConfirmedClaim,
first_value(pgapt) over (partition by zpid order by last_updated desc) as pgapt,
first_value(sgapt) over (partition by zpid order by last_updated desc) as sgapt,
first_value(zestimate) over (partition by zpid order by last_updated desc) as zestimate,
first_value(shouldShowZestimateAsPrice) over (partition by zpid order by last_updated desc) as shouldShowZestimateAsPrice,
first_value(has3DModel) over (partition by zpid order by last_updated desc) as has3DModel,
first_value(hasVideo) over (partition by zpid order by last_updated desc) as hasVideo,
first_value(isHomeRec) over (partition by zpid order by last_updated desc) as isHomeRec,
first_value(hasAdditionalAttributions) over (partition by zpid order by last_updated desc) as hasAdditionalAttributions,
first_value(isFeaturedListing) over (partition by zpid order by last_updated desc) as isFeaturedListing,
first_value(list) over (partition by zpid order by last_updated desc) as list,
first_value(relaxed) over (partition by zpid order by last_updated desc) as relaxed,
first_value(hasOpenHouse) over (partition by zpid order by last_updated desc) as hasOpenHouse,
first_value(openHouseStartDate) over (partition by zpid order by last_updated desc) as openHouseStartDate,
first_value(openHouseEndDate) over (partition by zpid order by last_updated desc) as openHouseEndDate,
first_value(openHouseDescription) over (partition by zpid order by last_updated desc) as openHouseDescription,
first_value(builderName) over (partition by zpid order by last_updated desc) as builderName,
first_value(info3String) over (partition by zpid order by last_updated desc) as info3String,
first_value(brokerName) over (partition by zpid order by last_updated desc) as brokerName,
first_value(lotAreaString) over (partition by zpid order by last_updated desc) as lotAreaString,
first_value(streetViewMetadataURL) over (partition by zpid order by last_updated desc) as streetViewMetadataURL,
first_value(streetViewURL) over (partition by zpid order by last_updated desc) as streetViewURL,
first_value(info2String) over (partition by zpid order by last_updated desc) as info2String,
first_value(info6String) over (partition by zpid order by last_updated desc) as info6String
from realestate.zillow
),
distinct_values as(
select
distinct
zpid,
providerListingId,
imgSrc,
hasImage,
detailUrl,
statusType,
statusText,
countryCurrency,
price,
unformattedPrice,
address,
addressStreet,
addressCity,
addressState,
addressZipcode,
isUndisclosedAddress,
beds,
baths,
area,
latLong,
isZillowOwned,
variableData,
badgeInfo,
hdpData,
isSaved,
isUserClaimingOwner,
isUserConfirmedClaim,
pgapt,
sgapt,
zestimate,
shouldShowZestimateAsPrice,
has3DModel,
hasVideo,
isHomeRec,
hasAdditionalAttributions,
isFeaturedListing,
list,
relaxed,
hasOpenHouse,
openHouseStartDate,
openHouseEndDate,
openHouseDescription,
builderName,
info3String,
brokerName,
lotAreaString,
streetViewMetadataURL,
streetViewURL,
info2String,
info6String
from zillow_latest
)
select * from distinct_values"""

zillow_latest_df = mysql_query(user='username',
                               password='xxx',
                               database='realestate',
                               host='xxxxxxx',
                               query=zillow_latest_query)

error:

(1114, "The table '/tmp/#sql1422a_19_46' is full")
user3476463
  • I saw something which indicated there's a 16MB limit for temp tables. It wasn't an official MySQL page, so I'm not linking it. You might try running the query in this answer to determine the limit explicitly: https://stackoverflow.com/a/36832492/42346 – mechanical_meat Sep 15 '21 at 15:52
  • @mechanical_meat Temp tables _in RAM_ are limited by `tmp_table_size` or `max_heap_table_size`, whichever is lesser. The default for both in MySQL 8.0 is 16MB. But temp tables can be much larger than that, they just have to be persisted to disk. – Bill Karwin Sep 15 '21 at 16:22

1 Answer


Well, files on /tmp would use your root filesystem, since you don't have a separate filesystem mounted at /tmp.

Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  228G  215G  1.2G 100% /

It looks like it's nearly full to me. So a large temp table created by MySQL can fill it up.

Queries may create temp tables even if you didn't request one explicitly. For example, some queries that use GROUP BY or UNION, subqueries that produce derived tables, common table expressions, and some types of views can create temp tables. See https://dev.mysql.com/doc/refman/8.0/en/internal-temporary-tables.html for more details on this.
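As a quick check (a sketch, not part of the original answer), you can run `EXPLAIN` on the query and look for `Using temporary` in the `Extra` column of the plan. The helper below is hypothetical; the commented usage assumes a pymysql connection like the one in the question:

```python
def uses_temp_table(explain_rows):
    """True if any EXPLAIN row's Extra column mentions an internal temp table."""
    # Extra is the last column of each EXPLAIN row and may be None
    return any('Using temporary' in (row[-1] or '') for row in explain_rows)

# usage sketch (hypothetical credentials, requires pymysql):
# import pymysql
# conn = pymysql.connect(user='username', password='xxx',
#                        database='realestate', host='xxxxxxx')
# with conn.cursor() as cursor:
#     cursor.execute('EXPLAIN ' + zillow_latest_query)
#     print(uses_temp_table(cursor.fetchall()))
```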

In your case, you are using a common table expression, so it's bound to create a temp table.

Temp tables can certainly be larger than 16MB (despite the comment above). The size of a temp table on disk is driven by the number of rows and the size of each row.
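A back-of-the-envelope estimate (a sketch; the row count and bytes-per-row figures below are illustrative assumptions, not measurements of this table):

```python
def estimate_temp_table_bytes(row_count, avg_row_bytes):
    """Rough on-disk footprint of an internal temp table: rows x bytes per row."""
    return row_count * avg_row_bytes

# e.g. 5 million rows with ~50 mostly-text columns at ~1 KiB per row
# is about 5 GB -- far more than the 1.2G free on / shown above
size = estimate_temp_table_bytes(5_000_000, 1_024)
print(size)                # 5120000000
print(size / 1024 ** 3)    # roughly 4.8 GiB
```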

On a consulting contract some years ago, the client kept getting warnings that their filesystem was out of space, but whenever they went to look, it had 6 gigabytes free. I helped them analyze their MySQL query log and showed that they occasionally had at least 4 concurrent queries, each using 1.5GB of temp table space. Temp tables disappear as soon as the query is done, so by the time they checked the server, those queries were likely no longer running.

How to fix this?

There are several remedies. You might need to use more than one of these:

  • Reduce the number of rows examined by optimizing searches with indexes.
  • Reduce the total number of rows in your table, so there are fewer for the query to examine even if you can't optimize with indexes.
  • Configure MySQL's tmpdir option to another filesystem that has more space.
  • Clean up some unneeded data on the filesystem referenced by tmpdir.
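For the tmpdir option, a minimal sketch of the change in /etc/mysql/my.cnf (the path below is only an example; the directory must already exist and be writable by the mysql user, and on Ubuntu, AppArmor may also need to allow MySQL to use the new path):

```ini
[mysqld]
# example path on one of the larger mounted drives
tmpdir = /mnt/data/storage1_1tb/mysql_tmp
```

After editing, restart the server with `sudo systemctl restart mysql` and confirm the setting took effect with `SHOW VARIABLES LIKE 'tmpdir';`.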
Bill Karwin
  • Thank you for getting back to me on this. I noticed that if I filter realestate.zillow in the inner query for just the last 7 days, it completes in MySQL, but when I run the same query as part of the Python pipeline, it only completes if I filter for 1 day. Does that mean I have two issues: running out of space in MySQL, and running out of space in Python? If so, can you suggest how to fix both? Also, I tried editing my /etc/mysql/my.cnf to add tmpdir = /mnt/data/sda/user_storage/username/mysql_tmp/ but it threw errors on restart. – user3476463 Sep 16 '21 at 14:24
  • I can't guess what errors you got. Python doesn't store temp tables on disk like MySQL does. If Python receives a result set, it stores it in RAM. – Bill Karwin Sep 16 '21 at 14:54