
I am a beginner in Python and I am facing a problem. I would like the user to select a CSV file to be read in. If the program cannot locate the file or handle the condition, it should fall back to an error message.

I have successfully implemented this solution for small files (< 50,000 rows), but when the selected file is larger (e.g. > 50,000 rows), the program freezes.

The following are some characteristics to consider:

  1. My computer has 8GB of RAM.
  2. The selected file was only 200k+ rows, which is not considered "Big Data."

The following is my attempt at an implementation:

def File_DATALOG():
    global df_LOG
    try:
        dataloggerfile = tk.filedialog.askopenfilename(parent=root,
                                                       title='Choose Logger File',
                                                       filetype=(("csv files", "*.csv"),
                                                                 ("All Files", "*.*")))

        if len(dataloggerfile) == 0:
            return None

        lb.insert(tk.END, dataloggerfile)
        if dataloggerfile[-4:] == ".csv":
            df_LOG = pd.DataFrame(pd.read_csv(dataloggerfile))
            if 'Unnamed: 1' in df_LOG.columns:
                df_LOG = pd.DataFrame(pd.read_csv(dataloggerfile, skiprows=5, low_memory=False))
        else:
            df_LOG = pd.DataFrame(pd.read_excel(dataloggerfile, skiprows=5))

        df_LOG.rename(columns={'Date/Time': 'DateTime'}, inplace=True)
        df_LOG.drop_duplicates(subset=None, keep=False, inplace=True)
        df_LOG['DateTime'] = df_LOG['DateTime'].apply(lambda x: insert_space(x, 19))
        df_LOG['DateTime'] = pd.to_datetime(df_LOG['DateTime'], dayfirst=False, errors='coerce')
        df_LOG.sort_values('DateTime', inplace=True)
        df_LOG = df_LOG[~df_LOG.DateTime.duplicated(keep='first')]
        df_LOG = df_LOG.set_index('DateTime').resample('1S').pad()
        print(df_LOG)
        columnsDict['Logger'] = df_LOG.columns.to_list()

    except Exception as ex:
        tk.messagebox.showerror(title="Title", message=ex)
        return None
Zakariah Siyaji
  • `read_csv()` and `read_excel()` already return a `DataFrame`, so you don't have to wrap them in `pd.DataFrame(...)`; that converts a `DataFrame` to the same `DataFrame` but needs extra memory for the new one. You only need `df_LOG = pd.read_csv(...)` and `df_LOG = pd.read_excel(...)` – furas Apr 27 '21 at 03:47
  • you could use `print()` to display messages between commands, to see which command freezes it. – furas Apr 27 '21 at 03:50
  • in `if 'Unnamed: 1' in df_LOG.columns:` you read the same file again. You should use the dataframe you already have in memory: drop the first 5 rows and promote the next row to the column names. – furas Apr 27 '21 at 03:53
  • [pandas has a tool for that](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html) Also, as @furas said, you are creating a duplicate dataframe, using twice the memory. – Him Apr 27 '21 at 04:44
  • I have two types of files to read; both have the same file name and differ only in the first 5 rows. So I used 'Unnamed: 1' to detect the type of file and remove those rows. – yong whang Apr 28 '21 at 00:05
  • I removed the `pd.DataFrame` but it's still the same. The error message can't display, because choosing the file freezes my computer, even when I use `print()` between commands. – yong whang Apr 28 '21 at 00:10
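The single-read approach furas describes in the comments, fixing the header in memory instead of calling `read_csv` twice, could be sketched like this (`load_logger_csv` is a hypothetical helper name; the `'Unnamed: 1'` check mirrors the question's code):

```python
import pandas as pd

def load_logger_csv(path):
    """Read the CSV once. If an 'Unnamed: 1' placeholder column shows the
    real header is 5 rows down, repair it in memory instead of re-reading
    the file from disk."""
    df = pd.read_csv(path, low_memory=False)
    if 'Unnamed: 1' in df.columns:
        # File row 1 became the (wrong) header, so the real header,
        # file row 6, now sits at positional index 4 of the data.
        df.columns = df.iloc[4]
        df = df.iloc[5:].reset_index(drop=True)
    return df
```

This is equivalent to the question's `pd.read_csv(dataloggerfile, skiprows=5)` branch, but the file is only parsed once.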

1 Answer


You are trying to load the whole file at once, but as you say, your memory isn't enough for that.

So you need to process the data little by little.

Here is a good and simple example: https://stackoverflow.com/a/43286094/7285863
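A minimal sketch of that chunked approach, using the `chunksize` parameter of `pd.read_csv` (`count_rows_in_chunks` is an illustrative helper; real code would do its per-chunk work inside the loop):

```python
import pandas as pd

def count_rows_in_chunks(path, chunksize=10_000):
    """Process a CSV chunk by chunk instead of loading it whole.
    Here we only count rows; replace the loop body with the actual
    per-chunk processing."""
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += len(chunk)  # each `chunk` is a small DataFrame
    return total
```

With `chunksize` set, `read_csv` returns an iterator of `DataFrame`s, so only one chunk is in memory at a time.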

ishak O.