
Question: We are encountering the following error when loading a data file - one that has a two-character delimiter - into an Azure SQL Db. What might we be doing wrong, and how can the issue be resolved?

Using a Python notebook in Azure Databricks, we are trying to load a data file into Azure SQL Db. The delimiter in the data file is two characters, `~*`. With the following code we get the error shown below:

pandas dataframe low memory not supported with the 'python' engine

Code:

import sqlalchemy as sq
import pandas as pd

data_df = pd.read_csv('/dbfs/FileStore/tables/MyDataFile.txt', sep='~*', engine='python', low_memory=False, quotechar='"', header='infer' , encoding='cp1252')
.............
.............

Remarks: If we remove the low_memory option, we get the following error instead. With other data files that are even larger than this one but use a single-character delimiter, we don't get this error.

ConnectException: Connection refused (Connection refused) Error while obtaining a new communication channel ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.

nam

2 Answers


From the documentation of Pandas.read_csv():

In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.

Since the separator is being interpreted as a regular expression, and `*` has special meaning in regex, you need to escape it. Use `sep=r'~\*'`.
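For illustration, here is a minimal sketch of the escaped separator in action, using a small in-memory sample in place of the question's actual file (the sample data is hypothetical):

```python
import io
import pandas as pd

# A tiny in-memory sample standing in for MyDataFile.txt (hypothetical data)
sample = 'id~*name\n1~*alice\n2~*bob\n'

# Escape the '*' so the two-character separator matches the literal '~*';
# the r'' raw-string prefix keeps the backslash intact in the pattern.
df = pd.read_csv(io.StringIO(sample), sep=r'~\*', engine='python', quotechar='"')

print(list(df.columns))  # ['id', 'name']
print(len(df))           # 2
```

The same `sep=r'~\*'` argument drops into the original `read_csv` call unchanged.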

Barmar
  • Your suggestion worked (thank you). To the point and well explained response (my upvote). I assume `r` is a typo in `sep=r'~\*'`. If not please let me know what that's for. Also, for the benefit of other users, you may want to provide a link to the document you've referred. – nam Feb 24 '22 at 15:17
  • No, it's not a typo. That's a raw string literal, needed because of the backslash in the regexp. See https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it – Barmar Feb 24 '22 at 15:18
  • I have a similar question posted [here](https://stackoverflow.com/q/72209325/1232087) in case you have time for a suggestion. – nam May 12 '22 at 02:02

Probably your file is too large and the dataframe does not fit in memory. Can you try to split the processing up? I.e. read 1000 lines, make a dataframe from that, push it to SQL, then read the next 1000 lines, etc.?

nrows and skiprows passed to read_csv can be used for this.
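A simpler route to the same chunked processing is `read_csv`'s `chunksize` parameter, which yields one DataFrame per chunk. A hedged sketch, again with a small in-memory sample standing in for the real file, and with the SQL push left as a commented placeholder:

```python
import io
import pandas as pd

sample = 'id~*name\n1~*a\n2~*b\n3~*c\n'  # stand-in for the large file

# Read the file in fixed-size chunks instead of all at once; each chunk
# is a small DataFrame that can be pushed to SQL before the next is read.
reader = pd.read_csv(io.StringIO(sample), sep=r'~\*', engine='python',
                     chunksize=2)
for chunk in reader:
    # chunk.to_sql('MyTable', conn, if_exists='append')  # hypothetical target
    print(len(chunk))  # prints 2, then 1
```

This keeps peak memory bounded by the chunk size rather than the full file.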

Maybe a workaround: preprocess the file with `sed 's/~\*/;/g'` to replace the two-character delimiter with `;`, then you can use the C engine with its lower memory footprint.

Michael Kopp