pandas: read a .csv file without a suitable delimiter (only separate the first column from the "rest")

Question

I am trying to import a .csv file into pandas as follows:

dataframe = pd.read_csv(inputfile, sep=delimiter, header=None)

However, each line of the (huge) inputfile consists of an integer, followed by some string. Like this:

1234 this string % might; contain 눈 anything

The result should be a two-column DataFrame with the integer in the first column and the rest of the line in the second.

Since any character can occur in the string, I cannot use a single character as a separator. Using a long, highly unlikely string such as "khlKiwVlZdsb9oVKq5yG" as the delimiter has three problems: it feels like a dirty workaround, it may not be 100% reliable, and it causes the following "error/inconvenience":

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.

So my question is: is there a better way to deal with this problem? Maybe some option that tells pandas to ignore any further delimiters once the first one in a line has been encountered?

Thank you for any suggestions!

Take a look here: https://stackoverflow.com/questions/15026698/how-to-make-separator-in-read-csv-more-flexible-wrt-whitespace — W Stokvis, Apr 27 '18 at 18:48

Guybrush · Accepted Answer · 2018-04-29T06:29:07.910

Basically, your .csv is not a csv ;-)

I suggest that you open and read the file manually, split each line at its first whitespace character, and then convert the result into a DataFrame if needed.

fp = ...  # your file pointer
# Split each line at its first space only; strip the trailing newline first.
data = [line.rstrip('\n').split(' ', maxsplit=1) for line in fp]

If you have a lot of data in your file, consider using a generator expression instead.
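
As a minimal sketch of the generator-expression variant: the only change is round brackets instead of square ones, so lines are split lazily as they are consumed. The `io.StringIO` object and its second sample line are made up for illustration and stand in for the real file.

```python
import io

# io.StringIO stands in for the real (huge) input file; the sample lines are illustrative.
fp = io.StringIO("1234 this string % might; contain 눈 anything\n42 second; line\n")

# Generator expression: no line is read or split until `pairs` is iterated.
pairs = (line.rstrip("\n").split(" ", maxsplit=1) for line in fp)

first = next(pairs)      # reads and splits only the first line
remaining = list(pairs)  # consumes the rest of the file
```

Note that a generator can be iterated only once, and that building a DataFrame from it will still hold all rows in memory in the end.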

In both cases, you can convert data to a DataFrame:

pandas.DataFrame.from_records(data, columns=['Integer', 'String'])

(... or use the DataFrame constructor directly.)
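
Putting the pieces together, an end-to-end version might look like the sketch below. The helper name `read_int_and_rest` and the in-memory sample are hypothetical, and the code assumes every line contains at least one space after the integer; a line without one would yield a single field and break the two-column layout.

```python
import io

import pandas as pd


def read_int_and_rest(fp):
    """Build a two-column DataFrame from lines shaped like '<int> <rest>'.

    Hypothetical helper; assumes each line has at least one space.
    """
    # Split at the first space only; strip the trailing newline first.
    data = [line.rstrip("\n").split(" ", maxsplit=1) for line in fp]
    df = pd.DataFrame.from_records(data, columns=["Integer", "String"])
    # split() returns strings, so convert the first column explicitly.
    df["Integer"] = df["Integer"].astype(int)
    return df


# io.StringIO stands in for the real file; the second line is made up.
sample = io.StringIO("1234 this string % might; contain 눈 anything\n42 second; line\n")
df = read_int_and_rest(sample)
```

For a real file, the same function could be called inside `with open(inputfile, encoding="utf-8") as fp:`, where the encoding is an assumption about the data.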

edited Apr 29 '18 at 06:29

answered Apr 27 '18 at 18:48


I believe this is exactly what I was looking for. Thank you very much! line.split(' ', maxsplit=1) has helped me a lot :) – pcsso Apr 28 '18 at 20:45
You're welcome! Could you please mark the question as solved? – Guybrush Apr 29 '18 at 06:27
