1

I have a CSV file with ";" as separator made like this:

col1;col2;col3;col4 
4;hello;world;1;1
4;hi;1;1
4;hi;1;1
4;hi;1;1

Obviously by using ";" as sep it gives me the error about tokenizing data (obviously from the header less columns are expected), how can i obtain a dataframe like this:

col1 col2        col3 col4 
4    hello;world 1    1
4    hi          1    1
4    hi          1    1
4    hi          1    1

It could be read even with others packages and other data type (even if i prefer pandas because of the following operations in the code)

Duccio Borchi
  • 209
  • 4
  • 13
  • 1
    How do you know that the second column should be `hello;world` and not that the third column should be `world;1`? – Ignatius Reilly Sep 28 '22 at 15:12
  • This must help: https://stackoverflow.com/questions/4617935/is-there-a-way-to-include-commas-in-csv-columns-without-breaking-the-formatting – Imperial_J Sep 28 '22 at 15:12
  • 1
    This is simply a bad CSV file. Even a human can't decide what the values should be. If you need to include the separator in a field you *must* quote that field's value: `"hello;world"` – Panagiotis Kanavos Sep 28 '22 at 15:13
  • @PanagiotisKanavos unluckily I am not the CSV creator, obviously I would have used another separator or the quotations marks... – Duccio Borchi Sep 28 '22 at 15:17
  • @IgnatiusReilly in my file I just know it based on the context, as i'm aware that the 2 column is a text – Duccio Borchi Sep 28 '22 at 15:18
  • Is there an automatic way to quote ("") all the second column that I don't know about? – Duccio Borchi Sep 28 '22 at 15:23
  • Are there rules like "1) non-separator semicolons always appear in between not number characters; 2) two consecutive columns can never have non-separator semicolons"? – Ignatius Reilly Sep 28 '22 at 15:24
  • Are non-separator semicolons always in the second column? – Ignatius Reilly Sep 28 '22 at 15:24
  • Yes they are always in the second column – Duccio Borchi Sep 28 '22 at 15:25
  • And how do you know where the second column ends? You *don't*. What if it contains 3 separated values instead of 2? Are you *sure* that only the 2nd column will have a variable number of fields? If that was the case you could assume that the 1st and 3rd-end values go to individual columns, and the remainder to the 2nd. *Maybe* you could read the text as 2 columns and split the "second" based on that logic. If the assumption is wrong though, you'll end up with garbage – Panagiotis Kanavos Sep 29 '22 at 08:38

2 Answers2

2

You could split off the outer cols until you are left with the remaining col2. This could be done in Pandas as follows:

import pandas as pd

df_raw = pd.read_csv("input.csv", delimiter=None, header=None, skiprows=1)

df_raw[['col1', 'rest']] = df_raw[0].str.split(";", n=1, expand=True)
df_raw[['col2', 'col3', 'col4']] = df_raw.rest.str.rsplit(";", n=2, expand=True)
df =  df_raw[['col1', 'col2', 'col3', 'col4']]

print(df)

Giving df as:

  col1         col2 col3 col4
0    4  hello;world    1    1
1    4           hi    1    1
2    4           hi    1    1
3    4           hi    1    1
  1. First read in the CSV file without using any delimiters to get a single column.

  2. Use a .str.split() with n=1 to split out just col1 using the ; delimiter from the left.

  3. Take the remaining rest and apply .str.rsplit() with n=2 to do a reverse split using the ; delimiter to get the remaining columns. This allows col2 to have any have any characters.

This assume that only col2 can have additional ; separators and the last two are fixed.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97
1
import re

pattern = r"(?<=;)(?P<second_column>[\w]+;[\w]+)(?=;)"

with open("bad_csv.csv") as file:
    text = file.readlines()

for i in range(len(text)):
    if text[i].count(';') == 4:
        text[i] = re.sub(pattern, '"' + r"\g<second_column>" + '"', text[i], 1)

with open("good_csv.csv", "w") as file:
    for row in text:
        file.writelines(row)


df = pd.read_csv("good_csv.csv", sep=';')
Ignatius Reilly
  • 1,594
  • 2
  • 6
  • 15