1

I am not the sharpest tool in the shed, so please bear with me. I have successfully imported a CSV into SQL Server 2016. Unfortunately, the CSV had pipe characters all through it that I couldn't get rid of, and now I need your help. I must either find a way to import the CSV while omitting the pipes, or find a way to remove the pipes that are already in my SQL table.

Here is what the website I got the CSV from had to say about these pipe characters:

The major Open Data tables are provided in a non-standard format that allows dirty data to be imported as we are provided some raw data fields that can contain formatting and other unprintable characters that choke many data systems. In this bulk data, text fields are surrounded by the pipe character (ascii 124). Date and numeric fields are not. Commas separate all fields.

Here is the query I used to import the CSV:

USE [open secrets]

CREATE TABLE cands16 (
    [Cycle] [char](6) NOT NULL,
    [FECCandID] [char] (11) NOT NULL,
    [CID] [char] (11) NULL,
    [FirstLastP] [varchar] (52) NULL,
    [Party] [char] (30) NULL,
    [DistIDRunFor] [varchar] (600) NULL,
    [DistIDCurr] [varchar] (600) NULL,
    [CurrCand] [char] (30) NULL,
    [CycleCand] [char] (30) NULL,
    [CRPICO] [char] (30) NULL,
    [RecipCode] [char] (5) NULL,
    [NoPacs] [varchar] (200) NULL
) ON [PRIMARY]


BULK
INSERT cands16
FROM 'C:\aaa open secrets\CampaignFin16\Cands16.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO

Here is a baby version of my CSV:

|2016|,|H4GA02060|,|N00035294|,|Greg Duke (R)|,|R|,|GA02|,|    |,|Y|,|Y|,|C|,|RC|,| |
|2016|,|H4GA02078|,|N00036257|,|Vivian Childs (R)|,|R|,|GA02|,|    |,| |,| |,| |,|RN|,| |
|2016|,|H4GA04116|,|N00035798|,|Thomas Brown (D)|,|D|,|GA04|,|    |,| |,| |,| |,|DN|,| |
|2016|,|H4GA04124|,|N00035862|,|Thomas Wight (D)|,|D|,|GA07|,|    |,| |,| |,| |,|DN|,| |
|2016|,|H4GA06087|,|N00026160|,|Tom Price (R)|,|R|,|GA06|,|GA06|,|Y|,|Y|,|I|,|RW|,| |
|2016|,|H4GA08067|,|N00026163|,|Lynn A Westmoreland (R)|,|R|,|GA03|,|GA03|,| |,|Y|,|I|,|RI|,| |
|2016|,|H4GA09065|,|N00036258|,|Bernard Fontaine (R)|,|R|,|GA09|,|    |,| |,|Y|,|C|,|RL|,| |
|2016|,|H4GA10071|,|N00035370|,|Mike Collins (R)|,|R|,|GA10|,|    |,| |,| |,| |,|RN|,| |
|2016|,|H4GA11046|,|N00035321|,|Susan Davis (R)|,|R|,|GA11|,|    |,| |,| |,| |,|RN|,| |
|2016|,|H4GA11053|,|N00002526|,|Bob Barr (R)|,|R|,|GA11|,|    |,| |,| |,| |,|RN|,| |
|2016|,|H4GA11061|,|N00035347|,|Barry Loudermilk (R)|,|R|,|GA11|,|GA11|,|Y|,|Y|,|I|,|RW|,| |

Here is a picture of what my table looks like:

[screenshot of my table]

ana ng

5 Answers

2

Since you apparently have access to Python, I recommend using Python Pandas. The problem with your CSV file is that the "pipes" are used as quote characters. You can easily strip them off:

import pandas as pd

# Read the file treating '|' as the quote character, then write it back out
# without the pipes. index=False keeps pandas from adding an extra index column.
pd.read_csv('baby.csv', quotechar='|', header=None) \
    .to_csv('adult.csv', header=False, index=False)
DYZ
  • I was mistaken to put the python tag in. I don't have any experience with Python. However, I do have Python. Your instructions look pretty simple. I will give it a go. – ana ng Mar 04 '17 at 04:51
  • Where can I get the pandas module? Also, do I need to put the file path in? Or is the name of my CSV and its extension enough? – ana ng Mar 04 '17 at 04:56
  • Did you test this? There is a comma separator and the cells are surrounded by pipes. – tdelaney Mar 04 '17 at 05:03
  • @tdelaney Yes I did. – DYZ Mar 04 '17 at 06:08
  • I wouldn't call that CSV file a problem. CSV files optionally use a "text qualifier" so that text fields are not prohibited from including the delimiter character. The writer happened to choose | instead of the more common ". As you show, CSV libraries know all about this (a short illustration follows below). – Tom Blodget Mar 04 '17 at 14:57
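For example, Python's standard csv module accepts the same qualifier; a minimal sketch (the sample row is made up):

import csv

# The pipe acts as the text qualifier, so a comma inside a pipe-quoted
# field stays part of that field instead of splitting it.
row = next(csv.reader(['|2016|,|Duke, Greg (R)|,|GA02|'], quotechar='|'))
print(row)  # ['2016', 'Duke, Greg (R)', 'GA02']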
2

You have to add the FIELDQUOTE option to your BULK INSERT statement.

So your code would look like this:

BULK INSERT cands16
FROM 'C:\aaa open secrets\CampaignFin16\Cands16.txt'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIELDQUOTE = '|'
)
GO

Please note that the FIELDQUOTE option is only available in SQL Server 2017 and later.

As a workaround, you can also strip the pipes after the import (run this once for each column that contains them):

UPDATE my_table
SET piped_col = REPLACE(piped_col, '|', '')

Here is more information:

BULK INSERT documentation

morinx
0

You can use the find and replace tool in Microsoft Excel to change every occurrence of "|" to "". This would quickly scrub the CSV.

Shaun-Adam
  • Wouldn't that require converting the CSV to Excel? I don't know how to do that. – ana ng Mar 04 '17 at 04:52
  • 1
    Excel knows how to import CSV files. Just open it. – DYZ Mar 04 '17 at 04:53
  • I attempted your solution. For the most part it worked extremely well. However, I ended up getting an unwanted " symbol at the beginning of my Cycle values and at the end of my NoPacs values. You can see what it looks like in this link http://prntscr.com/eftb2t – ana ng Mar 04 '17 at 05:14
0

This Python script will scrub the .txt file and write a proper CSV file that can then be imported.

import csv
import re

# Match an optional leading pipe, capture the text in between (lazily), and
# allow an optional trailing pipe; group(1) is the field without its pipe "quotes".
_pipe_stripper = re.compile(r"\|?(.*?)\|?$").match

def pipe_stripper(col):
    return _pipe_stripper(col).group(1)

with open("Cands16.txt", newline="") as infile:
    with open("Cands16.csv", "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        # Strip the pipes from every column of every row and write the result.
        writer.writerows(list(map(pipe_stripper, row)) for row in reader)

DEBUG

Here is the same script except it prints each row before processing. It should raise the same error but the guilty text will be the last line displayed.

import csv
import re

_pipe_stripper = re.compile(r"\|?(.*?)\|?$").match

def pipe_stripper(col):
    return _pipe_stripper(col).group(1)

with open("Cands16.txt", newline="") as infile:
    with open("Cands16.csv", "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for row in reader:
            print(row)
            writer.writerow(list(map(pipe_stripper, row)))
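If the error turns out to be the regular expression failing to match some field (so _pipe_stripper returns None and .group(1) raises AttributeError), a defensive variant of pipe_stripper could fall back to the raw column value; a minimal sketch:

def pipe_stripper_safe(col):
    # Fall back to the original text when the regex does not match,
    # instead of raising AttributeError on None.
    m = _pipe_stripper(col)
    return m.group(1) if m else col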
tdelaney
  • I attempted your solution. In this link you can see the results: http://prntscr.com/eftagh – ana ng Mar 04 '17 at 05:10
  • I can't figure out anything that regular expression wouldn't match. Can you post the bad line? You could see how many lines got into the output file and then grab the next one from the input file. – tdelaney Mar 04 '17 at 05:37
  • What happened was that the regular expression didn't match one of the columns, so it returned None, but the code still tried to call `.group(1)` on it. I could "fix" the problem by checking for None and returning a space. But since I can't think of what could break that regex, I'd like to see the text that went wrong. – tdelaney Mar 04 '17 at 05:41
  • Is there a way I can show you where the text went wrong? – ana ng Mar 04 '17 at 05:44
  • I used your test code and got a huge wall of text. It was so big that Python couldn't hold it all in. Here is a link to the picture if that is helpful: http://prntscr.com/eftj86 – ana ng Mar 04 '17 at 05:54
  • That is strange. The CSV reader isn't splitting on `\r\n`, but that's the normal CSV line ending and should not be a problem. I assume you are using Python 3.x. Which operating system? I really can't explain why the line detection didn't work. – tdelaney Mar 04 '17 at 06:56
  • OS is Windows 10. Python is 3.4 – ana ng Mar 04 '17 at 07:40
0

Never thought I'd see the day when I actually managed to answer a question, let alone my own.

I decided to open up my CSV in Notepad and use find and replace to get rid of all the pipes before attempting to import the CSV once more. It worked out splendidly.
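The same blanket find-and-replace can be scripted when the file is too big to edit comfortably by hand; a minimal Python sketch (assuming the whole file fits in memory, with placeholder file names):

# Read the raw export, strip every pipe character, and write a cleaned copy.
# Note this removes ALL pipes, including any that might appear inside a data value.
with open("Cands16.txt") as infile:
    cleaned = infile.read().replace("|", "")

with open("Cands16_clean.txt", "w") as outfile:
    outfile.write(cleaned)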

ana ng