6

I have a .csv file which my users have to download, input some data and upload to my site.

Is there a better way of ensuring the data gets uploaded successfully based on my snippet below? What else should I be checking for? Would using a dialect be better?

def import(resident_file):

    try:
        file = resident_file.file.path
        reader = csv.reader(open(file, 'rU'), delimiter=',', quotechar='"')
        headerline = reader.next()

        for row in reader:
            try:
                # do stuff

            except Exception, e:
                print e

    except Exception, e:
        print e

An example of a problem I am running into is that when a user opens the file, inputs data and saves it, the delimiters change from , to ;. How can I cover the various types of delimiters that the document could be saved in due to it being open in different programmes e.g excel in windows, excel in mac, open office in mac, open office in linux etc

Another example of a problem is when the user tries to copy and paste the data into the template provided, all hell breaks loose.

UPDATE I'm using the Sniffer class now as mentioned in one of the answers below but its still not fool proof.

UPDATED CODE SNIPPET

def bulk_import_residents(condo, resident_file):

    """
    COL 1       COL 2       COL 3           COL 4           COL 5        
    first_name  last_name   contact_number  unit_number     block_number

    """

    file_path = resident_file.file.path
    csvfile = open(file_path, 'rb')
    dialect =  csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    headerline = reader.next()

    for row in reader:
        try:
            data = ResidentImportData()
            data.condo = condo
            data.file = resident_file
            data.first_name = row[0]
            data.last_name = row[1] 
            data.contact_number = row[2]
            data.unit_number = row[3]
            data.block_number = row[4]
            data.save()
        except Exception, e:
            print '{0}'.format(e)
            raise Http404('Wrong template format')
super9
  • 29,181
  • 39
  • 119
  • 172
  • 1
    If your users are using a tool to edit the csv, then it would be easier to stick to the conventions used by that tool. – jcollado Jan 19 '12 at 16:47
  • 2
    Just to make sure: have you considered using a cloud-based spreadsheet editor (like Google Docs, for example) to allow your users to make their changes directly, rather than having to download, edit, and upload? – David Z Feb 05 '12 at 08:46
  • No I haven't but that's a fantastic suggestion. – super9 Feb 05 '12 at 09:00
  • 2
    Nag: That's not a 404 (Page Not Found). I'd say it's more like 400 (Bad Request). – nfirvine Feb 07 '12 at 17:08
  • If you want someone to actually address the copy-paste problem, you'll need to describe "all hell breaks loose". :D – nfirvine Feb 07 '12 at 17:11
  • Please post examples of inputs where `Sniffer` doesn't work. If you're using common spreadsheet software and it's not detecting the dialect properly, it's possibly a bug. – nfirvine Feb 07 '12 at 17:16
  • @super9: does my solution work for your problem ? – ptitpoulpe Feb 09 '12 at 16:23

9 Answers9

2

CSV is a non-format. The Sniffer class is not foolproof because it's actually impossible to 100% reliably detect all given dialects.

I think you're going to have to use Sniffer for the 90% of the time it works, and capture invalid input files, analyze them, and extend Sniffer to catch them.

nfirvine
  • 1,479
  • 12
  • 23
2

I completely agree with nfirvine (CSV IS A NON FORMAT) - okay, not that harsh. But it is a minimal format. Its very loose. Expect things to break frequently if you use CSV, as it sounds like you are already experiencing this.

I also agree with Mike Bynum - use something like XML.

But I understand that even if there is a better way, there is often the pragmatic way. Maybe you gotta stick with your format a plethora of reasons...so: two routes.

Route 1: CSV

I've done (am doing) this route now. My users update data on a daily basis (couple thousand records). Given the frequency and # of records updated, I really wish I had gone the second route: when dealing with a significant amount of data or updates, solid data validation is a huge time saver.

That said. When you are stuck with CSV. I suggest you do the following:

  • Provide your users with a good/common definition of CSV, namely RFC 4180. Make sure your customer understands what you expect their file to contain:
    • A header line.
    • Commas as separation
    • Quotes around any data that contains commas.
  • Along with that definition, give your users a sample of the CSV (which it sounds like you did, good!). Explain that you can't process a CSV file that doesn't conform to your data definition.
  • Make sure the text file type is what you expect it to be before you import it - see convert to/from Unix/Windows.
  • Within your CSV parser you need to adopt the fail fast methodology, and make sure you have a mechanism to notify your users when the CSV file doesn't conform to the standard you expect. Give them as much information as possible (provide the exception details...if not for them, at least for you).
  • This problem you are having with one customer's files suggests that you might want to give your customers some direction as far as editors you know work correctly. Excel should work, or Open Office. I suggest a spreadsheet application b/c they a good job of exporting to CSV and taking care of quoting, etc.

Not that you can't support some oddities, but in general - you want to avoid it, and you want to avoid accidentally importing badly formed data.

Route 2: XML

I would suggest you do the following:

  • Define what the data your users have to import is with a schema definition (XSD). I like to keep the w3c definitions on hand. But there are good tutorials to help you write your own XSD definition.

  • Give your users a sample XML file to fill in, and a suggestion for an editor. There are great commercial ones, and reasonable free ones.

  • You can read your user's XML files, and be sure that if it validates then its good to go. For that matter, your users can validate before they send it to you.

Community
  • 1
  • 1
dsummersl
  • 6,588
  • 50
  • 65
  • +1 I think this solution is good in industry when setting up a feed between companies. However I don't think you can expect your user (who may not be tech savvy) to deal with csv formats or xml format specifications. Also I think you need to be careful converting files from unix to windows etc, as you may want to preserve newline characters within the quotechars. – Rusty Rob Feb 11 '12 at 23:52
1

Ah just found the sniffer class.

csvfile = open("example.csv", "rb")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
# ... process CSV file contents here ...
super9
  • 29,181
  • 39
  • 119
  • 172
1

Take a look at csv.Sniffer, which can help you guess what csv dialect a file is using.

Once you have a guess from the sniffer, try actually parsing the file with that dialect. If there are any properties of the data you can depend on (e.g., a certain number of fields), apply them to every retrieved record as a sanity check.

You can also have a two-stage upload process. First upload the file and sniff the dialect. Then show the user what a few rows of the data look like to you after parsing, and give the user the option to override some dialect settings in case it got it wrong. Then process the csv after confirmation. (The "import" dialog in Excel uses this multi-stage method.)

Francis Avila
  • 31,233
  • 6
  • 58
  • 96
0

I propose this method that check all the characters occuring n-1 times (n is the number of columns you need). It could give you the first possible answer or check all the file.

from collections import Counter
def snif_sep(txt, nbcol, force_all=False):
  pseps = None
  for line in txt.split('\n'):
    if line:
      psep = [k for k,v in Counter(line).items() if v==nbcol-1]
      if pseps is None:
        pseps = set(psep)
      else:
        pseps.intersection_update(psep)
      if len(pseps)==1 and not force_all:
        return pseps.pop()
      if len(pseps)==0:
        return None
  if len(pseps)==1:
    return pseps.pop()
ptitpoulpe
  • 684
  • 4
  • 17
0

Have you thought about using an XML Format? Excel has an XML Format that might be easier to parse and opens easily in excel.

You could also roll you own xml format.

http://msdn.microsoft.com/en-us/library/aa140066(office.10).aspx

    <?xml version="1.0"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <Styles>
        <Style ss:ID="sBold">
            <Font ss:Bold="1"/>
        </Style>
        <Style ss:ID="sDate">
            <NumberFormat ss:Format="General Date"/>
        </Style>
    </Styles>
    <Worksheet ss:Name="2100Q is 2009-Nov-11_17_43_13 ">
        <Table>
            <Row>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Date &amp; Time</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Operator ID</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Reading Mode</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Sample ID</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Sample Number</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Result</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Unit</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Notice</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Cal.Curve</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Cal.Time</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Cal.Status</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Std. 1 Nom. Value</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Std. 1 Act. Value</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Std. 2 Nom. Value</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Std. 2 Act. Value</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Std. 3 Nom. Value</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Std. 3 Act. Value</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Std. 4 Nom. Value</Data>
                </Cell>
                <Cell ss:StyleID="sBold">
                    <Data ss:Type="String">Std. 4 Act. Value</Data>
                </Cell>
            </Row>
            <Row>
                <Cell ss:StyleID="sDate">
                    <Data ss:Type="DateTime">2009-11-10T11:23:30</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">BARBARA</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Normal</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">ABC-abc-1234</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="Number">001</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="Number">1.01</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">FNU</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">StablCal</Data>
                </Cell>
                <Cell ss:StyleID="sDate">
                    <Data ss:Type="DateTime">2009-11-10T10:22:06</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">OK</Data>
                </Cell>
            </Row>
            <Row>
                <Cell ss:StyleID="sDate">
                    <Data ss:Type="DateTime">2009-11-10T10:24:15</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Cal.Verification</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="Number">1.01</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">FNU</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Verify Cal: Passed</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">StablCal</Data>
                </Cell>
                <Cell ss:StyleID="sDate">
                    <Data ss:Type="DateTime">2009-11-10T10:22:06</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">OK</Data>
                </Cell>
            </Row>
        </Table>
    </Worksheet>
</Workbook>
Mike Bynum
  • 763
  • 2
  • 10
  • 22
0

Have you considered a TAB separated file instead? It is easily readable and writable by all the software you mention and has given me a lot less trouble than CSV.

Though, I offer an enthusiastic +1 for the idea of making the users edit the file in a known, online editor. Google Docs, Zoho, etc. offer shared files and an ability for you to export the data - which puts you in charge of the format and makes it much easier to parse.

If you go with TSV make sure to scrub the data looking for quoted strings between the quotes? You can always use .strip...

tom stratton
  • 668
  • 7
  • 13
  • tab separated is kind of a no-no; spaces vs tabs causes all sorts of headaches down the road. – GoingTharn Feb 11 '12 at 15:38
  • I get the concept, but in practice for files edited with Excel "type" programs it is VERY hard to put a tab into the field as the program interprets it as a command to move to the next cell. For this use-case I've had strong success with TSV. – tom stratton Feb 11 '12 at 18:25
0

Honestly, if they have files with different formats the simplest solution would be to give them a pulldown that allows them to choose which program they are using. Then have the process tailored for that program.

There is no way to have a single process that is able to cover any and all potential formatting options. But breaking it down this way you can add them as required, and simplify the maintenance of it for yourself.

Drahkar
  • 1,694
  • 13
  • 17
0

Tools that allow you to view or import CSV files have to face this common problem. Tools include Database import tools, excel, open office etc. I know SOFA was made in python and allows csv import.

All these tools have a preview of the data so that the user can make sure it looks ok. At least if the preview looks wrong they can choose the csv delimiter they want to correct it. The tool they use to create the csv file should be consistent throughout, so if it looks ok in he preview, it will probably be ok for the rest of the file. APART from tricky, rare situations where data is escaped or enclosed in quotes.

If the file isn't too big, try create a set of all characters that occur in the file (characters that aren't a-z or 0-9). Now ensure in your preview you contain a line for each of the characters that occur. Then if this part of the preview is messed up, the user can change the quotechars. It is a bit of overhead making a good previewer. You will want to make sure the previewer shows the lines in order with .. to represent lines you have purposefully missed.

If a previewer isn't possible then may god be with you.

Rusty Rob
  • 16,489
  • 8
  • 100
  • 116