
I wrote a basic program that loads a CSV edgelist into a network, calculates four centrality metrics for each node, and writes the results to a CSV file. I'm using NetworkX, and everything worked fine when the node IDs were numbers. However, now that I've moved to another example that uses Twitter usernames as node IDs, I get the following error:

Error

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 23-24: invalid continuation byte

Code

import sys
import networkx as nx
import csv


# load CSV edgelist into NetworkX
G = nx.read_edgelist(sys.argv[1], delimiter=',')


# calculate centrality metrics
degree = nx.degree_centrality(G)
between = nx.betweenness_centrality(G)
close = nx.closeness_centrality(G)
eigen = nx.eigenvector_centrality(G)


# write centrality results to a list
centrality = []
for i in G:
    row = i, degree[i], between[i], close[i], eigen[i]
    centrality.append(row)

# write list to CSV
outfile = sys.argv[1].replace('.csv', '_metrics.csv')
header = 'NodeID', 'Degree', 'Betweenness', 'Closeness', 'Eigenvector'
with open(outfile, 'wb') as f:
    writer = csv.writer(f)  # create the writer once and reuse it
    writer.writerow(header)
    writer.writerows(centrality)
CurtLH
  • What encoding is the `.csv` file saved in? [`read_edgelist()` defaults to `'utf-8'`.](http://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.readwrite.edgelist.read_edgelist.html?highlight=read_edgelist#networkx.readwrite.edgelist.read_edgelist) If the file is saved in a different encoding, you need to tell `read_edgelist()` which one. – Jonathan Lonowski Jun 25 '14 at 19:12
  • What version of Python are you using? I assume 2.x. First try placing the following line at the top of your script and let us know: `#-*- coding: utf-8 -*-` – Bee Smears Jun 25 '14 at 19:13
  • You're getting the error in the first non-import line of your program, right? Delete the rest of the program, it's irrelevant. (Then do as @Jonathan says.) – alexis Jun 25 '14 at 21:18
  • @Bee, that shouldn't make any difference unless the unicode is in string literals in the script. – alexis Jun 25 '14 at 21:20
  • @alexis you're totally right. I hadn't read the question closely enough. After reading it, I suggested the Google Drive fix below, assuming the OP doesn't know the original encoding and doesn't want to spend time figuring it out. – Bee Smears Jun 26 '14 at 04:52
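
To find out which encoding the file might be in, as the first comment asks, one option is to try a few candidates in order. This is only a minimal sketch: `guess_encoding` and the candidate list are hypothetical, not part of NetworkX, and a clean decode does not prove the guess is right.

```python
import io

# The candidate list is an assumption. Order matters: latin-1 accepts
# any byte sequence, so it has to come last as a catch-all.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def guess_encoding(path, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes the whole file."""
    with io.open(path, "rb") as f:
        raw = f.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the file, try the next
        return name
    return None
```

The returned name could then be passed to the `encoding` argument of `read_edgelist()` that the linked documentation describes, for example `nx.read_edgelist(sys.argv[1], delimiter=',', encoding=guess_encoding(sys.argv[1]))`.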

1 Answer


If you want to fix it quickly and you don't know how your file's characters are encoded, I would use Google Drive to re-save the file so that every character in it is encoded as UTF-8.

Here's how:

  • Navigate to Google Drive / "Create" / "Spreadsheet"
  • Once in the new spreadsheet, click "File" and select "Import"
  • Then select "Upload" followed by "Select a file from your computer"
  • From the "Import File" dialog box, select "Replace spreadsheet" and a separator option (note: the "Detect automatically" option works for me)
  • Once in the imported CSV, select "File" / "Download as" / "Comma Separated Values (.csv, current sheet)"

That's my process for quickly re-encoding every character in a CSV as UTF-8. Obviously there are many times when it isn't the right answer and you'll want to understand character encoding properly. But if all you want is to manipulate the data in your file without spending hours on encoding issues, I've found Google Drive to be the fastest and most reliable solution.
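
If you do know (or can guess) the original encoding, the same re-encoding can be done without leaving Python. This is a swapped-in alternative to the Google Drive steps, not what they do internally; `reencode_to_utf8` is a hypothetical helper, and the `latin-1` default is only an assumption about the source file.

```python
import io


def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite src_path as UTF-8 at dst_path.

    src_encoding is an assumption about the input file; a wrong value
    yields wrong characters in the output rather than an error.
    """
    # newline="" leaves line endings untouched in both directions
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

For example, `reencode_to_utf8('edges.csv', 'edges_utf8.csv')` (both file names are placeholders) would produce a copy that `read_edgelist()` can read with its default encoding.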

Note: credit to this answer for initially turning me on to this solution.

Bee Smears