Thursday, November 20, 2014

Python Sets to Align CSV Columns

From recent work, I've found Python set operations to be easier than using Excel VLOOKUP or MATCH formulas.  Here are a few uses:

Visually align two columns from a CSV file, adding the phrase "--MISSING--" into columns where values do not match.

SOMEFILE = '/path/to/some/file.csv'

with open(SOMEFILE, 'r') as csvfile:
                filereader = csv.reader(csvfile, delimiter=',',quotechar='"')
                for row in filereader:
                        set_old.add(row[0])  #column 0

                        set_new.add(row[1])  #column 1

total = sorted(set_old|set_new)
list1 = [x if x in set_old else "--MISSING--" for x in total]
list2 = [x if x in set_new else "--MISSING--" for x in total]

for i in range(0, len(list1)):
                print i, "\t", list1[i], "\t", list2[i]


You can follow this up and pipe the output to "column -t" command in linux to produce an easy to read list.

python script.py | column -t

For example

Column0        Column1
1                     2
2                     3

would become:

Column0        Column1
1                     --Missing--
2                      2
--Missing--      3

It is very easy to pull keys from an existing array or dictionary and perform set operations on them such as comparing lists of users in access or configuration files.

for user in new_users.difference(old_users):
         print user



Here are some other interesting set functions:

A = set(['1','2','3','4'])
B = set(['3','4','5','6'])
C = set(['1','2','3','4','5','6','7','8','9','10'])
D = set(['1','2'])

A.issubset(B)
False

A.issubset(C)
True

A.issuperset(B)
False

A.issuperset(D)
True

A.intersection(B)
['3','4']

A.union(B)
['1','2','3','4','5','6']

A.difference(B)
['1','2']

A.symmetric_difference.B
['1','2','5','6']

Have you used Python sets in Geospatial applications?  Let me know how in the comments.

https://docs.python.org/2/library/sets.html


No comments:

Post a Comment