Data and API Requests

Common File Formats

Tabular data:

Format   Description
.csv     Comma-Separated Values
.tsv     Tab-Separated Values
.xlsx    Excel spreadsheets

Non-tabular data:

Format   Description
.txt     Plain text
.rtf     Rich Text Format
.xml     Markup-based structured data

Images and Binary:

  • .png, .jpg, .tif — image files
  • .dat — generic data files (format depends on source)
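
Image and .dat files are usually binary, so they can't be read as plain text. A minimal sketch of reading raw bytes, assuming a local file named photo.png (a placeholder name):

# Open in binary mode ('rb') so read() returns bytes instead of decoded text
with open('photo.png', 'rb') as file:
    header = file.read(8)    # first 8 bytes; for PNGs this is the file signature
print(header)                # b'\x89PNG\r\n\x1a\n' for a valid PNG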

Reading Data from Files

Python's built-in open() function returns a file object you can read from or write to.

with open('file.txt') as file:
    contents = file.read()
print(contents)

  • with open(...) handles file closing automatically.
  • 'r' = read (default), 'w' = write, 'a' = append.

# Reading line by line
with open('file.txt') as file:
    for line in file:
        print(line.strip())

Writing to a file:

with open('newfile.txt', 'w') as file:
    file.write("Hello, world!")

Reading and writing CSV files:

import csv

# Reading
with open('file.csv', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['column_name'])

# Writing
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age'])
    writer.writerow(['Adam', 34])
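
If your rows are dictionaries (the shape DictReader produces), csv.DictWriter is the mirror image; a minimal sketch:

import csv

# Writing dictionaries instead of lists
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age'])
    writer.writeheader()                          # header row from fieldnames
    writer.writerow({'name': 'Adam', 'age': 34})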
Syntax                        Description
with open('file.txt', 'w')    Opens the file for writing (creates it, or overwrites existing contents)
with open('file.txt', 'a')    'a' for append; writes are added to the end of the file
with open('file.txt', 'r')    'r' is the default, so there is no need to specify it; opens the file for reading
.read()                       Reads the whole document into a single string
.readlines()                  Reads the whole file into a list of strings, one per line
.readline()                   Reads one line at a time; each call returns the next line
.seek()                       Moves the file cursor to a given position in the file
.write("string")              Writes the string to a file opened in 'w' or 'a' mode
for line in file:             Iterates through the lines in the file
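
A short sketch tying several of these together, assuming a notes.txt file (a placeholder name):

# Append a line, then read the file back two different ways
with open('notes.txt', 'a') as file:
    file.write("one more line\n")

with open('notes.txt') as file:
    first = file.readline()       # just the first line
    file.seek(0)                  # jump back to the start of the file
    all_lines = file.readlines()  # every line, as a list of strings
print(first, all_lines)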

Working with APIs Using Requests

Python's requests library is your go-to tool for fetching web data. Install it with:

pip install requests

Make GET requests with:

import requests

r = requests.get("https://api.example.com/data")
print(r.text)       # Raw response
print(r.json())     # Parsed JSON data
# Saving the parsed JSON data as a variable
data = requests.get("https://api.example.com/data").json()
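
Real requests can fail, so check the response before parsing it; a minimal sketch using requests' built-in error handling:

import requests

r = requests.get("https://api.example.com/data")
r.raise_for_status()      # raises requests.HTTPError on 4xx/5xx responses
print(r.status_code)      # 200 on success
data = r.json()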

Example API query (US Census):

url = "https://api.census.gov/data/2020/acs/acs5?get=NAME,B08303_001E&for=state:*"
response = requests.get(url)
print(response.json())

API Query Structure:

  • After ? are query parameters
  • get=... defines which variables to retrieve
  • for=state:* means "for all states"
  • List multiple values for one parameter with commas (e.g. &for=state:06,49); separate different parameters with &
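
Rather than building the query string by hand, you can let requests assemble it via the params argument; a sketch of the same Census query (requests URL-encodes the values for you):

import requests

# Same query as above, with requests building the ?get=...&for=... part
params = {
    "get": "NAME,B08303_001E",   # variables to retrieve
    "for": "state:*",            # all states
}
response = requests.get("https://api.census.gov/data/2020/acs/acs5", params=params)
print(response.json())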

Data Collection

  • Primary: collected by you (e.g. surveys, scraping, simulations)
  • Secondary: collected by others and made public (e.g. government databases, open APIs)

When collecting data, ask:

  • What data is needed?
  • How much is enough?
  • Where can it be found?
  • Are there legal or privacy concerns?

Useful resources for scraping and datasets:

Google Sheets as a Data Source

You can treat Google Sheets like a cloud-based database using the gspread library together with the Google Sheets and Drive APIs.

Setup Steps:

  1. Enable the Google Sheets API at https://console.cloud.google.com/
  2. Download the JSON credentials and rename it creds.json
  3. Add creds.json to your .gitignore (it contains sensitive data!)
  4. Share your spreadsheet with the client_email in the creds file (Editor access)

Install required libraries:

pip install gspread google-auth

Example usage:

import gspread
from google.oauth2.service_account import Credentials

# OAuth scopes for the current Sheets v4 / Drive APIs
scope = ["https://www.googleapis.com/auth/spreadsheets", "https://www.googleapis.com/auth/drive"]
creds = Credentials.from_service_account_file("creds.json", scopes=scope)
client = gspread.authorize(creds)

sheet = client.open("MySpreadsheet").sheet1
data = sheet.get_all_records()
print(data)
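
gspread can write back to the sheet as well; a minimal sketch, reusing the sheet object from above:

# Append a new row below the existing data
sheet.append_row(["Adam", 34])

# Update a single cell by row and column number (here: row 1, column 2)
sheet.update_cell(1, 2, "age")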