Dataset Preparation, Quality Assurance, and Quality Control

Overview

RSG conducted dataset preparation and quality control procedures at every stage of the study (before, during, and after data collection). These procedures were designed to validate survey logic, review participant experience, and confirm consistent data coding in the survey database. The following sections summarize the various dataset preparation and quality control steps. RSG provided a separate QAQC Plan to the Met Council for each wave of survey collection; these plans include data cleaning details for key elements.

Database Setup and Real-Time Quality Controls

Prior to each survey launch, RSG and the Met Council reviewed the survey instruments to ensure that the survey interface was clear and easy to use, questions were understandable, and variables were written to the database as expected. To reduce survey burden and improve final data quality, the survey also included real-time data checks and logic. Examples of these checks include the following:

  • Validation logic to prevent skipped questions.
  • Logic checks to hide irrelevant questions and answers (e.g., employment questions for children).
  • Spatial and temporal checks within trip rosters to prevent overlapping trips.

These real-time data checks do not eliminate every inconsistency, but they do significantly reduce reporting errors and re-coding requirements after data collection.
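
As an illustration of the temporal part of the trip roster check, the sketch below flags trips that depart before an earlier trip has arrived. It is a minimal example, not the survey instrument's actual logic: the function name find_overlapping_trips and the (trip_id, depart, arrive) tuple layout are assumptions made for this example.

```python
from datetime import datetime


def find_overlapping_trips(trips):
    """Return (earlier_trip_id, later_trip_id) pairs whose time windows overlap.

    `trips` is a list of (trip_id, depart, arrive) tuples; this layout is a
    hypothetical stand-in for a trip roster, not the survey's schema.
    """
    ordered = sorted(trips, key=lambda trip: trip[1])  # sort by departure time
    overlaps = []
    latest = None  # the already-seen trip with the latest arrival time
    for trip in ordered:
        # A trip that departs before an earlier trip has arrived overlaps it.
        if latest is not None and trip[1] < latest[2]:
            overlaps.append((latest[0], trip[0]))
        if latest is None or trip[2] > latest[2]:
            latest = trip
    return overlaps


roster = [
    ("t1", datetime(2023, 5, 1, 8, 0), datetime(2023, 5, 1, 8, 30)),
    ("t2", datetime(2023, 5, 1, 8, 20), datetime(2023, 5, 1, 9, 0)),
    ("t3", datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 15)),
]
print(find_overlapping_trips(roster))  # [('t1', 't2')]
```

A trip that departs exactly when the previous trip arrives (t3 above) is not treated as an overlap in this sketch.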

Geographic Data Checks

During data collection, the survey instruments used the Bing Maps API to geocode the coordinates for reported home, work, school, and trip addresses.

Following data collection, RSG also coded home location points to block groups and broader regional definitions.

Trip Derivation for Nonparticipating Household Members

Household travel surveys require data for all household members to assess complete household travel patterns. However, some exceptions are allowed in the data collection process where travel can be reported by proxy, particularly for children.

Household adults were asked to report travel for the children in the household (under age 18). Participants could also report children of all ages as travel party members on their own trips. RSG used these records to derive diary records for children under age 18.

Completion Criteria

The last step of dataset preparation involved reviewing all data records to confirm that they met survey, travel day, and household completion criteria. Complete households met the following conditions:

  1. The household completed the online recruitment/demographic survey.
  2. All ABS household members provided complete travel diary information (i.e., answered all surveys and reported all trips). Online panel members provided complete travel diary information for themselves (person 1 in the household).
  3. The household reported a home address within the study region.

In 2023, outreach segment households were marked as incomplete because they did not meet criteria 1 and 2: outreach participants completed the survey for themselves, but did not report complete information for their household.
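
The three completion criteria can be expressed as a single check per household. The sketch below is illustrative only: the Household fields, the segment labels ("abs", "panel", "outreach"), and the choice to hold every non-panel household to the all-members rule are assumptions for this example, not the names or rules used in RSG's scripts.

```python
from dataclasses import dataclass, field


@dataclass
class Household:
    # All field names are hypothetical, chosen for this example.
    recruit_survey_complete: bool
    sample_segment: str  # "abs", "panel", or "outreach"
    home_in_study_region: bool
    # Maps person number -> whether that person's travel diary is complete.
    diary_complete: dict = field(default_factory=dict)


def is_complete_household(hh):
    """Apply the three completion criteria to one household."""
    if not hh.recruit_survey_complete:  # criterion 1
        return False
    if not hh.home_in_study_region:  # criterion 3
        return False
    # Criterion 2: panel members report only for themselves (person 1);
    # every other segment is assumed to need all members complete.
    if hh.sample_segment == "panel":
        return hh.diary_complete.get(1, False)
    return bool(hh.diary_complete) and all(hh.diary_complete.values())
```

Under these assumptions, an outreach household in which only person 1 reported travel is classified as incomplete, which matches the 2023 treatment described above.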

Imputation

Departure Time

In some cases, the rMove™ app may have detected the start of a trip after its true start time, which can yield invalid or extreme values for trip duration and speed. In these cases, the fields depart_date, depart_hour, and depart_minute were adjusted for late pickup conditions using the following approach:

  • Departure time was imputed using the median speed between all locations along the trip, excluding the origin point, and the distance between the origin and the next point on the trip.
  • For trips with fewer than three recorded locations, the imputed departure time was set three minutes earlier than the original departure time to compensate for the 3-5-minute ping interval of rMove™. Some trips created by splitting loop trips may have three or fewer points; these retain the departure time imputed before the loop trip was split and thus may not follow this rule.
  • If the imputed departure time overlapped with the previous trip’s arrival time, the previous trip’s arrival time was used as the departure time instead.
  • Regardless of the number of locations along a trip, if the imputed departure time was later than the initially reported departure time, the imputed departure time was set to the original departure time.
  • User-added trips and long-distance passenger mode trips also retained the original departure time. User-added trips are not subject to late pickup conditions, and long-distance passenger mode trips are often plane trips whose collected traces contain speed information from other modes and are therefore less reliable (rMove™ cannot collect locations when a phone is in airplane mode).

Duration and speed are calculated based on the imputed departure time.
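
The rules above can be sketched as one function. This is an interpretation under stated assumptions, not RSG's implementation: the point layout (time, lat, lon), the names impute_departure and haversine_m, the back-calculation from the second point's timestamp, and the fallback to the original departure time when no usable speed exists are all choices made for this example.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000.0 * asin(sqrt(a))


def impute_departure(points, original_depart, prev_arrival=None, exempt=False):
    """Impute a departure time for one trip.

    points: time-ordered list of (time, lat, lon); points[0] is the origin.
    exempt: True for user-added or long-distance passenger mode trips.
    """
    if exempt:
        return original_depart
    if len(points) < 3:
        imputed = original_depart - timedelta(minutes=3)
    else:
        # Speeds between consecutive locations, excluding the origin point.
        speeds = []
        for (t0, la0, lo0), (t1, la1, lo1) in zip(points[1:], points[2:]):
            secs = (t1 - t0).total_seconds()
            if secs > 0:
                speeds.append(haversine_m(la0, lo0, la1, lo1) / secs)
        typical = median(speeds) if speeds else 0.0
        if typical <= 0:
            return original_depart  # assumed fallback: no usable speed
        first_leg_m = haversine_m(points[0][1], points[0][2], points[1][1], points[1][2])
        # Assumed reading: back-calculate from the second point's timestamp.
        imputed = points[1][0] - timedelta(seconds=first_leg_m / typical)
    if prev_arrival is not None and imputed < prev_arrival:
        imputed = prev_arrival  # do not overlap the previous trip
    if imputed > original_depart:
        imputed = original_depart  # never move the departure later
    return imputed
```

For example, a trip with only two recorded points and an original departure of 08:00 would be imputed to 07:57, unless the previous trip arrived after 07:57, in which case that arrival time is used.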

Purpose

Respondents report the purpose of the trip destination in each trip survey. The origin purpose is derived from the destination purpose of the previous trip, except for the first trip in the travel period or where an rMove™ trip occurs after a trip with item non-response. For the first trip in the travel period, the origin purpose can be inferred from begin_day in the day table.

When purpose was not asked because an analyst split a user-reported trip during data cleaning (creating a new destination along a trip), purpose values are derived where possible based on proximity (within 150 meters) to estimated home, work, or school locations. If the location is not proximate to home, work, or school locations, the purpose is set to other.
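
A minimal sketch of the 150-meter proximity rule follows. The function names and the dictionary of habitual locations are hypothetical, and choosing the nearest location when more than one falls within the threshold is an assumption; the text does not say how such ties were resolved.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000.0 * asin(sqrt(a))


def derive_purpose(lat, lon, habitual, threshold_m=150.0):
    """Return the label of the nearest habitual location within the threshold.

    habitual: dict such as {"home": (lat, lon), "work": (lat, lon)}.
    Falls back to "other" when no habitual location is close enough.
    """
    best_label, best_dist = "other", None
    for label, (hlat, hlon) in habitual.items():
        dist = haversine_m(lat, lon, hlat, hlon)
        if dist <= threshold_m and (best_dist is None or dist < best_dist):
            best_label, best_dist = label, dist
    return best_label


places = {"home": (44.9778, -93.2650), "work": (44.9537, -93.0900)}
print(derive_purpose(44.9788, -93.2650, places))  # about 111 m from home
print(derive_purpose(44.9000, -93.2000, places))  # far from both locations
```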

The purpose category variables (o_purpose_category, d_purpose_category) contain aggregated purpose values based on the type of purpose at the origin/destination of each trip. Dataset users are welcome to perform their own recoding of the purpose categories as well.

Trip purposes have been imputed in cases where a purpose reported by the user is assumed to be inaccurate based on information about that person’s reported habitual locations and other trips (primarily to home, work, and school locations). The trip purpose imputation approach was applied to all rMove™ trips in person-days with at least 1 complete trip and no more than 10 incomplete trips. (Incomplete trips are trips for which the respondent did not answer the trip-specific survey questions about purpose, mode, etc. for the given trip.)

The approach was to apply various tests in logical sequence to trips for which the stated purpose is not consistent with the location type based on the reported habitual locations. In general terms, the tests were designed to:

  • Check the respondent’s reported destination purpose when it conflicts with the destination location type. (The details of the tests depend on the trip purpose, with different criteria used for change-mode trips, escort trips, linked transit trips, trips with home destinations but other reported purposes, etc.)
  • Identify cases where respondents swapped the order of two or more trips when reporting their details.
  • Identify cases where respondents may have omitted a trip and shifted remaining reported trip details by one trip when reporting the rest of their trips.
  • Fill in missing data by sampling destination purposes from other trips made to the same locations, either by the same respondent or by other respondents.

Mode type (mode_type)

The mode_type variable condenses mode_1 through mode_3 into a single, easier-to-use variable for analysis (so that data users can avoid referencing every mode on a multimodal trip). Table 2.1 below shows the full crosswalk of which detailed modes correspond to which mode_types in the 2023 data. Mode types are prioritized in the order shown in Table 2.1, so lower mode_type values take priority over higher mode_type values in the derivation. For example, public bus trips, with mode_type 3, are prioritized over walk trips, with mode_type 12. When transit trips were unlinked using the Google API during cleaning, the non-transit legs of the trip were recoded using Google’s suggested mode (most frequently walk or bike) and do not have a reported mode_1, mode_2, or mode_3.

Table 2.1: Mode Type Hierarchy (2023)
Detailed Mode Value Detailed Mode Label Mode Type Value Mode Type Label
1 Northstar 1 Rail
2 Light rail (e.g., Blue Line, Green Line) 1 Rail
3 Other rail 1 Rail
4 School bus 2 School Bus
5 Bus rapid transit (e.g., A Line, C Line, Red Line) 3 Public Bus
6 Express/commuter bus 3 Public Bus
7 Local bus 3 Public Bus
8 Dial-A-Ride (e.g., Transit Link) 3 Public Bus
9 Metro Mobility 3 Public Bus
10 SouthWest Prime or MVTA Connect 3 Public Bus
11 Employer-provided shuttle/bus 4 Other Bus
12 University/college shuttle/bus 4 Other Bus
13 Other private shuttle/bus (e.g., a hotel's, an airport's) 4 Other Bus
14 Vanpool 4 Other Bus
15 Other bus 4 Other Bus
16 Intercity rail (e.g., Amtrak) 5 Long distance passenger mode
17 Intercity bus (e.g., Greyhound, Jefferson Lines) 5 Long distance passenger mode
18 Airplane/helicopter 5 Long distance passenger mode
19 Uber, Lyft, or other smartphone-app ride service 6 Smartphone ridehailing service
20 Regular taxi (e.g., Yellow Cab) 7 For-Hire Vehicle
21 Other hired car service (e.g., black car, limo) 7 For-Hire Vehicle
22 Household vehicle 1 8 Household Vehicle
23 Household vehicle 2 8 Household Vehicle
24 Household vehicle 3 8 Household Vehicle
25 Household vehicle 4 8 Household Vehicle
26 Household vehicle 5 8 Household Vehicle
27 Household vehicle 6 8 Household Vehicle
28 Household vehicle 7 8 Household Vehicle
29 Household vehicle 8 8 Household Vehicle
30 Other vehicle in household 8 Household Vehicle
31 Other motorcycle in household 9 Other Vehicle
32 Other motorcycle (not my household's) 9 Other Vehicle
33 Car from work 9 Other Vehicle
34 Friend/relative/colleague's car 9 Other Vehicle
35 Rental car 9 Other Vehicle
36 Carpool match (e.g., Waze Carpool) 9 Other Vehicle
37 Carshare service (e.g., Zipcar) 9 Other Vehicle
38 Peer-to-peer car rental (e.g., Turo) 9 Other Vehicle
39 Other vehicle (not my household's) 9 Other Vehicle
61 Electric vehicle carshare (e.g., Evie) 9 Other Vehicle
40 Electric bicycle (my household's) 10 Micromobility
41 Standard bicycle (my household's) 10 Micromobility
42 Borrowed bicycle (e.g., a friend's) 10 Micromobility
43 Bike-share - standard bicycle 10 Micromobility
44 Bike-share - electric bicycle 10 Micromobility
45 Other rented bicycle 10 Micromobility
46 Personal scooter or moped (not shared) 10 Micromobility
47 Scooter-share (e.g., Bird, Lime) 10 Micromobility
48 Moped-share (e.g., Scoot) 10 Micromobility
49 Segway 10 Micromobility
50 Other scooter or moped 10 Micromobility
51 Skateboard or rollerblade 10 Micromobility
52 Other boat (e.g., kayak) 11 Other
53 Vehicle ferry (took vehicle on board) 11 Other
54 Other public ferry or water taxi 11 Other
55 Golf cart 11 Other
56 Snowmobile 11 Other
57 ATV 11 Other
58 Medical transportation service 11 Other
59 Other 11 Other
60 Walk (or jog/wheelchair) 12 Walk
This mode type hierarchy table contains the values from the 2023 dataset; some names for mode types have changed slightly since the 2019 and 2021 surveys. For more information, consult the combined codebook.
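
The derivation of a single mode_type from up to three reported modes can be sketched as follows. The crosswalk is a small subset of Table 2.1, and the names MODE_TO_TYPE, TABLE_ORDER, and derive_mode_type are hypothetical. The ranking is passed in explicitly, and the default that follows the row order of Table 2.1 (rail first, walk last) is an assumption of this example rather than a confirmed rule.

```python
# Subset of the Table 2.1 crosswalk: detailed mode value -> mode type value.
MODE_TO_TYPE = {
    2: 1,    # Light rail -> Rail
    7: 3,    # Local bus -> Public Bus
    19: 6,   # Uber, Lyft, etc. -> Smartphone ridehailing service
    22: 8,   # Household vehicle 1 -> Household Vehicle
    41: 10,  # Standard bicycle -> Micromobility
    60: 12,  # Walk -> Walk
}

# Assumed ranking: mode types in Table 2.1 row order, highest priority first.
TABLE_ORDER = list(range(1, 13))


def derive_mode_type(modes, ranking=TABLE_ORDER):
    """Pick the highest-priority mode type among the reported modes.

    modes: values of mode_1, mode_2, mode_3 (missing entries may be None).
    Returns None when no reported mode appears in the crosswalk.
    """
    types = [MODE_TO_TYPE[m] for m in modes if m in MODE_TO_TYPE]
    if not types:
        return None
    return min(types, key=ranking.index)


print(derive_mode_type([60, 7, None]))  # walk + local bus -> 3 (Public Bus)
```

Passing a different ranking list changes the prioritization without touching the crosswalk, which may help when comparing against other survey years.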

iOS Trip Trace Irregularities (Wave 3 2023 Data Only)

Apple's release of iOS 16.4 on March 27, 2023, significantly changed background location tracking, affecting apps such as rMove™ that rely on collecting location information. Consequently, iPhone users with iOS 16.4 or later experienced irregular trip traces within the rMove™ app, impacting data accuracy for Spring 2023.

To address this issue, RSG swiftly updated the rMove™ app and monitoring scripts to mitigate inconsistencies in future data collection. Despite these efforts, the 2023 dataset remained affected. To manage the impact on the dataset and downstream processes, RSG developed a series of criteria to identify suspect trips and flag individuals or households accordingly. Additionally, adjustments in weighting and dataset delivery were made to ensure maximum data utility.

RSG identified suspect trip trace records and delivered the dataset with two sets of weights: one for use with the full dataset and one for use when applying these strict criteria, so that trip analysis metrics can exclude potential trip trace irregularities.

Combined Dataset (2019, 2021 and 2023)

To facilitate analyses across waves of the survey, RSG developed a cross-wave combined dataset and codebook.

Combined Codebook

Note: An Excel version of the combined codebook is available for download.

RSG typically delivers data in its raw form, with the numeric codes that correspond to survey entries instead of the text seen by participants. The codebook allows data users to translate survey results into a human-readable format, and consists of two parts:

  • A variable list, which includes attributes at the level of individual survey questions; and
  • A value labels table, which corresponds to attributes of survey responses.

A combination of manual and scripted processes was used to create a combined codebook:

  1. Variable crosswalk (manual process in Excel). First, a variable crosswalk table was constructed by aligning variable names and survey questions across years. Where variable names differed, but survey question meaning stayed the same, a unified variable name was chosen. Logic, variable descriptions, location of the variable in the database, and other attributes were inspected manually and with a combination of processes in an Excel workbook (i.e., using VLOOKUP processes and other formulas).

    Example: The variable fuel in 2019 is renamed fuel_type in 2023. The unified variable name becomes fuel_type.

  2. Value label crosswalk (manual process in Excel). Next, a value label crosswalk was constructed in a similar manner. A crosswalk of numeric values was created for cases where value labels differed for the same numeric value. These unified values and value labels were then used to construct upcoded values and value labels, which consolidate disparate categories with similar meanings.

    Example: The Hybrid (HEV) and Plug-in hybrid (PHEV) vehicle fuel types, used in the 2021 and 2023 data, are upcoded to Hybrid, the single category used in the 2019 data.

Table 2.2: Combined Codebook Example: Fuel Type
Value 2019 | Value 2021 | Value 2023 | Value Unified | Value Upcoded | Label 2019 | Label 2021 | Label 2023 | Label Unified | Label Upcoded
-9998 | NA | NA | -9998 | 995 | Missing: Non-response | | | Missing | Missing
1 | 1 | 1 | 1 | 1 | Gas | Gas | Gas | Gas | Gas
NA | 2 | 2 | 2 | 2 | | Hybrid (HEV) | Hybrid (HEV) | Hybrid (HEV) | Hybrid
3 | 3 | 3 | 3 | 2 | Hybrid | Plug-in hybrid (PHEV) | Plug-in hybrid (PHEV) | Plug-in hybrid (PHEV) | Hybrid
4 | 4 | 4 | 4 | 3 | Electric | Electric (EV) | Electric (EV) | Electric (EV) | Electric (EV)
2 | 5 | 5 | 5 | 4 | Diesel | Diesel | Diesel | Diesel | Diesel
NA | 6 | 6 | 6 | 5 | | Flex fuel (FFV) | Flex fuel (FFV) | Flex fuel (FFV) | Other
997 | 7 | 7 | 7 | 5 | Other | Other (e.g., natural gas, bio-diesel) | Other (e.g., natural gas, bio-diesel) | Other (e.g., natural gas, bio-diesel) | Other
995 | 995 | NA | 995 | 995 | Missing: Skip logic | Missing | | Missing | Missing

Combined Dataset

A scripted process, written in R and relying on the combined codebook, was used to create a single dataset containing all three waves of survey data. The scripted process:

  1. Renamed dataset columns (variables) from their year-specific names to the unified names chosen in the combined codebook.
  2. In each column (for each variable), replaced year-specific numeric response codes with their unified response codes.
  3. Repeated step 2 with upcoded response codes, to create an upcoded dataset.
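
The three scripted steps can be illustrated with the fuel type example from the combined codebook. RSG's script was written in R; the sketch below uses Python with plain dictionaries purely for illustration, and the names VARIABLE_RENAMES, VALUE_RECODES, UPCODES, unify, and upcode are hypothetical. The numeric mappings are taken from the value columns of Table 2.2.

```python
# Step 1 input: year-specific variable names -> unified names.
VARIABLE_RENAMES = {2019: {"fuel": "fuel_type"}}

# Step 2 input: year-specific codes -> unified codes, keyed by unified name.
# Only codes that change need to be listed (2019 values from Table 2.2).
VALUE_RECODES = {2019: {"fuel_type": {2: 5, 997: 7}}}

# Step 3 input: unified codes -> upcoded codes (from Table 2.2).
UPCODES = {"fuel_type": {1: 1, 2: 2, 3: 2, 4: 3, 5: 4, 6: 5, 7: 5, -9998: 995, 995: 995}}


def unify(rows, year):
    """Rename columns and recode values for one survey year (steps 1 and 2)."""
    renames = VARIABLE_RENAMES.get(year, {})
    recodes = VALUE_RECODES.get(year, {})
    unified_rows = []
    for row in rows:
        unified = {}
        for name, value in row.items():
            new_name = renames.get(name, name)
            unified[new_name] = recodes.get(new_name, {}).get(value, value)
        unified_rows.append(unified)
    return unified_rows


def upcode(rows):
    """Replace unified codes with upcoded codes (step 3)."""
    return [
        {name: UPCODES.get(name, {}).get(value, value) for name, value in row.items()}
        for row in rows
    ]


diesel_2019 = unify([{"fuel": 2}], 2019)
print(diesel_2019)          # [{'fuel_type': 5}]
print(upcode(diesel_2019))  # [{'fuel_type': 4}]
```

Codes and variables that do not appear in a mapping pass through unchanged, so variables that were already consistent across years need no crosswalk entry in this sketch.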

Special Output: Trip Purpose Table

The trip_purpose table was derived from the upcoded and unified trip tables. Its purpose is to aid data analysis of overall trip purposes. This table was developed because the origin and destination purpose categories in the trip table can contain non-intuitive classifications. For example, a summary of destination purposes will have many trips home, but the overall trip purpose for a trip home might actually correspond to the non-home trip end (work, school, etc.).

Removing trips with a destination of home from the overall analysis is one option, but this can lead to some place types being missing from the final dataset when the trip roster for a person’s day is incomplete. For example, if a person’s trip diary for a day consists of a trip from a friend’s house to their home, the place type friend’s house will be missing from the final summary of trip purposes.

To account for all place types – both origin and destination – in the final trip purpose summaries, the following steps were used to create the trip purpose table:

  1. Transit trips that had been unlinked into access, transit, and egress legs were re-linked by consolidating multiple legs of transit trips into a single record. This removes change mode trips from the table, except for long-distance trips. The trip weight for each linked trip was set to the maximum trip weight of its component unlinked trips.
  2. Trips were placed in two categories: home-based (having one trip end at home) and non-home-based trips.
  3. Home-based trips’ purposes were classified as the non-home end. The weight for this trip purpose record is equal to the original trip weight.
  4. Non-home-based trips were split into two records for each trip: one for the origin end, and a second for the destination end. The weight for each record was set to half of the original trip weight.
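
Steps 2 through 4 can be sketched as a function that turns linked trips into trip purpose records. This is an illustrative reading of the steps, not RSG's code: the record layout, the function name, the short purpose labels, and the round-number weights are hypothetical, and keeping home-to-home loop trips as a single full-weight home record is an assumption.

```python
def build_trip_purpose_records(trips, home_label="Went home"):
    """Create trip purpose records from already-linked trips.

    trips: list of dicts with o_purpose, d_purpose, and trip_weight keys.
    """
    records = []
    for trip in trips:
        origin_home = trip["o_purpose"] == home_label
        dest_home = trip["d_purpose"] == home_label
        weight = trip["trip_weight"]
        if origin_home != dest_home:
            # Home-based trip: use the non-home end at full weight.
            purpose = trip["d_purpose"] if origin_home else trip["o_purpose"]
            records.append({"purpose": purpose, "trip_purpose_weight": weight})
        elif not origin_home:
            # Non-home-based trip: one record per end at half weight each.
            for end in ("o_purpose", "d_purpose"):
                records.append({"purpose": trip[end], "trip_purpose_weight": weight / 2})
        else:
            # Home-to-home loop trip (assumed handling): keep as home.
            records.append({"purpose": home_label, "trip_purpose_weight": weight})
    return records


day = [
    {"o_purpose": "Went home", "d_purpose": "Work", "trip_weight": 100.0},
    {"o_purpose": "Work", "d_purpose": "Exercise", "trip_weight": 80.0},
    {"o_purpose": "Exercise", "d_purpose": "Went home", "trip_weight": 80.0},
]
for record in build_trip_purpose_records(day):
    print(record)
```

Because each non-home-based trip contributes two half-weight records, the total weight across trip purpose records equals the total weight of the input trips.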

The tables below show a hypothetical example of this process for the travel diary corresponding to day_id 199885710201.

In the trip table, there are four records that correspond to this day. The person left home, went to work, went on an exercise trip or to the gym, picked up someone from school, and finally returned home.

Table 2.3: Trips on Day 199885710201
trip_id o_purpose d_purpose trip_weight
1998857102001 Went home Primary workplace 233.4265
1998857102002 Primary workplace Exercise or recreation (e.g., gym, jog, bike, walk dog) 363.7046
1998857102003 Exercise or recreation (e.g., gym, jog, bike, walk dog) Pick-up/drop-off to/from K-12 school or college 363.7046
1998857102004 Pick-up/drop-off to/from K-12 school or college Went home 363.7046

An analysis of trip purpose by destination place types (d_purpose) would yield an overall trip purpose share of 17.6% of trips to work, 27.5% of trips to exercise, 27.5% of trips to escort others to school, and 27.5% of trips to home:

Table 2.4: Trip Destination Purpose Share, Day 199885710201
d_purpose trip_weight purpose_share
Primary workplace 233.4 17.6%
Exercise or recreation (e.g., gym, jog, bike, walk dog) 363.7 27.5%
Pick-up/drop-off to/from K-12 school or college 363.7 27.5%
Went home 363.7 27.5%
Total 1,324.5 100.0%
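
The weighted shares in Table 2.4 can be reproduced with a short calculation. The weights below are the trip_weight values from Table 2.3; the function name purpose_shares and the shortened purpose labels are choices made for this example.

```python
from collections import defaultdict


def purpose_shares(records, purpose_key, weight_key):
    """Return each purpose's share of the total weight."""
    totals = defaultdict(float)
    for record in records:
        totals[record[purpose_key]] += record[weight_key]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {purpose: weight / grand_total for purpose, weight in totals.items()}


# Destination purposes and weights from Table 2.3 (labels shortened).
trips = [
    {"d_purpose": "Primary workplace", "trip_weight": 233.4265},
    {"d_purpose": "Exercise or recreation", "trip_weight": 363.7046},
    {"d_purpose": "Pick-up/drop-off", "trip_weight": 363.7046},
    {"d_purpose": "Went home", "trip_weight": 363.7046},
]
shares = purpose_shares(trips, "d_purpose", "trip_weight")
for purpose, share in shares.items():
    print(f"{purpose}: {share:.1%}")
```

The same function can be applied to trip purpose records by passing the purpose and trip_purpose_weight column names instead.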

In the trip purpose table for the same day ID, there are six rows instead of four – the trip from work to exercise, and the trip from exercise to pick up someone from school, have both been expanded to two rows to allow trip weight to be distributed across them. For the home-based trips, the trip purpose has been assigned to the non-home end of the trip (work and escort, respectively).

Table 2.5: Trip Purpose Records for Day 199885710201
trip_purpose_id purpose trip_purpose_weight
181704 Primary workplace 233.4265
181705 Pick-up/drop-off to/from K-12 school or college 363.7046
480995 Primary workplace 181.8523
480996 Exercise or recreation (e.g., gym, jog, bike, walk dog) 181.8523
732042 Exercise or recreation (e.g., gym, jog, bike, walk dog) 181.8523
732043 Pick-up/drop-off to/from K-12 school or college 181.8523

Summarizing the trip purpose table yields 31.4% of trips for work (i.e., just under one-third of trips are work-related), 27.5% of trips for exercise or recreation, and 41.2% of trips to pick up others from school.

Table 2.6: Trip Purpose Share from Trip Purpose Table, Day 199885710201
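purpose trip_purpose_weight purpose_share
Primary workplace 233.4 17.6%
Primary workplace 181.9 13.7%
Subtotal, Primary workplace 415.3 31.4%
Pick-up/drop-off to/from K-12 school or college 363.7 27.5%
Pick-up/drop-off to/from K-12 school or college 181.9 13.7%
Subtotal, Pick-up/drop-off to/from K-12 school or college 545.6 41.2%
Exercise or recreation (e.g., gym, jog, bike, walk dog) 181.9 13.7%
Exercise or recreation (e.g., gym, jog, bike, walk dog) 181.9 13.7%
Subtotal, Exercise or recreation (e.g., gym, jog, bike, walk dog) 363.7 27.5%
Total 1,324.5 100.0%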

Alternatively, the data user could remove trips home and calculate purpose share using the destination purpose, using only the subset of trips that do not end at home. For this travel day, calculating purpose share from the d_purpose for the subset of trips that do not end at home yields a greater share of trips for exercise (37.9% versus 27.5%) and smaller shares of trips for work (24.3% versus 31.4%) and pick-up (37.9% versus 41.2%) relative to the calculations from the trip purpose table.

Table 2.7: Trip Purpose Share from Trip Table, Non-Home Trips, Day 199885710201
d_purpose trip_weight purpose_share
Primary workplace 233.4 24.3%
Exercise or recreation (e.g., gym, jog, bike, walk dog) 363.7 37.9%
Pick-up/drop-off to/from K-12 school or college 363.7 37.9%
Total 960.8 100.0%

Table 2.8 below shows how using destination purpose in the trip table compares to using the overall trip purpose in the trip purpose table. The vast majority of trips home have been re-categorized (a small number remain, where both origin and destination were home, i.e., loop trips without an intermediate stop point). The estimate of total number of trips differs across the two tables as well, because the trip purpose table has consolidated change mode trips into a broader linked trip purpose.

Table 2.8: Trip Purpose Share from Trip and Trip Purpose Table (2023 only)
purpose/d_purpose Trip Table, Selected Categories Trip Purpose Table, All Categories
Share Total Share Total
Home 32.4% 4,339,661 0.1% 10,526
Shopping 10.8% 1,447,777 16.3% 1,927,275
Social/Recreation 9.7% 1,305,811 16.0% 1,889,502
Escort 9.6% 1,285,127 15.6% 1,847,927
Work 8.2% 1,096,092 13.4% 1,587,782
Meal 6.9% 929,880 10.4% 1,229,728
Errand 5.5% 731,313 9.9% 1,167,337
Work related 5.2% 694,438 7.3% 867,563
School 3.5% 468,666 5.7% 673,190
Change mode 3.4% 451,611 NA NA
Overnight 2.5% 335,107 3.4% 398,291
Other 1.9% 258,319 1.2% 146,963
School related 0.5% 65,032 0.7% 82,248
Total 100.0% 13,408,835.3 100.0% 11,828,331.3

Dataset Composition

The final unweighted dataset includes seven distinct data tables. These tables include all user-input survey variables, certain survey metadata (e.g., survey completion mode), and variables derived to support data analysis.

Table 2.9: Dataset Composition, Combined All Years (2019-2023)
Table Rows
Household 19,170 complete households
Person 38,691 people
Vehicle 30,239 vehicles
Day 157,947 days
Trip 623,926 unlinked trips
Location 12,345,002 points
Trip Purpose 343,372 linked, single-ended trips; 251,366 unlinked, two-ended trips