Dataset Preparation, Quality Assurance, and Quality Control

Overview

RSG conducted dataset preparation and quality control procedures at every stage of the study (before, during, and after data collection). These procedures were designed to validate survey logic, review participant experience, and confirm consistent data coding in the survey database. The following sections summarize the various dataset preparation and quality control steps. RSG provided a separate QAQC Plan to the Met Council for each wave of survey collection; these plans include data cleaning details for key elements.

Database Setup and Real-Time Quality Controls

Prior to each survey launch, RSG and the Met Council reviewed the survey instruments to ensure that the survey interface was clear and easy to use, questions were understandable, and variables were written to the database as expected. To reduce survey burden and improve final data quality, the survey also included real-time data checks and logic. Examples of these checks include the following:

Validation logic to prevent skipped questions.
Logic checks to hide irrelevant questions and answers (e.g., employment questions for children).
Spatial and temporal checks within trip rosters to prevent overlapping trips.

These real-time data checks do not eliminate every inconsistency, but they do significantly reduce reporting errors and re-coding requirements after data collection.

Geographic Data Checks

During data collection, the survey instruments used the Bing Maps API to geocode the coordinates for reported home, work, school, and trip addresses.

Following data collection, RSG also coded home location points to block groups and broader regional definitions.

Trip Derivation for Nonparticipating Household Members

Household travel surveys require data for all household members to assess complete household travel patterns. However, some exceptions are allowed in the data collection process where travel can be reported by proxy, particularly for children.

Household adults were asked to report travel for the children in the household (under age 18). Participants could also report children of all ages as travel party members on their own trips. RSG used these records to derive diary records for children under age 18.

Completion Criteria

The last step of dataset preparation involved reviewing all data records to confirm that they met survey, travel day, and household completion criteria. Complete households met the following conditions:

The household completed the online recruitment/demographic survey.
All ABS household members provided complete travel diary information (i.e., answered all surveys and reported all trips). Online panel members provided complete travel diary information for themselves (person 1 in the household).
The household reported a home address within the study region.

In 2023, outreach segment households were marked as incomplete because they did not meet criteria 1 and 2: outreach participants completed the survey for themselves, but did not report complete information for their household.

Imputation

Departure Time

In some cases, the rMove™ app may have detected the start of a trip after its true start time, which can yield invalid or extreme values for trip duration and speed. In these cases, the fields depart_date, depart_hour, and depart_minute were adjusted for late pickup conditions using the following approach:

Departure time was imputed using the median speed between all locations along the trip, excluding the origin point, and the distance between the origin and the next point on the trip. For trips with fewer than three recorded locations, the imputed departure time was set three minutes earlier than the original departure time to compensate for rMove™’s 3-to-5-minute ping interval. Note that some trips created by splitting loop trips may have three or fewer points but retain the departure time imputed before the loop trip was split, and thus may not follow this rule.
If the imputed departure time overlapped with the previous trip’s arrival time, the previous trip’s arrival time was used as the departure time instead. Regardless of the number of locations along a trip, if the imputed departure time was later than the initially reported departure time, the imputed departure time was set to the original departure time. User-added trips and long-distance passenger mode trips also kept the original departure time: user-added trips are not subject to late pickup conditions, and long-distance passenger mode trips are often plane trips, where all collected traces contain speed information from other modes and are thus less reliable (rMove™ cannot collect locations when a phone is in airplane mode).

Duration and speed are calculated based on the imputed departure time.
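
The rules above can be sketched in Python as follows. This is an illustrative reading of the description, not RSG's production code. The function and argument names are hypothetical, and two details are assumptions: that the imputed departure is back-calculated from the timestamp of the second recorded point, and that the original departure time is kept when no usable speed is available.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    r = 6_371_000.0  # mean Earth radius
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def impute_departure(points, original_depart, prev_arrival=None, exempt=False):
    """Impute a departure time for a trip with a possible late pickup.

    points: time-ordered list of (timestamp, lat, lon); points[0] is the origin.
    exempt: True for user-added trips and long-distance passenger mode trips.
    """
    if exempt:
        return original_depart

    if len(points) < 3:
        # Too few points to estimate speed: shift back three minutes.
        imputed = original_depart - timedelta(minutes=3)
    else:
        # Speeds (m/s) between consecutive points, excluding the origin point.
        speeds = []
        for (t0, la0, lo0), (t1, la1, lo1) in zip(points[1:], points[2:]):
            seconds = (t1 - t0).total_seconds()
            if seconds > 0:
                speeds.append(haversine_m(la0, lo0, la1, lo1) / seconds)
        typical_speed = median(speeds) if speeds else 0.0
        if typical_speed <= 0:
            return original_depart  # assumption: no usable speed, keep original
        first_leg_m = haversine_m(points[0][1], points[0][2], points[1][1], points[1][2])
        # Assumption: back-calculate from the second point's timestamp.
        imputed = points[1][0] - timedelta(seconds=first_leg_m / typical_speed)

    # Never move the departure later than originally reported.
    if imputed > original_depart:
        imputed = original_depart
    # Never overlap the previous trip's arrival.
    if prev_arrival is not None and imputed < prev_arrival:
        imputed = prev_arrival
    return imputed
```

Trip duration and speed would then be recomputed from the returned value, as noted above.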

Purpose

Respondents report the purpose of the trip destination in each trip survey. The origin purpose is derived from the destination purpose of the previous trip, except for the first trip in the travel period or where an rMove™ trip occurs after a trip with item non-response. For the first trip in the travel period, the origin purpose can be inferred from begin_day in the day table.

When purpose was not asked because an analyst split a user-reported trip during data cleaning (creating a new destination along a trip), purpose values are derived where possible based on proximity (within 150 meters) to estimated home, work, or school locations. If the location is not proximate to home, work, or school locations, the purpose is set to other.
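
A minimal sketch of this proximity rule, assuming straight-line (haversine) distance and, where more than one habitual location falls within 150 meters, selection of the nearest one. The function name, the location labels, and the tie-breaking choice are hypothetical and are not taken from RSG's cleaning scripts.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    r = 6_371_000.0  # mean Earth radius
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * r * asin(sqrt(a))


def derive_purpose(lat, lon, habitual_locations, threshold_m=150.0):
    """Return 'home', 'work', or 'school' if the point is within the threshold
    of that estimated location; otherwise return 'other'.

    habitual_locations: dict such as {"home": (lat, lon), "work": None, ...};
    a value of None means the person has no such location.
    """
    nearest_label, nearest_dist = None, None
    for label, coords in habitual_locations.items():
        if coords is None:
            continue
        dist = haversine_m(lat, lon, coords[0], coords[1])
        if dist <= threshold_m and (nearest_dist is None or dist < nearest_dist):
            nearest_label, nearest_dist = label, dist
    return nearest_label if nearest_label is not None else "other"
```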

The purpose category variables (o_purpose_category, d_purpose_category) contain aggregated purpose values based on the type of purpose at the origin/destination of each trip. Dataset users are welcome to perform their own recoding of the purpose categories as well.

Trip purposes have been imputed in cases where a purpose reported by the user is assumed to be inaccurate based on information about that person’s reported habitual locations and other trips (primarily to home, work, and school locations). The trip purpose imputation approach was applied to all rMove™ trips in person-days with at least 1 complete trip and no more than 10 incomplete trips. (Incomplete trips are trips for which the respondent did not answer the trip-specific survey questions about purpose, mode, etc. for the given trip.)

The approach was to apply various tests in logical sequence to trips for which the stated purpose is not consistent with the location type based on the reported habitual locations. In general terms, the tests were designed to:

Check the respondent’s reported destination purpose when it conflicts with the destination location type. (The details of the tests depend on the trip purpose, with different criteria used for change-mode trips, escort trips, linked transit trips, trips with home destinations but other reported purposes, etc.)
Identify cases where respondents swapped the order of two or more trips when reporting their details.
Identify cases where respondents may have omitted a trip and shifted remaining reported trip details by one trip when reporting the rest of their trips.
Fill in missing data by sampling destination purposes from other trips made to the same locations, either by the same respondent or by other respondents.

Mode type (`mode_type`)

The mode_type variable synthesizes mode_1 through mode_3 into a single, easier-to-use variable for analytical purposes (so that data users can avoid referencing every mode on a multimodal trip). Table 2.1 below shows the full crosswalk of which detailed modes correspond to which mode_types in the 2023 data. In the 2023 coding shown in Table 2.1, mode types are listed in priority order, so lower values of mode_type take priority over higher values in the derivation. For example, rail trips, with mode_type 1, are prioritized over walk trips, with mode_type 12. When transit trips were unlinked using the Google API during cleaning, the non-transit legs of the trip were recoded using Google’s suggested mode (most frequently walk or bike) and do not have a reported mode_1, mode_2, or mode_3.

Table 2.1: Mode Type Hierarchy (2023)

Detailed Mode Value	Detailed Mode Label	Mode Type Value	Mode Type Label
1	Northstar	1	Rail
2	Light rail (e.g., Blue Line, Green Line)	1	Rail
3	Other rail	1	Rail
4	School bus	2	School Bus
5	Bus rapid transit (e.g., A Line, C Line, Red Line)	3	Public Bus
6	Express/commuter bus	3	Public Bus
7	Local bus	3	Public Bus
8	Dial-A-Ride (e.g., Transit Link)	3	Public Bus
9	Metro Mobility	3	Public Bus
10	SouthWest Prime or MVTA Connect	3	Public Bus
11	Employer-provided shuttle/bus	4	Other Bus
12	University/college shuttle/bus	4	Other Bus
13	Other private shuttle/bus (e.g., a hotel's, an airport's)	4	Other Bus
14	Vanpool	4	Other Bus
15	Other bus	4	Other Bus
16	Intercity rail (e.g., Amtrak)	5	Long distance passenger mode
17	Intercity bus (e.g., Greyhound, Jefferson Lines)	5	Long distance passenger mode
18	Airplane/helicopter	5	Long distance passenger mode
19	Uber, Lyft, or other smartphone-app ride service	6	Smartphone ridehailing service
20	Regular taxi (e.g., Yellow Cab)	7	For-Hire Vehicle
21	Other hired car service (e.g., black car, limo)	7	For-Hire Vehicle
22	Household vehicle 1	8	Household Vehicle
23	Household vehicle 2	8	Household Vehicle
24	Household vehicle 3	8	Household Vehicle
25	Household vehicle 4	8	Household Vehicle
26	Household vehicle 5	8	Household Vehicle
27	Household vehicle 6	8	Household Vehicle
28	Household vehicle 7	8	Household Vehicle
29	Household vehicle 8	8	Household Vehicle
30	Other vehicle in household	8	Household Vehicle
31	Other motorcycle in household	9	Other Vehicle
32	Other motorcycle (not my household's)	9	Other Vehicle
33	Car from work	9	Other Vehicle
34	Friend/relative/colleague's car	9	Other Vehicle
35	Rental car	9	Other Vehicle
36	Carpool match (e.g., Waze Carpool)	9	Other Vehicle
37	Carshare service (e.g., Zipcar)	9	Other Vehicle
38	Peer-to-peer car rental (e.g., Turo)	9	Other Vehicle
39	Other vehicle (not my household's)	9	Other Vehicle
61	Electric vehicle carshare (e.g., Evie)	9	Other Vehicle
40	Electric bicycle (my household's)	10	Micromobility
41	Standard bicycle (my household's)	10	Micromobility
42	Borrowed bicycle (e.g., a friend's)	10	Micromobility
43	Bike-share - standard bicycle	10	Micromobility
44	Bike-share - electric bicycle	10	Micromobility
45	Other rented bicycle	10	Micromobility
46	Personal scooter or moped (not shared)	10	Micromobility
47	Scooter-share (e.g., Bird, Lime)	10	Micromobility
48	Moped-share (e.g., Scoot)	10	Micromobility
49	Segway	10	Micromobility
50	Other scooter or moped	10	Micromobility
51	Skateboard or rollerblade	10	Micromobility
52	Other boat (e.g., kayak)	11	Other
53	Vehicle ferry (took vehicle on board)	11	Other
54	Other public ferry or water taxi	11	Other
55	Golf cart	11	Other
56	Snowmobile	11	Other
57	ATV	11	Other
58	Medical transportation service	11	Other
59	Other	11	Other
60	Walk (or jog/wheelchair)	12	Walk
This mode type hierarchy table contains the values from the 2023 dataset; some names for mode types have changed slightly since the 2019 and 2021 surveys. For more information, consult the combined codebook.
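
As an illustration of how a single mode_type could be derived from mode_1 through mode_3, the sketch below assumes that the priority order matches the order in which Table 2.1 lists the mode types (rail first, walk last). The mapping is a small subset of the table, and the function name and the `lower_value_wins` switch (for codings numbered in the opposite direction) are hypothetical.

```python
# Illustrative subset of Table 2.1: detailed mode value -> mode type value.
MODE_TO_TYPE = {
    2: 1,    # Light rail -> Rail
    7: 3,    # Local bus -> Public Bus
    19: 6,   # Uber, Lyft, etc. -> Smartphone ridehailing service
    22: 8,   # Household vehicle 1 -> Household Vehicle
    41: 10,  # Standard bicycle -> Micromobility
    60: 12,  # Walk -> Walk
}


def derive_mode_type(modes, lower_value_wins=True):
    """Collapse up to three reported modes into one mode type.

    modes: iterable of detailed mode values (mode_1..mode_3); None or
    unmapped entries are skipped. Returns None if nothing is usable.
    """
    types = [MODE_TO_TYPE[m] for m in modes if m in MODE_TO_TYPE]
    if not types:
        return None
    # Assumption: the table order is the hierarchy, so the smallest value wins.
    return min(types) if lower_value_wins else max(types)
```

Under these assumptions, a trip reported as walk, local bus, walk would be classified as Public Bus, because the bus leg outranks the walk legs.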

iOS Trip Trace Irregularities (Wave 3 2023 Data Only)

The release of iOS 16.4 by Apple on March 27, 2023, significantly changed background location tracking, affecting apps such as rMove™ that rely on collecting location information. Consequently, iPhone users with iOS 16.4 or later experienced irregular trip traces within the rMove™ app, which reduced data accuracy for Spring 2023.

To address this issue, RSG swiftly updated the rMove™ app and monitoring scripts to mitigate inconsistencies in future data collection. Despite these efforts, the 2023 dataset remained affected. To manage the impact on the dataset and downstream processes, RSG developed a series of criteria to identify suspect trips and flag individuals or households accordingly. Additionally, adjustments in weighting and dataset delivery were made to ensure maximum data utility.

RSG identified suspect trip trace records and delivered the dataset with two sets of weights: one for use with the full dataset, and one for use when applying these strict criteria, so that trip analysis metrics can exclude potential trip trace irregularities.

Combined Dataset (2019, 2021 and 2023)

To facilitate analyses across waves of the survey, RSG developed a cross-wave combined dataset and codebook.

Combined Codebook

RSG typically delivers data in its raw form, with the numeric codes that correspond to survey entries instead of the text seen by participants. The codebook allows data users to translate survey results into a human-readable format, and comprises two parts:

A variable list, which includes attributes at the level of individual survey questions; and
A value labels table, which corresponds to attributes of survey responses.

A combination of manual and scripted processes was used to create a combined codebook:

Variable crosswalk (manual process in Excel). First, a variable crosswalk table was constructed by aligning variable names and survey questions across years. Where variable names differed, but survey question meaning stayed the same, a unified variable name was chosen. Logic, variable descriptions, location of the variable in the database, and other attributes were inspected manually and with a combination of processes in an Excel workbook (i.e., using VLOOKUP processes and other formulas).

Example: The variable fuel in 2019 is renamed fuel_type in 2023. The unified variable name becomes fuel_type.

Value label crosswalk (manual process in Excel). Next, a value label crosswalk was constructed in a similar manner. A crosswalk of numeric values was created for cases where value labels differed for the same numeric value. These unified values and value labels were then used to construct upcoded values and value labels, which consolidate disparate categories with similar meanings.

Example: Plug-in hybrid (PHEV) and Hybrid (HEV) vehicle fuel types, used in 2021 and 2023 data, are upcoded to Hybrid (used in 2019 data).

Table 2.2: Combined Codebook Example: Fuel Type

Values					Labels
2019	2021	2023	Unified	Upcoded	2019	2021	2023	Unified	Upcoded
-9998	NA	NA	-9998	995	Missing: Non-response			Missing	Missing
1	1	1	1	1	Gas	Gas	Gas	Gas	Gas
NA	2	2	2	2		Hybrid (HEV)	Hybrid (HEV)	Hybrid (HEV)	Hybrid
3	3	3	3	2	Hybrid	Plug-in hybrid (PHEV)	Plug-in hybrid (PHEV)	Plug-in hybrid (PHEV)	Hybrid
4	4	4	4	3	Electric	Electric (EV)	Electric (EV)	Electric (EV)	Electric (EV)
2	5	5	5	4	Diesel	Diesel	Diesel	Diesel	Diesel
NA	6	6	6	5		Flex fuel (FFV)	Flex fuel (FFV)	Flex fuel (FFV)	Other
997	7	7	7	5	Other	Other (e.g., natural gas, bio-diesel)	Other (e.g., natural gas, bio-diesel)	Other (e.g., natural gas, bio-diesel)	Other
995	995	NA	995	995	Missing: Skip logic	Missing		Missing	Missing

Combined Dataset

A scripted process, written in R and relying on the combined codebook, was used to create a single dataset containing all three waves of survey data. The scripted process:

Renamed dataset columns (variables) from their year-specific names to the unified names chosen in the combined codebook.
In each column (for each variable), replaced year-specific numeric response codes with their unified response codes.
Repeated step 2 with upcoded response codes, to create an upcoded dataset.
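
The three steps above can be sketched as follows. RSG's process was written in R; this Python version is only an illustration of the logic. The crosswalk dictionaries reproduce the fuel type example in Table 2.2, while the structure of the dictionaries and the function names are hypothetical.

```python
# Year-specific column name -> unified column name (from the variable crosswalk).
NAME_CROSSWALK = {2019: {"fuel": "fuel_type"}}

# (unified column, year) -> {year-specific code: unified code}, per Table 2.2.
UNIFIED_VALUES = {
    ("fuel_type", 2019): {-9998: -9998, 1: 1, 3: 3, 4: 4, 2: 5, 997: 7, 995: 995},
}

# unified column -> {unified code: upcoded code}, per Table 2.2.
UPCODED_VALUES = {
    "fuel_type": {-9998: 995, 1: 1, 2: 2, 3: 2, 4: 3, 5: 4, 6: 5, 7: 5, 995: 995},
}


def unify(rows, year):
    """Steps 1 and 2: rename columns, then recode values to unified codes."""
    names = NAME_CROSSWALK.get(year, {})
    unified_rows = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            unified_col = names.get(col, col)
            recode = UNIFIED_VALUES.get((unified_col, year), {})
            # Codes without a crosswalk entry are passed through unchanged.
            new_row[unified_col] = recode.get(value, value)
        unified_rows.append(new_row)
    return unified_rows


def upcode(unified_rows):
    """Step 3: recode unified codes to upcoded codes."""
    return [
        {col: UPCODED_VALUES.get(col, {}).get(value, value) for col, value in row.items()}
        for row in unified_rows
    ]


vehicles_2019 = [{"fuel": 2}, {"fuel": 997}]  # Diesel and Other in 2019 coding
unified = unify(vehicles_2019, 2019)
upcoded = upcode(unified)
```

Keeping the unified and upcoded recodes as separate passes mirrors the delivery of both a unified and an upcoded dataset.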

Special Output: Trip Purpose Table

The trip_purpose table was derived from the upcoded and unified trip tables. Its purpose is to aid data analysis of overall trip purposes. This table was developed because the origin and destination purpose categories in the trip table can contain non-intuitive classifications. For example, a summary of destination purposes will have many trips home, but the overall trip purpose for a trip home might actually correspond to the non-home trip end (work, school, etc.).

Removing trips with a destination of home from the overall analysis is one option, but this can lead to some place types being missing from the final dataset when the trip roster for a person’s day is incomplete. For example, if a person’s trip diary for a day consists of a trip from a friend’s house to their home, the place type friend’s house will be missing from the final summary of trip purposes.

To account for all place types – both origin and destination – in the final trip purpose summaries, the following steps were used to create the trip purpose table:

Transit trips that had been unlinked into access, transit, and egress legs were re-linked by consolidating the multiple legs of each transit trip into a single record. This removes change mode trips from the table, except for long-distance trips. The trip weight for each linked trip was set to the maximum trip weight of its component unlinked trips.
Trips were placed in two categories: home-based (having one trip end at home) and non-home-based trips.
Home-based trips’ purposes were classified as the non-home end. The weight for this trip purpose record is equal to the original trip weight.
Non-home-based trips were split into two records for each trip: one for the origin end, and a second for the destination end. The weight for each record was set to half of the original trip weight.
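
The classification and weighting steps can be sketched as below. The sketch assumes the trips have already been re-linked, uses shortened purpose labels, and keeps home-to-home loop trips as a single home record; the function names are hypothetical. The example weights are those of the travel day shown in Table 2.3.

```python
HOME = "Went home"


def trip_purpose_records(linked_trips):
    """Expand linked trips into trip purpose records.

    Home-based trips yield one record for the non-home end at full weight;
    non-home-based trips yield one record per trip end at half weight.
    """
    records = []
    for trip in linked_trips:
        o, d, w = trip["o_purpose"], trip["d_purpose"], trip["trip_weight"]
        if o == HOME and d == HOME:
            # Assumption: loop trips from home to home stay as one home record.
            records.append({"purpose": HOME, "trip_purpose_weight": w})
        elif o == HOME or d == HOME:
            non_home_end = d if o == HOME else o
            records.append({"purpose": non_home_end, "trip_purpose_weight": w})
        else:
            records.append({"purpose": o, "trip_purpose_weight": w / 2})
            records.append({"purpose": d, "trip_purpose_weight": w / 2})
    return records


def purpose_shares(records):
    """Weighted share of each purpose across all trip purpose records."""
    totals = {}
    for rec in records:
        totals[rec["purpose"]] = totals.get(rec["purpose"], 0.0) + rec["trip_purpose_weight"]
    grand_total = sum(totals.values())
    return {purpose: weight / grand_total for purpose, weight in totals.items()}


example_day = [
    {"o_purpose": HOME, "d_purpose": "Work", "trip_weight": 233.4265},
    {"o_purpose": "Work", "d_purpose": "Exercise", "trip_weight": 363.7046},
    {"o_purpose": "Exercise", "d_purpose": "School escort", "trip_weight": 363.7046},
    {"o_purpose": "School escort", "d_purpose": HOME, "trip_weight": 363.7046},
]
records = trip_purpose_records(example_day)
shares = purpose_shares(records)
```

Because each non-home-based trip contributes two half-weight records, the total weight in the trip purpose table equals the total weight of the linked trips.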

The tables below show a hypothetical example of this process for the travel diary corresponding to day_id 199885710201.

In the trip table, there are four records that correspond to this day. The person left home, went to work, went on an exercise trip or to the gym, picked up someone from school, and finally returned home.

Table 2.3: Trips on Day 199885710201

trip_id	o_purpose	d_purpose	trip_weight
1998857102001	Went home	Primary workplace	233.4265
1998857102002	Primary workplace	Exercise or recreation (e.g., gym, jog, bike, walk dog)	363.7046
1998857102003	Exercise or recreation (e.g., gym, jog, bike, walk dog)	Pick-up/drop-off to/from K-12 school or college	363.7046
1998857102004	Pick-up/drop-off to/from K-12 school or college	Went home	363.7046

An analysis of trip purpose by destination place types (d_purpose) would yield an overall trip purpose share of 17.6% of trips to work, 27.5% of trips to exercise, 27.5% of trips to escort others to school, and 27.5% of trips to home:

Table 2.4: Trip Destination Purpose Share, Day 199885710201

	d_purpose	trip_weight	purpose_share
	Primary workplace	233.4	17.6%
	Exercise or recreation (e.g., gym, jog, bike, walk dog)	363.7	27.5%
	Pick-up/drop-off to/from K-12 school or college	363.7	27.5%
	Went home	363.7	27.5%
Total	—	1,324.5	100.0%

In the trip purpose table for the same day ID, there are six rows instead of four: the trip from work to exercise and the trip from exercise to pick up someone from school have each been expanded to two rows so that the trip weight can be distributed across both trip ends. For the home-based trips, the trip purpose has been assigned to the non-home end of the trip (work and escort, respectively).

Table 2.5: Trip Purpose Records for Day 199885710201

trip_purpose_id	purpose	trip_purpose_weight
181704	Primary workplace	233.4265
181705	Pick-up/drop-off to/from K-12 school or college	363.7046
480995	Primary workplace	181.8523
480996	Exercise or recreation (e.g., gym, jog, bike, walk dog)	181.8523
732042	Exercise or recreation (e.g., gym, jog, bike, walk dog)	181.8523
732043	Pick-up/drop-off to/from K-12 school or college	181.8523

Summarizing the trip purpose table yields 31.4% of trips for work (i.e., just under one-third of trips are work-related), 27.5% of trips for exercise or recreation, and 41.2% of trips to pick up others from school.

Table 2.6: Trip Purpose Share from Trip Purpose Table, Day 199885710201

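purpose	trip_purpose_weight	purpose_share
Primary workplace	233.4	17.6%
Primary workplace	181.9	13.7%
Subtotal, Primary workplace	415.3	31.4%
Pick-up/drop-off to/from K-12 school or college	363.7	27.5%
Pick-up/drop-off to/from K-12 school or college	181.9	13.7%
Subtotal, Pick-up/drop-off to/from K-12 school or college	545.6	41.2%
Exercise or recreation (e.g., gym, jog, bike, walk dog)	181.9	13.7%
Exercise or recreation (e.g., gym, jog, bike, walk dog)	181.9	13.7%
Subtotal, Exercise or recreation (e.g., gym, jog, bike, walk dog)	363.7	27.5%
Total	1,324.5	100.0%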

Alternatively, the data user could remove trips home and calculate purpose share from the destination purpose, using only the subset of trips that do not end at home. For this travel day, that calculation yields a greater share of trips for exercise (37.9% versus 27.5%) and smaller shares for work (24.3% versus 31.4%) and pick-up (37.9% versus 41.2%) relative to the calculations from the trip purpose table.

Table 2.7: Trip Purpose Share from Trip Table, Non-Home Trips, Day 199885710201

	d_purpose	trip_weight	purpose_share
	Primary workplace	233.4	24.3%
	Exercise or recreation (e.g., gym, jog, bike, walk dog)	363.7	37.9%
	Pick-up/drop-off to/from K-12 school or college	363.7	37.9%
Total	—	960.8	100.0%

Table 2.8 below shows how using destination purpose in the trip table compares to using the overall trip purpose in the trip purpose table. The vast majority of trips home have been re-categorized (a small number remain, where both origin and destination were home, i.e., loop trips without an intermediate stop point). The estimate of total number of trips differs across the two tables as well, because the trip purpose table has consolidated change mode trips into a broader linked trip purpose.

Table 2.8: Trip Purpose Share from Trip and Trip Purpose Table (2023 only)

	purpose/d_purpose	Trip Table, Selected Categories		Trip Purpose Table, All Categories
	purpose/d_purpose	Share	Total	Share	Total
	Home	32.4%	4,339,661	0.1%	10,526
	Shopping	10.8%	1,447,777	16.3%	1,927,275
	Social/Recreation	9.7%	1,305,811	16.0%	1,889,502
	Escort	9.6%	1,285,127	15.6%	1,847,927
	Work	8.2%	1,096,092	13.4%	1,587,782
	Meal	6.9%	929,880	10.4%	1,229,728
	Errand	5.5%	731,313	9.9%	1,167,337
	Work related	5.2%	694,438	7.3%	867,563
	School	3.5%	468,666	5.7%	673,190
	Change mode	3.4%	451,611	NA	NA
	Overnight	2.5%	335,107	3.4%	398,291
	Other	1.9%	258,319	1.2%	146,963
	School related	0.5%	65,032	0.7%	82,248
Total	—	100.0%	13,408,835.3	100.0%	11,828,331.3

Dataset Composition

The final unweighted dataset includes seven distinct data tables. These tables include all user-input survey variables, certain survey metadata (e.g., survey completion mode), and variables derived to support data analysis.

Table 2.9: Dataset Composition, Combined All Years (2019-2023)

Table	Rows
Household	19,170 complete households
Person	38,691 people
Vehicle	30,239 vehicles
Day	157,947 days
Trip	623,926 unlinked trips
Location	12,345,002 points
Trip Purpose	343,372 linked, single-ended trips
	251,366 unlinked, two-ended trips
